arXiv:2510.13795v3
Abstract: Fully open multimodal large language models (MLLMs) currently lag behind proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, the data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B-parameter model, on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including: the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.

A new dataset, Honey-Data-15M, containing approximately 15 million QA pairs, has been introduced to improve the supervised fine-tuning of fully open multimodal large language models (MLLMs). Together with the HoneyPipe curation pipeline and its underlying DataStudio framework, it aims to close the data-quality gap that holds open MLLMs back relative to proprietary models. Bee-8B, an 8B model trained on this dataset, sets a new state of the art among fully open MLLMs and is competitive with recent semi-open models such as InternVL3.5-8B.

Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
