Closing the Gap: Data-Centric Fine-Tuning of Vision Language Models for the Standardized Exam Questions

arXiv — cs.CV · Tuesday, December 2, 2025
  • A recent study highlights the potential of data-centric fine-tuning for enhancing vision language models (VLMs) on standardized exam questions, reporting 78.6% accuracy with the Qwen-2.5VL-32B model. The approach uses a comprehensive multimodal dataset of 161.4 million tokens, combining textbook question-solution pairs with contextual materials, to improve reasoning capabilities.
  • This development is significant as it demonstrates that high-quality supervised fine-tuning can compete with proprietary methods, potentially democratizing access to advanced AI capabilities in educational assessments.
  • The findings also raise questions about the reliability of existing VLMs, as other studies indicate that models like Gemini 2.0 Flash may struggle with stability under minor input variations, suggesting a need for ongoing research to ensure robustness in AI applications.
— via World Pulse Now AI Editorial System
