OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe

arXiv — cs.CL · Monday, December 8, 2025 at 5:00:00 AM
  • OpenMMReasoner has been introduced as a training framework aimed at strengthening multimodal reasoning in AI models. It uses a two-stage recipe, supervised fine-tuning followed by reinforcement learning, drawing on a substantial curated dataset to improve reasoning across domains (a minimal sketch of this two-stage recipe appears after the summary below).
  • The development of OpenMMReasoner is significant as it addresses the current limitations in multimodal reasoning, particularly the lack of transparent data curation and training strategies. By providing a structured approach, it aims to facilitate scalable research and development in AI.
  • This advancement reflects a broader trend in AI research, where the focus is shifting towards creating more efficient training methods and datasets. The integration of frameworks like OpenMMReasoner, alongside other innovative models, highlights the ongoing efforts to tackle challenges in visual and video reasoning, ultimately pushing the boundaries of AI capabilities.
— via World Pulse Now AI Editorial System
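For readers who want a concrete picture of the two-stage recipe the summary describes, the sketch below shows a toy supervised fine-tuning pass followed by a REINFORCE-style reward-driven pass. The model, data, and reward here are placeholder assumptions for illustration only; they are not OpenMMReasoner's actual architecture, dataset, or RL algorithm.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a multimodal reasoning model: maps a fused
# image+text feature vector to answer logits. Illustrative only.
class ToyReasoner(nn.Module):
    def __init__(self, feat_dim=64, num_answers=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU())
        self.head = nn.Linear(128, num_answers)

    def forward(self, feats):
        return self.head(self.backbone(feats))

def sft_stage(model, dataset, epochs=3, lr=1e-3):
    """Stage 1: supervised fine-tuning on curated (features, answer) pairs."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for feats, answer in dataset:
            loss = F.cross_entropy(model(feats), answer)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

def rl_stage(model, sample_batch, reward_fn, steps=100, lr=1e-4):
    """Stage 2: REINFORCE-style updates with a verifiable reward.
    (The paper's RL algorithm may differ; this is only a sketch.)"""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(steps):
        feats, gold = sample_batch()                # sample a batch of tasks
        dist = torch.distributions.Categorical(logits=model(feats))
        action = dist.sample()                      # model's sampled answer
        reward = reward_fn(action, gold)            # e.g. 1.0 if correct else 0.0
        loss = -(dist.log_prob(action) * reward).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToyReasoner()
    # Synthetic SFT data: random "fused" features with random answer labels.
    sft_data = [(torch.randn(8, 64), torch.randint(0, 10, (8,))) for _ in range(20)]
    model = sft_stage(model, sft_data)

    def sample_batch():
        return torch.randn(8, 64), torch.randint(0, 10, (8,))

    def exact_match_reward(pred, gold):
        return (pred == gold).float()

    model = rl_stage(model, sample_batch, exact_match_reward)
```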


Continue Reading
CoT4Det: A Chain-of-Thought Framework for Perception-Oriented Vision-Language Tasks
Positive · Artificial Intelligence
CoT4Det, a Chain-of-Thought framework, aims to improve the performance of Large Vision-Language Models (LVLMs) on perception-oriented tasks such as object detection and semantic segmentation, where LVLMs have previously lagged behind task-specific models. The framework reformulates these tasks into three interpretable steps: classification, counting, and grounding.
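As an illustration of how such a three-step decomposition might be wired together, the sketch below chains placeholder classify, count, and ground functions for a detection-style query. All function names, signatures, and outputs are assumptions made for exposition; they are not CoT4Det's actual interface or predictions.

```python
from dataclasses import dataclass

@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float

def classify(image) -> list[str]:
    """Step 1: name the object categories present in the image (placeholder)."""
    return ["car", "person"]

def count(image, category: str) -> int:
    """Step 2: count instances of one category (placeholder)."""
    return {"car": 2, "person": 1}.get(category, 0)

def ground(image, category: str, n: int) -> list[Box]:
    """Step 3: localize each counted instance with a bounding box (placeholder)."""
    return [Box(0.1 * i, 0.1, 0.1 * i + 0.2, 0.5) for i in range(n)]

def chain_of_thought_detect(image):
    """Run the classify -> count -> ground chain and collect detections."""
    detections = {}
    for category in classify(image):
        n = count(image, category)
        detections[category] = ground(image, category, n)
    return detections

if __name__ == "__main__":
    print(chain_of_thought_detect(image=None))
```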