Metis-SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start

arXiv — cs.LG · Thursday, November 20, 2025 at 5:00:00 AM
  • The Metis-SPECS framework decouples multimodal learning through a self-distilled, preference-based cold start.
  • This development is significant because it points toward more effective training methodologies for multimodal learning systems, with the potential to improve their performance across application domains.
— via World Pulse Now AI Editorial System


Continue Reading
ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better
Positive · Artificial Intelligence
ChainV has been introduced as a framework that enhances multimodal reasoning by dynamically integrating visual hints into the reasoning process, addressing redundancy in lengthy reasoning chains. The framework selects visual patches based on previous reasoning steps and refines them by identifying the most representative atomic visual hints, improving the efficiency of multimodal reasoning models; a rough sketch of the idea follows below.
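The sketch below is a minimal illustration of the selection-then-refinement idea described above, not the authors' implementation: it shortlists patches whose embeddings are most similar to the latest reasoning step, then keeps the single most representative candidate as the "atomic" hint. The names (select_visual_hint, patch_embeds, step_embed) and the cosine-similarity and centroid heuristics are assumptions for illustration.

```python
# Illustrative sketch of ChainV-style hint selection (assumed, not the paper's code).
import numpy as np

def select_visual_hint(patch_embeds: np.ndarray, step_embed: np.ndarray, top_k: int = 4) -> int:
    """Pick one 'atomic' visual hint: shortlist patches relevant to the previous
    reasoning step, then keep the most representative candidate."""
    # 1) Relevance: cosine similarity between each patch and the reasoning step.
    patches = patch_embeds / np.linalg.norm(patch_embeds, axis=1, keepdims=True)
    step = step_embed / np.linalg.norm(step_embed)
    relevance = patches @ step
    shortlist = np.argsort(relevance)[-top_k:]  # top-k candidate patches

    # 2) Representativeness: among candidates, choose the patch closest to the
    #    candidates' centroid and treat it as the single atomic hint.
    cand = patches[shortlist]
    centroid = cand.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    return int(shortlist[int(np.argmax(cand @ centroid))])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patch_embeds = rng.normal(size=(196, 768))  # e.g., 14x14 ViT patch embeddings
    step_embed = rng.normal(size=768)           # embedding of the latest reasoning step
    print("selected patch index:", select_visual_hint(patch_embeds, step_embed))
```

Returning a single refined hint rather than the full top-k list mirrors the stated goal of shortening reasoning chains by reducing redundant visual evidence.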
EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards
Positive · Artificial Intelligence
EvoLMM, a self-evolving framework for large multimodal models, has been introduced to enhance reasoning capabilities without relying on human-annotated data. This framework consists of two cooperative agents: a Proposer that generates diverse questions and a Solver that answers them through a continuous self-rewarding process. This innovation aims to improve the autonomy and scalability of multimodal models.
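As a rough illustration of the Proposer/Solver loop described above, the sketch below uses agreement among sampled answers as a stand-in for the continuous self-reward, so no human annotations are needed. This is not the authors' implementation; the interfaces (propose_question, answer_question) and the consistency-based reward are assumptions for illustration.

```python
# Illustrative sketch of an EvoLMM-style self-rewarding loop (assumed, not the paper's code).
import random
from collections import Counter
from typing import Callable, List, Tuple

def self_consistency_reward(answers: List[str]) -> float:
    """Continuous reward in [0, 1]: fraction of sampled answers that agree
    with the majority answer."""
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)

def evolve_step(propose_question: Callable[[], str],
                answer_question: Callable[[str], str],
                n_samples: int = 5) -> Tuple[str, str, float]:
    """One self-evolution step: the Proposer writes a question, the Solver samples
    several answers, and the agreement among answers becomes the training signal."""
    question = propose_question()
    answers = [answer_question(question) for _ in range(n_samples)]
    reward = self_consistency_reward(answers)
    majority = Counter(a.strip().lower() for a in answers).most_common(1)[0][0]
    return question, majority, reward

if __name__ == "__main__":
    # Toy stand-ins for the Proposer and Solver models.
    propose = lambda: "How many objects are in the image?"
    solve = lambda q: random.choice(["three", "three", "four"])
    q, a, r = evolve_step(propose, solve)
    print(f"Q: {q}\nA: {a}\nreward: {r:.2f}")
```

Because the reward comes from the Solver's own agreement rather than labels, the loop can in principle scale with unlabeled multimodal data, which matches the autonomy and scalability goals stated in the summary.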