ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better

arXiv — cs.CV · Monday, November 24, 2025 at 5:00:00 AM
  • ChainV is a framework that shortens multimodal reasoning by dynamically integrating visual hints into the reasoning process, addressing the redundancy of lengthy reasoning chains. At each step it selects visual patches guided by the previous reasoning step and then refines that selection down to the most representative atomic visual hint, improving the efficiency of reasoning models (a minimal sketch of such a selection step follows below).
  • This matters because more targeted visual grounding is a step forward for multimodal reasoning models, potentially leading to more efficient AI systems that better understand and process complex information involving both text and visuals.
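
The sketch below is only an illustration of how such a selection step could look, not ChainV's actual implementation: the function name, feature shapes, cosine-similarity relevance score, and centroid-based notion of "most representative" are all assumptions made for the example.

```python
import numpy as np

def select_atomic_hint(patch_feats, step_feat, top_k=8):
    """Illustrative sketch (assumed interface, not ChainV's code): pick the
    visual patches most relevant to the previous reasoning step, then reduce
    them to a single representative "atomic" visual hint.

    patch_feats: (num_patches, dim) visual patch embeddings
    step_feat:   (dim,) embedding of the latest reasoning step
    """
    # Relevance of each patch to the previous reasoning step (cosine similarity).
    patch_norm = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    step_norm = step_feat / np.linalg.norm(step_feat)
    relevance = patch_norm @ step_norm

    # Keep the top-k most relevant patches as candidate visual hints.
    candidates = np.argsort(relevance)[-top_k:]

    # Refine: choose the candidate closest to the candidates' centroid and
    # treat it as the most representative atomic visual hint.
    centroid = patch_feats[candidates].mean(axis=0)
    dists = np.linalg.norm(patch_feats[candidates] - centroid, axis=1)
    atomic = candidates[int(np.argmin(dists))]
    return atomic, candidates
```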
— via World Pulse Now AI Editorial System


Continue Reading
Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models
Positive · Artificial Intelligence
Athena-PRM has been introduced as a multimodal process reward model that efficiently assigns a reward score to each step of a complex reasoning chain. It addresses the shortcomings of traditional automated labeling, which often yields noisy labels at high computational cost, by using prediction consistency between weak and strong completers to generate reliable process labels.
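
To make the consistency idea concrete, here is a minimal sketch of one way step labels could be derived from weak/strong completer agreement. The callables, sampling budget, threshold, and discard rule are assumptions for illustration only, not Athena-PRM's published procedure.

```python
def label_step(prefix, weak_complete, strong_complete, gold_answer,
               n_samples=8, threshold=0.5):
    """Illustrative sketch: label one reasoning step by checking whether a
    weak and a strong completer agree on its quality.

    prefix: the reasoning steps up to and including the step being labeled
    weak_complete / strong_complete: callables (assumed interfaces) that
        finish the solution from the prefix and return a final answer
    """
    def success_rate(completer):
        hits = sum(completer(prefix) == gold_answer for _ in range(n_samples))
        return hits / n_samples

    weak_ok = success_rate(weak_complete) >= threshold
    strong_ok = success_rate(strong_complete) >= threshold

    # Keep the label only when both completers agree; otherwise treat it as
    # too noisy to use for training the process reward model.
    if weak_ok and strong_ok:
        return 1          # step judged correct
    if not weak_ok and not strong_ok:
        return 0          # step judged incorrect
    return None           # disagreement: discard as unreliable
```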
EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards
Positive · Artificial Intelligence
EvoLMM, a self-evolving framework for large multimodal models, has been introduced to enhance reasoning capabilities without relying on human-annotated data. This framework consists of two cooperative agents: a Proposer that generates diverse questions and a Solver that answers them through a continuous self-rewarding process. This innovation aims to improve the autonomy and scalability of multimodal models.
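
As a rough illustration of one Proposer/Solver round with a continuous self-reward, the sketch below uses answer consistency across sampled solutions as the reward signal. The interfaces and the consensus-based reward are assumptions for the example, not EvoLMM's actual training objective.

```python
def self_evolve_round(proposer, solver, image, n_answers=4):
    """Illustrative sketch of a Proposer/Solver self-rewarding round
    (assumed interfaces, not the EvoLMM implementation).

    proposer(image) -> a question about the image
    solver(image, question) -> one sampled answer string
    """
    question = proposer(image)

    # Sample several answers and use their agreement as a continuous reward:
    # high consensus suggests the question is answerable and the solver is
    # consistent, without any human-annotated labels.
    answers = [solver(image, question) for _ in range(n_answers)]
    majority = max(set(answers), key=answers.count)
    reward = answers.count(majority) / n_answers   # in (0, 1]

    return question, majority, reward
```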