arXiv:2512.07132v1 Announce Type: cross 
Abstract: Specialized visual tools can augment large language models or vision language models with expert knowledge (e.g., grounding, spatial reasoning, medical knowledge), but knowing which tools to call (and when to call them) can be challenging. We introduce DART, a multi-agent framework that uses disagreements between multiple debating visual agents to identify useful visual tools (e.g., object detection, OCR, spatial reasoning) that can resolve inter-agent disagreement. These tools allow for fruitful multi-agent discussion by introducing new information, and by providing tool-aligned agreement scores that highlight agents in agreement with expert tools, thereby facilitating discussion. An aggregator agent, given the agent outputs and tool information, selects the best answer. We test DART on four diverse benchmarks and show that our approach improves over multi-agent debate as well as over single-agent tool-calling frameworks, beating the next-strongest baseline (multi-agent debate with a judge model) by 3.4% and 2.4% on A-OKVQA and MMMU, respectively. We also find that DART adapts well to new tools in applied domains, with a 1.3% improvement on the M3D medical dataset over other strong tool-calling, single-agent, and multi-agent baselines. Additionally, we measure text overlap across rounds to highlight the rich discussion in DART compared to existing multi-agent methods. Finally, we study the tool call distribution, finding that diverse tools are reliably used to help resolve disagreement.
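
To make the idea of tool-aligned agreement scores concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names, the substring-based consistency check, and the "highest score wins" aggregator are all assumptions made for illustration, since the abstract does not specify how agreement is computed or how the aggregator agent decides.

```python
def has_disagreement(answers: dict[str, str]) -> bool:
    """True when the agents do not all give the same normalized answer."""
    return len({a.strip().lower() for a in answers.values()}) > 1


def agreement_scores(answers: dict[str, str],
                     tool_findings: dict[str, str]) -> dict[str, float]:
    """Fraction of tool findings each agent's answer is consistent with.

    Assumption: an answer is 'consistent' with a tool when the tool's
    finding appears as a case-insensitive substring of the answer.
    """
    scores = {}
    for agent, answer in answers.items():
        text = answer.lower()
        hits = sum(1 for finding in tool_findings.values()
                   if finding.lower() in text)
        scores[agent] = hits / len(tool_findings) if tool_findings else 0.0
    return scores


def aggregate(answers: dict[str, str], scores: dict[str, float]) -> str:
    """Stand-in for the aggregator agent: return the top-scoring agent's answer."""
    best_agent = max(scores, key=scores.get)
    return answers[best_agent]


# Hypothetical example: two agents disagree on the text of a sign,
# and a (hypothetical) OCR tool is recruited to resolve it.
answers = {"agent_a": "The sign says STOP", "agent_b": "The sign says SHOP"}
tool_findings = {"ocr": "STOP"}

if has_disagreement(answers):
    scores = agreement_scores(answers, tool_findings)
    print(scores)                      # {'agent_a': 1.0, 'agent_b': 0.0}
    print(aggregate(answers, scores))  # The sign says STOP
```

In the framework described above, the aggregator is itself an agent that sees both the agent outputs and the tool information; the `max` call here only stands in for that decision.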

DART is a newly introduced multi-agent framework that utilizes disagreements among visual agents to identify and recruit specialized visual tools for multimodal reasoning tasks. This approach aims to enhance the performance of large language models and vision-language models by resolving inter-agent disagreements through expert knowledge tools like object detection and spatial reasoning.

DART: Leveraging Multi-Agent Disagreement for Tool Recruitment in Multimodal Reasoning

arXiv:2512.07141v1 Announce Type: new 
Abstract: As multimodal reasoning improves the overall capabilities of Large Vision Language Models (LVLMs), recent studies have begun to explore safety-oriented reasoning, aiming to enhance safety awareness by analyzing potential safety risks during the reasoning process before generating the final response. Although such approaches improve safety awareness and interpretability, this single-pass think-then-answer paradigm remains vulnerable to contextual or visual jailbreak attacks. This reveals a critical flaw: single-pass reasoning may overlook explicit harmful content in its own output. Our key insight is to exploit this wasted signal through reflection, which can effectively leverage the malicious content revealed in the first-pass reasoning to enable genuine self-correction and prevent unsafe generations. Motivated by this, we propose Think-Reflect-Revise (TRR), a three-stage training framework designed to enhance the safety alignment of LVLMs through policy-guided self-reflection. We first build a Reflective Safety Reasoning (ReSafe) dataset with 5,000 examples that follow a think-reflect-revise process. We then fine-tune the target model using the ReSafe dataset to initialize reflective behavior, and finally reinforce policy-guided reflection through reinforcement learning. Experimental results show that TRR substantially improves the safety performance of LVLMs across both safety-awareness benchmarks and jailbreak attack evaluations, increasing the overall safe response rate from 42.8% to 87.7% on Qwen2.5-VL-7B, while preserving stable performance on general benchmarks such as MMMU and MMStar. The project page is available at https://think-reflect-revise.github.io/.
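
The think-reflect-revise process can be pictured as a three-step control flow. The Python sketch below shows only that flow, using hypothetical stub callables (`think`, `reflect`, `revise`) in place of a model; it does not reproduce the paper's training stages (ReSafe fine-tuning and reinforcement learning), and the keyword check used for reflection is a toy assumption.

```python
from typing import Callable, Optional


def think_reflect_revise(
    prompt: str,
    think: Callable[[str], str],
    reflect: Callable[[str, str], Optional[str]],
    revise: Callable[[str, str, str], str],
) -> str:
    """Run a first-pass draft, inspect it, and revise only if a problem is found."""
    draft = think(prompt)               # first-pass reasoning and answer
    critique = reflect(prompt, draft)   # None means the draft looks safe
    if critique is None:
        return draft
    return revise(prompt, draft, critique)


# Toy stubs (assumptions for illustration only; a real system would call a model).
def toy_think(prompt: str) -> str:
    return f"Draft answer to: {prompt}"


def toy_reflect(prompt: str, draft: str) -> Optional[str]:
    # Flag the draft if it contains a placeholder marker for harmful content.
    return "draft contains harmful content" if "HARMFUL" in draft else None


def toy_revise(prompt: str, draft: str, critique: str) -> str:
    return f"I can't help with that ({critique})."


print(think_reflect_revise("describe the image", toy_think, toy_reflect, toy_revise))
print(think_reflect_revise("HARMFUL request", toy_think, toy_reflect, toy_revise))
```

The point of the structure is that reflection reads the first-pass output itself, so harmful content exposed in the draft becomes a signal for revision rather than being emitted.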

A new framework called Think-Reflect-Revise (TRR) has been proposed to enhance the safety alignment of Large Vision Language Models (LVLMs) by incorporating a three-stage training process that allows for self-correction during reasoning. This approach addresses vulnerabilities in single-pass reasoning that may overlook harmful content in outputs.
