arXiv:2505.16854v3 Announce Type: replace-cross 
Abstract: Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human-like thinking process, in which people skip reasoning for easy questions but think carefully when needed, we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective 'thought dropout' operation, where reasoning traces are randomly replaced with empty thoughts. This introduces a think-or-not format that serves as a cold start for selective reasoning; (ii) a GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce the completion length by up to 90% compared to vanilla GRPO, without sacrificing performance or even improving it. Further evaluations across LLM (GSM8K), VLM (CLEVR, Super-CLEVR, GeoQA), and Agentic (AITZ) tasks, covering a range of reasoning difficulties under both 3B and 7B models, consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in RL approaches. Our code is available at https://github.com/kokolerk/TON.
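The 'thought dropout' step in the SFT stage can be pictured with a minimal sketch. This is not the authors' implementation: the `<think>`/`<answer>` tag names, the `drop_prob` parameter, and the record fields `prompt`, `thought`, and `answer` are all assumptions made for illustration.

```python
import random


def thought_dropout(examples, drop_prob=0.5, seed=0):
    """Randomly replace reasoning traces with an empty thought.

    `examples` is a list of dicts with hypothetical keys
    "prompt", "thought", and "answer".
    """
    rng = random.Random(seed)  # seeded so the dropped subset is reproducible
    out = []
    for ex in examples:
        # With probability drop_prob, keep the tags but empty the reasoning.
        thought = "" if rng.random() < drop_prob else ex["thought"]
        # Tag names are assumed; the actual training template may differ.
        target = f"<think>{thought}</think><answer>{ex['answer']}</answer>"
        out.append({"prompt": ex["prompt"], "target": target})
    return out
```

Training on a mix of full and empty thoughts is what exposes the model to the think-or-not format before the GRPO stage begins.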

A recent study proposes TON, a two-stage training strategy that teaches vision-language models (VLMs) to decide when reasoning is necessary. Group Relative Policy Optimization (GRPO), a prominent Reinforcement Learning (RL) method, encourages models to generate complete reasoning traces before answering, which increases token usage and computational cost. TON instead follows the human habit of skipping reasoning for easy questions and thinking carefully only when needed: a supervised fine-tuning stage with 'thought dropout' is followed by a GRPO stage in which the model explores when to think. The authors report completion lengths up to 90% shorter than vanilla GRPO without sacrificing performance.
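GRPO, as it is commonly described, scores each sampled completion relative to the other completions drawn for the same prompt. The sketch below shows that group-relative normalization only. The use of the population standard deviation and the `eps` stabilizer are assumptions, and implementations differ on both.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of completions for a single prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    # eps (an assumed stabilizer) avoids division by zero when all rewards match.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every completion in a group earns the same reward, all advantages are zero, so that group contributes no learning signal.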

Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models

arXiv:2601.06204v2 Announce Type: replace 
Abstract: Intelligent anomaly detection in dynamic visual environments requires reconciling real-time performance with semantic interpretability. Conventional approaches address only fragments of this challenge. Reconstruction-based models capture low-level deviations without contextual reasoning, object detectors provide speed but limited semantics, and large vision-language systems deliver interpretability at prohibitive computational cost. This work introduces a cascading multi-agent framework that unifies these complementary paradigms into a coherent and interpretable architecture. Early modules perform reconstruction-gated filtering and object-level assessment, while higher-level reasoning agents are selectively invoked to interpret semantically ambiguous events. The system employs adaptive escalation thresholds and a publish-subscribe communication backbone, enabling asynchronous coordination and scalable deployment across heterogeneous hardware. Extensive evaluation on large-scale monitoring data demonstrates that the proposed cascade achieves a threefold reduction in latency compared to direct vision-language inference, while maintaining high perceptual fidelity (PSNR = 38.3 dB, SSIM = 0.965) and consistent semantic labeling. The framework advances beyond conventional detection pipelines by combining early-exit efficiency, adaptive multi-agent reasoning, and explainable anomaly attribution, establishing a reproducible and energy-efficient foundation for scalable intelligent visual monitoring.
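For readers unfamiliar with the fidelity metric quoted above, PSNR is defined as 10 · log10(MAX² / MSE). The sketch below computes it over flat pixel sequences. The 8-bit peak value of 255 is an assumption, since the abstract does not state the pixel range.

```python
import math


def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if not original or len(original) != len(reconstructed):
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error between corresponding pixels.
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Under the same 8-bit assumption, a PSNR of 38.3 dB would correspond to a mean squared error of roughly 9.6.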

A new cascading multi-agent framework for anomaly detection in surveillance systems combines vision-language models with embedding-based classification to balance real-time performance and semantic interpretability. Early stages perform reconstruction-gated filtering and object-level assessment, and higher-level reasoning agents are invoked only for semantically ambiguous events. The authors report a threefold reduction in latency compared to direct vision-language inference.
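The early-exit idea behind such a cascade can be sketched as follows. This is a simplified illustration, not the paper's system: the three callables, the fixed thresholds, and the `CascadeResult` type are hypothetical, and the adaptive thresholds and publish-subscribe messaging mentioned in the abstract are left out.

```python
from dataclasses import dataclass


@dataclass
class CascadeResult:
    stage: str  # which stage produced the final decision
    label: str


def run_cascade(frame, recon_error, detect, reason,
                recon_threshold=0.1, confidence_threshold=0.8):
    """Escalate a frame through cheap stages before calling the costly one."""
    # Stage 1: reconstruction gate. Low error means the frame looks normal.
    if recon_error(frame) < recon_threshold:
        return CascadeResult("reconstruction", "normal")
    # Stage 2: object-level assessment. Accept confident detections as-is.
    label, confidence = detect(frame)
    if confidence >= confidence_threshold:
        return CascadeResult("detector", label)
    # Stage 3: only ambiguous frames reach the vision-language reasoner.
    return CascadeResult("reasoning", reason(frame))
```

Because most frames exit at the first or second stage, the costly reasoning call runs only on the ambiguous remainder, which is where the latency saving would come from.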

Cascading multi-agent anomaly detection in surveillance systems via vision-language models and embedding-based classification

arXiv:2508.13680v3 Announce Type: replace-cross 
Abstract: We introduce VMMU, a Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark designed to evaluate how vision-language models (VLMs) interpret and reason over visual and textual information beyond English. VMMU consists of 2.5k multimodal questions across 7 tasks, covering a diverse range of problem contexts, including STEM problem solving, data interpretation, rule-governed visual reasoning, and abstract visual reasoning. All questions require genuine multimodal integration, rather than reliance on text-only cues or OCR-based shortcuts. We evaluate a diverse set of state-of-the-art proprietary and open-source VLMs on VMMU. Despite strong Vietnamese OCR performance, proprietary models achieve only 66% mean accuracy. Further analysis shows that the primary source of failure is not OCR, but instead multimodal grounding and reasoning over text and visual evidence. Code and data are available at https://vmmu.github.io.
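A benchmark of this shape is typically scored per task and then averaged. The sketch below shows one way to do that, assuming exact-match scoring and an unweighted mean over tasks. The abstract does not say how its 66% mean accuracy is aggregated, so both choices and the record layout are assumptions.

```python
from collections import defaultdict


def accuracy_by_task(records):
    """Return (per-task accuracy, unweighted mean over tasks).

    `records` is an iterable of (task, predicted, gold) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, predicted, gold in records:
        total[task] += 1
        correct[task] += int(predicted == gold)  # exact-match scoring (assumed)
    if not total:
        raise ValueError("no records to score")
    per_task = {task: correct[task] / total[task] for task in total}
    macro = sum(per_task.values()) / len(per_task)
    return per_task, macro
```

An unweighted mean gives small tasks the same influence as large ones, whereas pooling all questions would favor the largest task.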

VMMU, a Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark, assesses how well vision-language models (VLMs) interpret and reason over visual and textual information in Vietnamese. The benchmark includes 2.5k multimodal questions across seven tasks and requires genuine multimodal integration rather than reliance on text-only cues or OCR-based shortcuts. Proprietary models reach only 66% mean accuracy, and the authors attribute most failures to multimodal grounding and reasoning rather than OCR.

VMMU: A Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark

One More Thing in AI – Your Shortcut to AI Mastery
