Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning
Positive · Artificial Intelligence
- A new framework called GroundingAgent has been introduced to enhance visual grounding, connecting textual queries to specific image regions without task-specific fine-tuning. The approach uses a structured reasoning mechanism that integrates pretrained object detectors with multimodal language models, achieving 65.1% zero-shot grounding accuracy on established benchmarks such as RefCOCO and RefCOCOg (a rough sketch of this kind of pipeline follows the list).
- GroundingAgent is significant because it sidesteps the heavy annotation and fine-tuning requirements of existing visual grounding methods, improving generalization to novel scenarios. This could streamline the integration of vision and language tasks, making them more accessible and efficient.
- The introduction of GroundingAgent also underscores an ongoing challenge in visual grounding: building robust models that operate without extensive training. This connects to recent discussions of vulnerabilities in vision-language models and of training strategies, such as curriculum-based optimization and reinforcement learning, aimed at improving performance on complex visual tasks.
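
The summary gives no implementation details, but the described pipeline (a frozen detector proposes candidate regions, and a multimodal model reasons over them to select the one matching the query) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the paper's actual method: `detect_objects` and `score_region` are hypothetical stand-ins for whatever pretrained detector and multimodal language model the framework uses, and the score-combination logic is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) pixel coordinates


@dataclass
class Candidate:
    box: Box
    label: str        # class name assigned by the pretrained detector
    det_score: float  # detector confidence in [0, 1]


def ground_query(
    image: object,
    query: str,
    detect_objects: Callable[[object], List[Candidate]],
    score_region: Callable[[object, Box, str], float],
) -> Optional[Candidate]:
    """Training-free grounding sketch: a frozen detector proposes
    regions, a multimodal model scores how well each region matches
    the textual query, and the best-scoring region is returned.

    Both callables are hypothetical interfaces, not APIs from the paper.
    """
    candidates = detect_objects(image)
    if not candidates:
        return None

    best, best_score = None, float("-inf")
    for cand in candidates:
        # Blend the multimodal match score with detector confidence;
        # the 0.8/0.2 weighting is purely illustrative.
        match = score_region(image, cand.box, query)
        score = 0.8 * match + 0.2 * cand.det_score
        if score > best_score:
            best, best_score = cand, score
    return best
```

In practice, `score_region` would likely crop the image to the candidate box and query the multimodal model about the crop's match to the referring expression; the "agentic" aspect of the framework presumably iterates such a reason-and-select loop rather than making a single pass, but the summary does not specify this.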
— via World Pulse Now AI Editorial System
