IndustryNav: Exploring Spatial Reasoning of Embodied Agents in Dynamic Industrial Navigation

arXiv — cs.CV · Monday, November 24, 2025 at 5:00:00 AM
  • IndustryNav has been introduced as the first dynamic industrial navigation benchmark for evaluating the spatial reasoning of embodied agents. The benchmark comprises 12 high-fidelity Unity warehouse scenarios that incorporate dynamic objects and human movement, addressing a limitation of existing benchmarks, which focus on static environments.
  • IndustryNav is significant because it assesses how well Visual Large Language Models (VLLMs) handle navigation under realistic, changing conditions. By introducing safety-oriented metrics such as collision rate and warning rate, the benchmark could lead to safer and more effective navigation systems in complex industrial settings.
— via World Pulse Now AI Editorial System
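The article names collision rate and warning rate as the benchmark's safety metrics but does not define them. A minimal sketch of how such episode-level rates are commonly computed (the `Episode` structure and the per-episode aggregation are assumptions, not the paper's actual formulation):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Episode:
    """Hypothetical log of one navigation episode."""
    collisions: int  # number of contacts with obstacles or humans
    warnings: int    # number of near-miss / proximity warnings
    steps: int       # episode length in agent steps

def collision_rate(episodes: List[Episode]) -> float:
    # Fraction of episodes containing at least one collision.
    return sum(1 for e in episodes if e.collisions > 0) / len(episodes)

def warning_rate(episodes: List[Episode]) -> float:
    # Fraction of episodes containing at least one warning event.
    return sum(1 for e in episodes if e.warnings > 0) / len(episodes)

episodes = [Episode(0, 1, 120), Episode(2, 3, 95), Episode(0, 0, 110)]
print(f"collision rate: {collision_rate(episodes):.2f}")
print(f"warning rate:   {warning_rate(episodes):.2f}")
```

Per-step variants (events divided by total steps) are an equally plausible definition; the paper should be consulted for the exact one.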


Continue Reading
Look It Up: Analysing Internal Web Search Capabilities of Modern LLMs
Neutral · Artificial Intelligence
Modern large language models (LLMs) like GPT-5-mini and Claude Haiku 4.5 have been evaluated for their internal web search capabilities, revealing that while web access improves accuracy for static queries, it does not effectively enhance performance on dynamic queries due to poor query formulation. This assessment introduces a benchmark to measure the necessity and effectiveness of web searches in real-time responses.
Are Large Vision Language Models Truly Grounded in Medical Images? Evidence from Italian Clinical Visual Question Answering
Neutral · Artificial Intelligence
Recent research has evaluated the performance of large vision language models (VLMs) in answering medical questions based on visual information, specifically using the EuropeMedQA Italian dataset. Four models were tested: Claude Sonnet 4.5, GPT-4o, GPT-5-mini, and Gemini 2.0 flash exp. The findings indicate varying degrees of visual grounding, with GPT-4o showing the most significant drop in accuracy when visual information was altered.