Are Large Vision Language Models Truly Grounded in Medical Images? Evidence from Italian Clinical Visual Question Answering

arXiv — cs.CV · Tuesday, November 25, 2025 at 5:00:00 AM
  • Recent research has evaluated how well large vision-language models (VLMs) answer medical questions that depend on visual information, using the Italian-language EuropeMedQA dataset. Four models were tested: Claude Sonnet 4.5, GPT-4o, GPT-5-mini, and Gemini 2.0 Flash Exp. The findings indicate varying degrees of visual grounding, with GPT-4o showing the largest drop in accuracy when the visual information was altered (a minimal sketch of such a perturbation probe follows this summary).
  • This investigation is crucial as it highlights the effectiveness and limitations of advanced VLMs in medical contexts, particularly in interpreting visual data, which is essential for accurate medical decision-making. Understanding these models' capabilities can inform their application in clinical settings and improve patient outcomes.
  • The study reflects ongoing debates about the reliability of AI models in critical fields like healthcare, where accuracy is paramount. While advancements in VLMs have been notable, concerns persist regarding their dependency on visual inputs and their ability to maintain performance under altered conditions. This raises questions about the robustness of AI in real-world applications, especially in sensitive areas such as medical diagnostics.
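As a rough illustration of this kind of grounding test, here is a minimal sketch in Python. Everything below (the query_vlm helper, the item schema) is a hypothetical stand-in rather than the paper's actual evaluation code: it simply compares accuracy on original versus altered images, where a large gap suggests genuine reliance on the image.

```python
# Sketch of a visual-grounding probe: compare VQA accuracy with the
# original image versus a perturbed/mismatched one. All names here
# (query_vlm, the item fields) are hypothetical stand-ins.

def query_vlm(model: str, image: bytes, question: str, options: list[str]) -> str:
    """Stand-in for an API call to a vision-language model; returns one option."""
    raise NotImplementedError

def grounding_gap(model: str, items: list[dict]) -> tuple[float, float]:
    """items: dicts with 'image', 'altered_image', 'question', 'options', 'answer'."""
    orig = alt = 0
    for it in items:
        if query_vlm(model, it["image"], it["question"], it["options"]) == it["answer"]:
            orig += 1
        if query_vlm(model, it["altered_image"], it["question"], it["options"]) == it["answer"]:
            alt += 1
    n = len(items)
    # A large drop suggests the model truly reads the image; a small drop
    # suggests it answers from text priors alone.
    return orig / n, alt / n
```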
— via World Pulse Now AI Editorial System


Continue Reading
Alternating Perception-Reasoning for Hallucination-Resistant Video Understanding
Positive · Artificial Intelligence
A new framework called Perception Loop Reasoning (PLR) has been introduced to enhance video understanding by addressing the limitations of existing Video Reasoning LLMs, which often rely on a flawed single-step perception paradigm. This framework integrates a loop-based approach with an anti-hallucination reward system to improve the accuracy and reliability of video analysis.
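The paper trains this behavior with a reward model; purely to illustrate the loop structure (every function name below is a hypothetical stand-in), a reason-then-re-perceive loop that rejects unsupported intermediate claims might be sketched as:

```python
# Illustrative perception-reasoning loop (hypothetical stand-ins throughout).
# Instead of perceiving the video once and reasoning over a frozen caption,
# the reasoner can request fresh perception evidence at every step, and a
# verifier discards steps the evidence does not support.

def perceive(video, request: str) -> str:
    """Stand-in: run a perception tool (captioner, detector) on the video."""
    raise NotImplementedError

def propose_step(question: str, evidence: list[str]) -> dict:
    """Stand-in: LLM proposes {'claim': ..., 'perception_request': ..., 'final': bool}."""
    raise NotImplementedError

def supported(claim: str, evidence: list[str]) -> bool:
    """Stand-in: entailment check; in the paper this role falls to a reward signal."""
    raise NotImplementedError

def perception_loop(video, question: str, max_steps: int = 8):
    evidence: list[str] = []
    for _ in range(max_steps):
        step = propose_step(question, evidence)
        evidence.append(perceive(video, step["perception_request"]))
        if not supported(step["claim"], evidence):
            continue  # drop the unsupported (hallucinated) step and retry
        if step["final"]:
            return step["claim"]
    return None  # abstain rather than hallucinate
```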
LAST: LeArning to Think in Space and Time for Generalist Vision-Language Models
Positive · Artificial Intelligence
The introduction of LAST, or LeArning to Think in Space and Time, aims to enhance the capabilities of vision-language models (VLMs) by enabling them to better understand 3D spatial contexts and long video sequences using only 2D images as input. This approach contrasts with existing methods that typically address 3D and video tasks separately.
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
Positive · Artificial Intelligence
The recent introduction of Video Retrieval-Augmented Generation (Video-RAG) addresses the challenges faced by large video-language models (LVLMs) in comprehending long videos due to limited context. This innovative approach utilizes visually-aligned auxiliary texts extracted from video data to enhance cross-modality alignment without the need for extensive fine-tuning or costly GPU resources.
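As a sketch of the general pattern (not the paper's pipeline, which extracts auxiliary texts with dedicated OCR/ASR/detection tools and scores them with learned embeddings), the retrieval-then-prompt step can be illustrated with a crude token-overlap scorer:

```python
# Minimal Video-RAG-style sketch (hypothetical helpers). Auxiliary texts
# extracted from the video (ASR transcripts, OCR strings, object tags) are
# scored against the question, and only the top-k enter the LVLM prompt,
# keeping the context window small.

def token_overlap(a: str, b: str) -> float:
    """Crude relevance score; a real system would use an embedding model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / (len(ta | tb) or 1)

def build_prompt(question: str, aux_texts: list[str], k: int = 5) -> str:
    top = sorted(aux_texts, key=lambda t: token_overlap(question, t), reverse=True)[:k]
    context = "\n".join(f"- {t}" for t in top)
    return f"Video evidence:\n{context}\n\nQuestion: {question}\nAnswer:"

aux = ["speaker says the bridge opened in 1932",
       "on-screen text: Sydney Harbour",
       "detected objects: boat, water, bridge"]
print(build_prompt("When did the bridge open?", aux, k=2))
```

The design point is that the model never ingests the full video transcript; only query-aligned snippets reach the prompt, which is what avoids long-context fine-tuning.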
VCE: Safe Autoregressive Image Generation via Visual Contrast Exploitation
Positive · Artificial Intelligence
A novel framework called Visual Contrast Exploitation (VCE) has been proposed to enhance the safety of autoregressive image generation models, which have gained attention for their ability to create highly realistic images. This framework aims to address concerns regarding the generation of Not-Safe-For-Work (NSFW) content and copyright infringement by introducing a method for constructing contrastive image pairs that effectively decouple unsafe content from the generated images.
Toward Trustworthy Difficulty Assessments: Large Language Models as Judges in Programming and Synthetic Tasks
Negative · Artificial Intelligence
Large language models (LLMs) like GPT-4o have been evaluated as judges of programming-task difficulty and compared against a LightGBM ensemble baseline. The study found that LightGBM classified LeetCode problems with 86% accuracy, while GPT-4o reached only 37.75%, indicating significant limitations of LLMs for this kind of structured assessment.
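For readers unfamiliar with the structured baseline, a minimal LightGBM classification pipeline looks like the following; the four features and the synthetic data are hypothetical placeholders, not the study's feature set:

```python
# Sketch of a LightGBM difficulty classifier over hand-crafted task features.
# Synthetic data stands in for real problem features, so accuracy here is
# near chance; the point is the pipeline shape, not the number.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.random((600, 4))      # e.g. statement length, #constraints, #examples, tag count
y = rng.integers(0, 3, 600)   # 0=Easy, 1=Medium, 2=Hard (LeetCode-style labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LGBMClassifier(n_estimators=200).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```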
Look It Up: Analysing Internal Web Search Capabilities of Modern LLMs
Neutral · Artificial Intelligence
Modern large language models (LLMs) like GPT-5-mini and Claude Haiku 4.5 have been evaluated for their internal web search capabilities, revealing that while web access improves accuracy for static queries, it does not effectively enhance performance on dynamic queries due to poor query formulation. This assessment introduces a benchmark to measure the necessity and effectiveness of web searches in real-time responses.
Bridging Symbolic Control and Neural Reasoning in LLM Agents: The Structured Cognitive Loop
Positive · Artificial Intelligence
A new architecture called Structured Cognitive Loop (SCL) has been introduced to address fundamental issues in large language model agents, such as entangled reasoning and memory volatility. SCL separates cognition into five distinct phases: Retrieval, Cognition, Control, Action, and Memory, while employing Soft Symbolic Control to enhance explainability and controllability. Empirical tests show SCL achieves zero policy violations and maintains decision traceability.
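As a structural illustration only, the five phases might be skeletonized as below; the phase names come from the paper, while every method body is a hypothetical stand-in:

```python
# Skeleton of the five-phase loop (Retrieval, Cognition, Control, Action,
# Memory). All bodies are hypothetical stand-ins; only the structure is
# taken from the paper's description.

class StructuredCognitiveLoop:
    def __init__(self, policy_rules):
        self.memory_store: list[str] = []   # explicit, inspectable memory
        self.policy_rules = policy_rules    # soft symbolic constraints

    def retrieval(self, goal: str) -> list[str]:
        return [m for m in self.memory_store if goal in m]

    def cognition(self, goal: str, context: list[str]) -> dict:
        return {"plan": f"answer: {goal}", "context": context}  # stand-in for an LLM call

    def control(self, thought: dict) -> bool:
        # Soft Symbolic Control: block any plan that violates a rule; checking
        # here, before action, is what makes decisions traceable.
        return all(rule(thought["plan"]) for rule in self.policy_rules)

    def action(self, thought: dict) -> str:
        return f"executed {thought['plan']}"  # stand-in for tool use

    def memory(self, goal: str, result: str) -> None:
        self.memory_store.append(f"{goal} -> {result}")

    def run(self, goal: str) -> str:
        thought = self.cognition(goal, self.retrieval(goal))
        if not self.control(thought):
            return "refused: policy violation"
        result = self.action(thought)
        self.memory(goal, result)
        return result

scl = StructuredCognitiveLoop(policy_rules=[lambda plan: "rm -rf" not in plan])
print(scl.run("summarize the report"))
```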
Lessons from Studying Two-Hop Latent Reasoning
Neutral · Artificial Intelligence
Recent research has focused on the latent reasoning capabilities of large language models (LLMs), specifically through a study on two-hop question answering. The investigation revealed that LLMs, including Llama 3 and GPT-4o, struggle with this basic reasoning task without employing chain-of-thought (CoT) techniques, which are essential for complex agentic tasks.
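A minimal sketch of the probe (query_model is a hypothetical stand-in for an LLM API call) contrasts the direct question with a chain-of-thought variant:

```python
# Two-hop probe sketch: the same composed question asked directly versus
# with an explicit chain-of-thought instruction.

def compare_two_hop(query_model):
    q = "What is the capital of the country where Mount Fuji is located?"
    # Direct form: the model must resolve the intermediate entity latently.
    direct = query_model(q + "\nAnswer with the name only.")
    # CoT form: the intermediate entity ("Japan") is surfaced in text first.
    cot = query_model(q + "\nThink step by step: first name the country, "
                          "then give its capital.")
    return direct, cot
```

The reported pattern is that models often fail the direct (latent) form even when each single hop is answerable on its own, and recover once the intermediate entity is made explicit.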