Are Large Vision Language Models Truly Grounded in Medical Images? Evidence from Italian Clinical Visual Question Answering

arXiv — cs.CV · Tuesday, November 25, 2025 at 5:00:00 AM
  • Recent research has evaluated how well large vision-language models (VLMs) answer medical questions that depend on visual information, using the Italian-language EuropeMedQA dataset. Four models were tested: Claude Sonnet 4.5, GPT-4o, GPT-5-mini, and Gemini 2.0 Flash Exp. The findings indicate varying degrees of visual grounding, with GPT-4o showing the largest drop in accuracy when the visual information was altered (a minimal sketch of such a perturbation probe follows this summary).
  • This investigation is crucial as it highlights the effectiveness and limitations of advanced VLMs in medical contexts, particularly in interpreting visual data, which is essential for accurate medical decision-making. Understanding these models' capabilities can inform their application in clinical settings and improve patient outcomes.
  • The study reflects ongoing debates about the reliability of AI models in critical fields like healthcare, where accuracy is paramount. While advancements in VLMs have been notable, concerns persist regarding their dependency on visual inputs and their ability to maintain performance under altered conditions. This raises questions about the robustness of AI in real-world applications, especially in sensitive areas such as medical diagnostics.
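As a rough illustration of this kind of grounding test, here is a minimal sketch in Python. Everything below (the query_vlm helper, the item schema) is a hypothetical stand-in rather than the paper's actual evaluation code: it simply compares accuracy on original versus altered images, where a large gap suggests genuine reliance on the image.

```python
# Sketch of a visual-grounding probe: compare VQA accuracy with the
# original image versus a perturbed/mismatched one. All names here
# (query_vlm, the item fields) are hypothetical stand-ins.

def query_vlm(model: str, image: bytes, question: str, options: list[str]) -> str:
    """Stand-in for an API call to a vision-language model; returns one option."""
    raise NotImplementedError

def grounding_gap(model: str, items: list[dict]) -> tuple[float, float]:
    """items: dicts with 'image', 'altered_image', 'question', 'options', 'answer'."""
    orig = alt = 0
    for it in items:
        if query_vlm(model, it["image"], it["question"], it["options"]) == it["answer"]:
            orig += 1
        if query_vlm(model, it["altered_image"], it["question"], it["options"]) == it["answer"]:
            alt += 1
    n = len(items)
    # A large drop suggests the model truly reads the image; a small drop
    # suggests it answers from text priors alone.
    return orig / n, alt / n
```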
— via World Pulse Now AI Editorial System


Continue Reading
Alternating Perception-Reasoning for Hallucination-Resistant Video Understanding
Positive · Artificial Intelligence
A new framework called Perception Loop Reasoning (PLR) has been introduced to enhance video understanding by addressing the limitations of existing Video Reasoning LLMs, which often rely on a flawed single-step perception paradigm. This framework integrates a loop-based approach with an anti-hallucination reward system to improve the accuracy and reliability of video analysis.
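The paper trains this behavior with a reward model; purely to illustrate the loop structure (every function name below is a hypothetical stand-in), a reason-then-re-perceive loop that rejects unsupported intermediate claims might be sketched as:

```python
# Illustrative perception-reasoning loop (hypothetical stand-ins throughout).
# Instead of perceiving the video once and reasoning over a frozen caption,
# the reasoner can request fresh perception evidence at every step, and a
# verifier discards steps the evidence does not support.

def perceive(video, request: str) -> str:
    """Stand-in: run a perception tool (captioner, detector) on the video."""
    raise NotImplementedError

def propose_step(question: str, evidence: list[str]) -> dict:
    """Stand-in: LLM proposes {'claim': ..., 'perception_request': ..., 'final': bool}."""
    raise NotImplementedError

def supported(claim: str, evidence: list[str]) -> bool:
    """Stand-in: entailment check; in the paper this role falls to a reward signal."""
    raise NotImplementedError

def perception_loop(video, question: str, max_steps: int = 8):
    evidence: list[str] = []
    for _ in range(max_steps):
        step = propose_step(question, evidence)
        evidence.append(perceive(video, step["perception_request"]))
        if not supported(step["claim"], evidence):
            continue  # drop the unsupported (hallucinated) step and retry
        if step["final"]:
            return step["claim"]
    return None  # abstain rather than hallucinate
```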
LAST: LeArning to Think in Space and Time for Generalist Vision-Language Models
Positive · Artificial Intelligence
The introduction of LAST, or LeArning to Think in Space and Time, aims to enhance the capabilities of vision-language models (VLMs) by enabling them to better understand 3D spatial contexts and long video sequences using only 2D images as input. This approach contrasts with existing methods that typically address 3D and video tasks separately.
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
Positive · Artificial Intelligence
The recent introduction of Video Retrieval-Augmented Generation (Video-RAG) addresses the challenges faced by large video-language models (LVLMs) in comprehending long videos due to limited context. This innovative approach utilizes visually-aligned auxiliary texts extracted from video data to enhance cross-modality alignment without the need for extensive fine-tuning or costly GPU resources.
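As a sketch of the general pattern (not the paper's pipeline, which extracts auxiliary texts with dedicated OCR/ASR/detection tools and scores them with learned embeddings), the retrieval-then-prompt step can be illustrated with a crude token-overlap scorer:

```python
# Minimal Video-RAG-style sketch (hypothetical helpers). Auxiliary texts
# extracted from the video (ASR transcripts, OCR strings, object tags) are
# scored against the question, and only the top-k enter the LVLM prompt,
# keeping the context window small.

def token_overlap(a: str, b: str) -> float:
    """Crude relevance score; a real system would use an embedding model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / (len(ta | tb) or 1)

def build_prompt(question: str, aux_texts: list[str], k: int = 5) -> str:
    top = sorted(aux_texts, key=lambda t: token_overlap(question, t), reverse=True)[:k]
    context = "\n".join(f"- {t}" for t in top)
    return f"Video evidence:\n{context}\n\nQuestion: {question}\nAnswer:"

aux = ["speaker says the bridge opened in 1932",
       "on-screen text: Sydney Harbour",
       "detected objects: boat, water, bridge"]
print(build_prompt("When did the bridge open?", aux, k=2))
```

The design point is that the model never ingests the full video transcript; only query-aligned snippets reach the prompt, which is what avoids long-context fine-tuning.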
VCE: Safe Autoregressive Image Generation via Visual Contrast Exploitation
Positive · Artificial Intelligence
A novel framework called Visual Contrast Exploitation (VCE) has been proposed to enhance the safety of autoregressive image generation models, which have gained attention for their ability to create highly realistic images. This framework aims to address concerns regarding the generation of Not-Safe-For-Work (NSFW) content and copyright infringement by introducing a method for constructing contrastive image pairs that effectively decouple unsafe content from the generated images.
Toward Trustworthy Difficulty Assessments: Large Language Models as Judges in Programming and Synthetic Tasks
Negative · Artificial Intelligence
Large language models (LLMs) like GPT-4o have been evaluated as judges of programming-task difficulty and compared against a LightGBM ensemble baseline. The study found that LightGBM classified LeetCode problems with 86% accuracy, while GPT-4o reached only 37.75%, indicating significant limitations of LLMs for this kind of structured assessment.
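For readers unfamiliar with the structured baseline, a minimal LightGBM classification pipeline looks like the following; the four features and the synthetic data are hypothetical placeholders, not the study's feature set:

```python
# Sketch of a LightGBM difficulty classifier over hand-crafted task features.
# Synthetic data stands in for real problem features, so accuracy here is
# near chance; the point is the pipeline shape, not the number.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.random((600, 4))      # e.g. statement length, #constraints, #examples, tag count
y = rng.integers(0, 3, 600)   # 0=Easy, 1=Medium, 2=Hard (LeetCode-style labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LGBMClassifier(n_estimators=200).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```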
Look It Up: Analysing Internal Web Search Capabilities of Modern LLMs
Neutral · Artificial Intelligence
Modern large language models (LLMs) like GPT-5-mini and Claude Haiku 4.5 have been evaluated for their internal web search capabilities, revealing that while web access improves accuracy for static queries, it does not effectively enhance performance on dynamic queries due to poor query formulation. This assessment introduces a benchmark to measure the necessity and effectiveness of web searches in real-time responses.
Bridging Symbolic Control and Neural Reasoning in LLM Agents: The Structured Cognitive Loop
Positive · Artificial Intelligence
A new architecture called Structured Cognitive Loop (SCL) has been introduced to address fundamental issues in large language model agents, such as entangled reasoning and memory volatility. SCL separates cognition into five distinct phases: Retrieval, Cognition, Control, Action, and Memory, while employing Soft Symbolic Control to enhance explainability and controllability. Empirical tests show SCL achieves zero policy violations and maintains decision traceability.
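As a structural illustration only, the five phases might be skeletonized as below; the phase names come from the paper, while every method body is a hypothetical stand-in:

```python
# Skeleton of the five-phase loop (Retrieval, Cognition, Control, Action,
# Memory). All bodies are hypothetical stand-ins; only the structure is
# taken from the paper's description.

class StructuredCognitiveLoop:
    def __init__(self, policy_rules):
        self.memory_store: list[str] = []   # explicit, inspectable memory
        self.policy_rules = policy_rules    # soft symbolic constraints

    def retrieval(self, goal: str) -> list[str]:
        return [m for m in self.memory_store if goal in m]

    def cognition(self, goal: str, context: list[str]) -> dict:
        return {"plan": f"answer: {goal}", "context": context}  # stand-in for an LLM call

    def control(self, thought: dict) -> bool:
        # Soft Symbolic Control: block any plan that violates a rule; checking
        # here, before action, is what makes decisions traceable.
        return all(rule(thought["plan"]) for rule in self.policy_rules)

    def action(self, thought: dict) -> str:
        return f"executed {thought['plan']}"  # stand-in for tool use

    def memory(self, goal: str, result: str) -> None:
        self.memory_store.append(f"{goal} -> {result}")

    def run(self, goal: str) -> str:
        thought = self.cognition(goal, self.retrieval(goal))
        if not self.control(thought):
            return "refused: policy violation"
        result = self.action(thought)
        self.memory(goal, result)
        return result

scl = StructuredCognitiveLoop(policy_rules=[lambda plan: "rm -rf" not in plan])
print(scl.run("summarize the report"))
```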
Lessons from Studying Two-Hop Latent Reasoning
Neutral · Artificial Intelligence
Recent research has focused on the latent reasoning capabilities of large language models (LLMs), specifically through a study on two-hop question answering. The investigation revealed that LLMs, including Llama 3 and GPT-4o, struggle with this basic reasoning task without employing chain-of-thought (CoT) techniques, which are essential for complex agentic tasks.
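A minimal sketch of the probe (query_model is a hypothetical stand-in for an LLM API call) contrasts the direct question with a chain-of-thought variant:

```python
# Two-hop probe sketch: the same composed question asked directly versus
# with an explicit chain-of-thought instruction.

def compare_two_hop(query_model):
    q = "What is the capital of the country where Mount Fuji is located?"
    # Direct form: the model must resolve the intermediate entity latently.
    direct = query_model(q + "\nAnswer with the name only.")
    # CoT form: the intermediate entity ("Japan") is surfaced in text first.
    cot = query_model(q + "\nThink step by step: first name the country, "
                          "then give its capital.")
    return direct, cot
```

The reported pattern is that models often fail the direct (latent) form even when each single hop is answerable on its own, and recover once the intermediate entity is made explicit.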