DIQ-H: Evaluating Hallucination Persistence in VLMs Under Temporal Visual Degradation

arXiv — cs.CV · Thursday, December 4, 2025 at 5:00:00 AM
  • The introduction of DIQ-H marks a significant advancement in evaluating the robustness of Vision-Language Models (VLMs) under temporal visual degradation, targeting critical failure modes such as hallucination persistence. The benchmark applies a range of physics-based corruptions to video frames and measures how VLMs recover from errors across successive frames in dynamic environments.
  • This development is important for the reliability of VLMs in safety-critical applications such as autonomous driving, where continuous visual processing is essential. By focusing on error recovery and temporal consistency, DIQ-H aims to characterize and improve VLM performance in real-world scenarios where visual inputs may be compromised.
  • The challenges VLMs face, including instability under minor input changes and susceptibility to hallucination, highlight ongoing concerns in the field. As researchers explore frameworks and benchmarks to strengthen VLM capabilities, robust evaluation methods like DIQ-H become increasingly important for ensuring these models operate effectively in unpredictable environments; a sketch of such an evaluation loop follows below.
— via World Pulse Now AI Editorial System
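The summary above does not spell out DIQ-H's actual corruption suite, prompts, or metrics, so the Python sketch below only illustrates the general idea: a simple blur-plus-noise corruption ramps up and back down across a frame sequence, a VLM is queried on every frame, and the score is the fraction of post-recovery frames on which an earlier wrong answer still persists. The corruption schedule, the query_vlm stub, and the persistence metric are assumptions for illustration, not the benchmark's implementation.

```python
# Illustrative sketch only: DIQ-H's real corruptions, prompts, and metrics are not
# described in this summary; query_vlm() is a hypothetical placeholder.
import numpy as np
from PIL import Image, ImageFilter

def corrupt(frame: Image.Image, severity: float) -> Image.Image:
    """Apply a simple physics-inspired degradation (blur plus additive noise)."""
    blurred = frame.filter(ImageFilter.GaussianBlur(radius=4 * severity))
    arr = np.asarray(blurred).astype(np.float32)
    noisy = arr + np.random.normal(0, 25 * severity, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

def query_vlm(frame: Image.Image, question: str) -> str:
    """Placeholder for a real VLM call (open-source model or hosted API)."""
    raise NotImplementedError

def hallucination_persistence(frames, question, ground_truth):
    """Degrade mid-sequence, then measure whether errors outlast the degradation."""
    n = len(frames)
    # Severity ramps up and back down: clean -> degraded -> clean again.
    severities = [max(0.0, 1.0 - abs(i - n // 2) / (n // 4 + 1)) for i in range(n)]
    answers = [query_vlm(corrupt(f, s), question) for f, s in zip(frames, severities)]
    recovered = [i for i, s in enumerate(severities) if s == 0.0 and i > n // 2]
    if not recovered:
        return 0.0
    # Fraction of post-recovery frames where the earlier error still persists.
    return sum(answers[i] != ground_truth for i in recovered) / len(recovered)
```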

Continue Reading
Hierarchical Process Reward Models are Symbolic Vision Learners
Positive · Artificial Intelligence
A novel self-supervised symbolic auto-encoder has been introduced, enabling symbolic computer vision to interpret diagrams through structured representations and logical rules. This approach contrasts with traditional pixel-based visual models by parsing diagrams into geometric primitives, enhancing machine vision's interpretability.
Object Counting with GPT-4o and GPT-5: A Comparative Study
Positive · Artificial Intelligence
A comparative study has been conducted on the object counting capabilities of two multi-modal large language models, GPT-4o and GPT-5, focusing on their performance in zero-shot scenarios using only textual prompts. The evaluation was carried out on the FSC-147 and CARPK datasets, revealing that both models achieved results comparable to state-of-the-art methods, with some instances exceeding them.
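The study's exact prompting protocol is not given here; as a rough illustration of zero-shot counting with a textual prompt, the sketch below sends an image and a counting instruction through the OpenAI Python SDK and parses an integer from the reply. The prompt wording, the FSC-147-style image path, and the answer parsing are assumptions, and the GPT-5 model identifier would need to match whatever name the API exposes.

```python
# Minimal zero-shot counting sketch using the OpenAI Python SDK; the prompt text,
# image path, and answer parsing are illustrative assumptions, not the study's setup.
import base64
import re
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def count_objects(image_path: str, category: str, model: str = "gpt-4o") -> int:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Count the number of {category} in this image. "
                         f"Reply with a single integer."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    match = re.search(r"\d+", response.choices[0].message.content)
    return int(match.group()) if match else 0

# Example (hypothetical path from an FSC-147-style image folder):
# predicted = count_objects("fsc147/images/384.jpg", "apples")
```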
Look, Recite, Then Answer: Enhancing VLM Performance via Self-Generated Knowledge Hints
Positive · Artificial Intelligence
A new framework called 'Look, Recite, Then Answer' has been proposed to enhance the performance of Vision-Language Models (VLMs) by having the models generate their own knowledge hints before answering. This approach aims to address the limitations of VLMs in specialized fields like precision agriculture, where reasoning-driven hallucination can hinder accurate visual perception.
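The framework's actual prompts and hint format are not described in this blurb; the sketch below only captures the general two-stage pattern, with vlm_chat standing in as a hypothetical wrapper around any VLM endpoint.

```python
# Hedged sketch of a "look, recite, then answer" style two-stage prompt; the actual
# framework's prompts and hint format are not described in this summary.

def vlm_chat(image, prompt: str) -> str:
    """Hypothetical wrapper around any VLM chat endpoint."""
    raise NotImplementedError

def look_recite_answer(image, question: str) -> str:
    # Stage 1 ("recite"): have the model write down relevant domain knowledge
    # grounded in what it sees, before committing to an answer.
    hint = vlm_chat(
        image,
        "List the visual facts and domain knowledge relevant to answering: "
        f"{question}. Do not answer yet.",
    )
    # Stage 2 ("answer"): answer conditioned on the self-generated hints,
    # which is intended to reduce reasoning-driven hallucination.
    return vlm_chat(
        image,
        f"Knowledge hints:\n{hint}\n\nUsing only these hints and the image, "
        f"answer: {question}",
    )
```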
Can VLMs Detect and Localize Fine-Grained AI-Edited Images?
Neutral · Artificial Intelligence
A recent study has introduced FragFake, a large-scale benchmark aimed at improving the detection and localization of fine-grained AI-edited images. This initiative addresses significant challenges in current AI-generated content (AIGC) detection methods, which often fail to pinpoint where edits occur and rely on expensive pixel-level annotations. The research explores the capabilities of vision language models (VLMs) in classifying edited images and identifying specific edited regions.
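FragFake's output format and metrics are not detailed in this summary; as a small illustration of how edit localization might be scored, the helper below parses an assumed 'x1,y1,x2,y2' box from a model response and computes intersection-over-union against a ground-truth region.

```python
# Illustrative evaluation helper for edit localization; FragFake's actual output
# format and metrics are not described in this summary, so the box format is assumed.
import re

def parse_box(text: str):
    """Parse 'x1,y1,x2,y2' style coordinates from a model response (assumed format)."""
    nums = [int(n) for n in re.findall(r"\d+", text)[:4]]
    return tuple(nums) if len(nums) == 4 else None

def iou(a, b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

# Example: iou(parse_box("edited region: 40, 60, 180, 220"), (50, 70, 175, 210))
```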
Language-Driven Object-Oriented Two-Stage Method for Scene Graph Anticipation
Positive · Artificial Intelligence
A new method for Scene Graph Anticipation (SGA) has been introduced, termed Linguistic Scene Graph Anticipation (LSGA), which utilizes a language-driven framework to enhance the prediction of future scene graphs from video clips. This approach aims to improve the understanding of dynamic scenes by integrating semantic dynamics and commonsense temporal regularities, which are often difficult to extract from visual features alone.
Too Late to Recall: Explaining the Two-Hop Problem in Multimodal Knowledge Retrieval
Neutral · Artificial Intelligence
A recent study highlights the challenges faced by vision-language models (VLMs) in factual recall, identifying a two-hop problem that involves forming entity representations from visual inputs and recalling associated knowledge. The research benchmarks 14 VLMs, revealing that 11 of them show a decline in factual recall performance compared to their large language model (LLM) counterparts.
SpatialReasoner: Active Perception for Large-Scale 3D Scene Understanding
Positive · Artificial Intelligence
The introduction of SpatialReasoner marks a significant advancement in spatial reasoning for large-scale 3D environments, addressing challenges faced by existing vision-language models that are limited to smaller, room-scale scenarios. This framework utilizes the H²U3D dataset, which encompasses multi-floor environments and generates diverse question-answer pairs to enhance 3D scene understanding.
EEA: Exploration-Exploitation Agent for Long Video Understanding
Positive · Artificial Intelligence
The introduction of the EEA framework marks a significant advancement in long video understanding, addressing challenges related to the efficient navigation of extensive visual data. EEA balances exploration and exploitation through a hierarchical tree search process, enabling the autonomous discovery of task-relevant semantic queries and the collection of closely matched video frames as semantic anchors.
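EEA's actual scoring, query generation, and tree policy are not described in this blurb; the sketch below shows a generic exploration-exploitation (UCB-style) search over a hierarchy of video segments, with relevance as a hypothetical stand-in for a semantic scorer such as CLIP similarity, and the collected high-scoring frames playing the role of semantic anchors.

```python
# Generic exploration-exploitation sketch over a video segment tree; EEA's actual
# scoring, query generation, and tree policy are not described in this summary.
import math
import random

def relevance(frame_index: int, query: str) -> float:
    """Hypothetical stand-in for a semantic relevance score (e.g., CLIP similarity)."""
    raise NotImplementedError

class SegmentNode:
    def __init__(self, start: int, end: int):
        self.start, self.end = start, end
        self.children = []
        self.visits = 0
        self.value = 0.0
        if end - start > 8:  # split long segments into a hierarchy
            mid = (start + end) // 2
            self.children = [SegmentNode(start, mid), SegmentNode(mid, end)]

    def ucb(self, parent_visits: int, c: float = 1.4) -> float:
        if self.visits == 0:
            return float("inf")  # always try unvisited segments first
        return self.value / self.visits + c * math.sqrt(math.log(parent_visits) / self.visits)

def search_anchors(num_frames: int, query: str, budget: int = 32):
    """Collect candidate anchor frames by balancing exploration and exploitation."""
    root, anchors = SegmentNode(0, num_frames), []
    for _ in range(budget):
        path, node = [root], root
        while node.children:  # descend toward the most promising segment
            node = max(node.children, key=lambda ch: ch.ucb(max(node.visits, 1)))
            path.append(node)
        frame = random.randrange(node.start, node.end)
        score = relevance(frame, query)
        for n in path:  # back-propagate visit counts and relevance
            n.visits += 1
            n.value += score
        anchors.append((frame, score))
    return sorted(anchors, key=lambda fs: fs[1], reverse=True)[:8]
```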