PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision-Language Models

arXiv — cs.CV · Monday, November 17, 2025 at 5:00:00 AM
Large vision-language models (LVLMs) are increasingly capable, but they remain prone to object hallucinations. This study shows that LVLMs often disregard the actual image and instead rely on previously generated output tokens to predict new objects. The authors quantify this behavior by measuring the mutual information between the image and the predicted object, and find a strong correlation between weak image dependence and hallucination. Building on this observation, they introduce the Prelim Attention Score (PAS), a lightweight metric that detects object hallucinations effectively without additional training.
— via World Pulse Now AI Editorial System
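The summary does not reproduce the paper's exact PAS formulation, but the core idea of scoring how much attention the next-token prediction places on image tokens can be sketched as follows. The tensor shapes, the image-token mask, and the 0.2 threshold below are illustrative assumptions, not the authors' settings.

```python
# A minimal sketch of an attention-mass score in the spirit of PAS (not the paper's exact formula).
import torch

def prelim_attention_score(attn, image_token_mask):
    """Fraction of attention mass the next-token prediction places on image tokens.

    attn: (layers, heads, seq_len) attention weights from the final query position
          (the token about to be generated) to every context token.
    image_token_mask: (seq_len,) boolean mask marking the image patch tokens.
    """
    attn = attn / attn.sum(dim=-1, keepdim=True)      # renormalise per layer/head
    image_mass = attn[..., image_token_mask].sum(-1)  # mass landing on image tokens
    return image_mass.mean().item()                   # average over layers and heads

# Toy usage with random weights standing in for a real LVLM forward pass.
layers, heads, seq_len, n_image = 32, 32, 600, 576
attn = torch.rand(layers, heads, seq_len)
mask = torch.zeros(seq_len, dtype=torch.bool)
mask[:n_image] = True                                  # assume image tokens come first

pas = prelim_attention_score(attn, mask)
print(f"PAS = {pas:.3f}")                              # low PAS -> weak image dependence
if pas < 0.2:                                          # threshold is an illustrative assumption
    print("flag: likely hallucinated object")
```

Because the score reuses attention weights already produced during generation, such a check adds essentially no overhead and needs no extra training, which is what makes the metric lightweight.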


Recommended Readings
Curing Semantic Drift: A Dynamic Approach to Grounding Generation in Large Vision-Language Models
Positive · Artificial Intelligence
Large Vision-Language Models (LVLMs) often experience 'semantic drift', a phenomenon where they progressively detach from visual input, leading to hallucinations. Current training-free decoding strategies have limitations, including high computational costs and reliance on unreliable proxies. The introduction of Dynamic Logits Calibration (DLC) offers a novel, efficient solution to this issue. DLC operates in real-time, performing visual alignment checks to ensure that the generated outputs remain grounded in visual evidence.
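The blurb does not spell out DLC's calibration rule; the sketch below shows one generic way a decode-time logit adjustment against an image-ablated pass could look. The contrastive form and the alpha weight are assumptions, not the paper's method.

```python
# A hedged sketch of decode-time logit calibration; not the actual DLC update rule.
import torch
import torch.nn.functional as F

def calibrated_next_token_logits(logits_with_image, logits_without_image, alpha=1.0):
    """Amplify evidence that disappears when the image is removed.

    logits_with_image:    (vocab,) next-token logits from the full image+text context.
    logits_without_image: (vocab,) logits for the same text with the image ablated.
    alpha: calibration strength (assumed hyperparameter).
    """
    # Tokens whose score collapses without the image are visually grounded;
    # widening that gap discourages continuations driven purely by language priors.
    return logits_with_image + alpha * (logits_with_image - logits_without_image)

# Toy usage with random logits standing in for two forward passes of an LVLM.
vocab = 32000
with_img, without_img = torch.randn(vocab), torch.randn(vocab)
probs = F.softmax(calibrated_next_token_logits(with_img, without_img), dim=-1)
print(int(torch.argmax(probs)))
```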
Draft and Refine with Visual Experts
Positive · Artificial Intelligence
Recent advancements in Large Vision-Language Models (LVLMs) reveal their strong multimodal reasoning capabilities. However, these models often generate ungrounded or hallucinated responses due to an overreliance on linguistic priors rather than visual evidence. To address this issue, a new framework called Draft and Refine (DnR) has been proposed, which utilizes a question-conditioned metric to quantify the model's reliance on visual information, enhancing the accuracy and reliability of responses.
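As a rough illustration of how a question-conditioned reliance check could drive a refine step, the sketch below compares answer distributions produced with and without the image. The KL-based metric, the threshold, and the refine_fn hook are hypothetical stand-ins, not DnR's actual components.

```python
# A hedged sketch of a draft-and-refine loop; the reliance metric here is an assumption.
import torch
import torch.nn.functional as F

def visual_reliance(answer_logits_with_image, answer_logits_without_image):
    """KL divergence between the answer distributions with and without the image,
    for the same question; a small value suggests the draft ignored the image."""
    p = F.log_softmax(answer_logits_with_image, dim=-1)
    q = F.log_softmax(answer_logits_without_image, dim=-1)
    return F.kl_div(q, p, reduction="sum", log_target=True).item()

def draft_and_refine(draft_answer, reliance, threshold=0.5, refine_fn=None):
    """Keep the draft if it is sufficiently image-dependent; otherwise hand it to
    visual experts for revision (refine_fn is a stand-in for that call)."""
    if reliance >= threshold or refine_fn is None:
        return draft_answer
    return refine_fn(draft_answer)

# Toy usage with random logits standing in for the model's answer distributions.
vocab = 32000
with_img, without_img = torch.randn(vocab), torch.randn(vocab)
score = visual_reliance(with_img, without_img)
print(draft_and_refine("a red car parked by the door", score,
                       refine_fn=lambda a: a + " (revised with detector evidence)"))
```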