Causal Tracing of Object Representations in Large Vision Language Models: Mechanistic Interpretability and Hallucination Mitigation

arXiv — cs.CV · Wednesday, November 12, 2025 at 5:00:00 AM
The Fine-grained Cross-modal Causal Tracing (FCCT) framework marks a significant advance in the mechanistic interpretability of Large Vision-Language Models (LVLMs). Prior analyses have been insufficient, failing to examine comprehensively how visual and textual tokens interact across model components and layers. FCCT systematically quantifies causal effects on visual object perception, revealing that multi-head self-attention (MHSA) in the middle layers is critical for aggregating cross-modal information, while feed-forward networks (FFNs) exhibit a hierarchical progression in managing visual object representations. Beyond deepening our understanding of LVLMs, these findings inform strategies for hallucination mitigation, improving the reliability of model outputs in practical applications.
— via World Pulse Now AI Editorial System
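The article does not reproduce the FCCT formulation itself, so the sketch below only follows the standard causal-tracing recipe that such analyses build on: run the model on a corrupted image, restore ("patch") the clean activation of one component (e.g., MHSA or FFN output at a given layer) into the corrupted run, and score how much of the clean-run object probability is recovered. The site grid and the probability values are illustrative assumptions, not figures from the paper.

```python
def indirect_effect(p_clean: float, p_corrupt: float, p_patched: float) -> float:
    """Causal-tracing score for one (layer, component) site.

    p_clean   : P(object token) with the original image
    p_corrupt : P(object token) with the image corrupted/ablated
    p_patched : P(object token) on the corrupted run, but with this site's
                clean activation restored ("patched") back in
    Returns the fraction of the clean-vs-corrupt gap that patching recovers.
    """
    gap = max(p_clean - p_corrupt, 1e-8)
    return (p_patched - p_corrupt) / gap


# Hypothetical measurements over a small grid of sites: a strong effect at
# middle-layer MHSA would show up as a value close to 1.0 there.
measurements = {
    ("layer_12", "mhsa"): dict(p_clean=0.82, p_corrupt=0.05, p_patched=0.64),
    ("layer_12", "ffn"):  dict(p_clean=0.82, p_corrupt=0.05, p_patched=0.21),
    ("layer_28", "ffn"):  dict(p_clean=0.82, p_corrupt=0.05, p_patched=0.47),
}
for site, m in measurements.items():
    print(site, round(indirect_effect(**m), 2))
```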


Recommended Readings
Curing Semantic Drift: A Dynamic Approach to Grounding Generation in Large Vision-Language Models
Positive · Artificial Intelligence
Large Vision-Language Models (LVLMs) often experience 'semantic drift', a phenomenon where they progressively detach from visual input, leading to hallucinations. Current training-free decoding strategies have limitations, including high computational costs and reliance on unreliable proxies. The introduction of Dynamic Logits Calibration (DLC) offers a novel, efficient solution to this issue. DLC operates in real-time, performing visual alignment checks to ensure that the generated outputs remain grounded in visual evidence.
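The summary does not spell out how DLC's alignment check feeds back into decoding, so the snippet below is only a generic stand-in: given a per-step visual alignment score, it pushes the next-token logits toward the image-conditioned distribution and away from a text-only one (a contrastive-style correction), more strongly the weaker the grounding. The `alignment` input, `alpha`, and the blending rule are assumptions, not the paper's calibration method.

```python
import numpy as np

def calibrate_logits(logits_img: np.ndarray, logits_text: np.ndarray,
                     alignment: float, alpha: float = 1.0) -> np.ndarray:
    """Push decoding toward image-grounded evidence when alignment is weak.

    logits_img  : next-token logits conditioned on (image, text)
    logits_text : next-token logits conditioned on the text prefix only
    alignment   : result of the per-step visual alignment check, in [0, 1]
                  (1 = output still grounded, 0 = drifting)
    """
    weight = alpha * (1.0 - alignment)                 # stronger correction when drifting
    return logits_img + weight * (logits_img - logits_text)

# Toy decoding step over an 8-token vocabulary with a weak alignment check.
rng = np.random.default_rng(0)
logits_img, logits_text = rng.normal(size=8), rng.normal(size=8)
print(calibrate_logits(logits_img, logits_text, alignment=0.3).round(2))
```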
Draft and Refine with Visual Experts
Positive · Artificial Intelligence
Recent advancements in Large Vision-Language Models (LVLMs) reveal their strong multimodal reasoning capabilities. However, these models often generate ungrounded or hallucinated responses due to an overreliance on linguistic priors rather than visual evidence. To address this issue, a new framework called Draft and Refine (DnR) has been proposed, which utilizes a question-conditioned metric to quantify the model's reliance on visual information, enhancing the accuracy and reliability of responses.
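The DnR entry describes the pipeline only at a high level: draft an answer, score the model's visual reliance conditioned on the question, and refine with visual-expert evidence when that reliance is low. The sketch below captures just that control flow with caller-supplied functions; the threshold `tau` and the stub callables are placeholders, not the framework's actual components.

```python
from typing import Callable

def draft_and_refine(
    draft_fn: Callable,      # (question, image) -> draft answer
    reliance_fn: Callable,   # (question, image, draft) -> visual-reliance score in [0, 1]
    refine_fn: Callable,     # (question, image, draft) -> refined answer
    question: str,
    image,
    tau: float = 0.5,
) -> str:
    """Keep the draft when it already relies on the visual evidence;
    otherwise hand it back for refinement with expert visual cues."""
    draft = draft_fn(question, image)
    reliance = reliance_fn(question, image, draft)
    return draft if reliance >= tau else refine_fn(question, image, draft)

# Toy stubs standing in for the LVLM and the visual experts.
answer = draft_and_refine(
    draft_fn=lambda q, img: "a red bus",
    reliance_fn=lambda q, img, d: 0.2,           # weakly grounded draft
    refine_fn=lambda q, img, d: d + " (revised with detector evidence)",
    question="What vehicle is in the image?",
    image=None,
)
print(answer)
```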
PAS: Prelim Attention Score for Detecting Object Hallucinations in Large Vision-Language Models
Positive · Artificial Intelligence
Large vision-language models (LVLMs) are increasingly recognized for their capabilities, but they face challenges due to object hallucinations. This study reveals that LVLMs often disregard the actual image and instead depend on previously generated output tokens to predict new objects. The research quantifies this behavior by analyzing the mutual information between the image and the predicted object, highlighting a strong correlation between weak image dependence and hallucination. The authors introduce the Prelim Attention Score (PAS), a novel, lightweight metric that can detect object hallucinations effectively without additional training.
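The exact PAS formula (which layers and how attention is pooled) is not given in the summary, so the function below is only a plausible proxy in the same spirit: at the step where the model is about to emit an object token, measure how much attention mass falls on the image tokens; a low score indicates weak image dependence and flags a likely hallucinated object. The head-averaging, the threshold, and the array layout are assumptions.

```python
import numpy as np

def prelim_attention_score(attn_last_pos: np.ndarray, image_token_idx: np.ndarray) -> float:
    """Attention mass on image tokens at the current prediction step,
    averaged over heads. attn_last_pos: [heads, seq_len], each row sums to 1."""
    return float(attn_last_pos[:, image_token_idx].sum(axis=-1).mean())

def flag_hallucination(attn_last_pos: np.ndarray, image_token_idx: np.ndarray,
                       threshold: float = 0.15) -> bool:
    """Weak image dependence (low PAS-style score) => likely hallucinated object."""
    return prelim_attention_score(attn_last_pos, image_token_idx) < threshold

# Toy check: 8 heads over a 20-token context whose first 12 tokens are image tokens.
rng = np.random.default_rng(1)
attn = rng.dirichlet(np.ones(20), size=8)
print(flag_hallucination(attn, np.arange(12)))
```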