Lost in Translation and Noise: A Deep Dive into the Failure Modes of VLMs on Real-World Tables

arXiv — cs.CV · Monday, November 24, 2025 at 5:00:00 AM
  • The introduction of MirageTVQA, a new benchmark for evaluating Vision-Language Models (VLMs), highlights a significant gap in existing datasets, which primarily focus on monolingual, visually perfect tables. The benchmark includes nearly 60,000 QA pairs across 24 languages and incorporates realistic noise to better reflect real-world scenarios.
  • The development of MirageTVQA is crucial as it aims to bridge the gap between research and practical applications of VLMs, addressing the severe performance degradation observed in leading models when faced with visual noise and multilingual contexts.
  • This initiative underscores a broader concern within the AI community regarding the limitations of current evaluation metrics and benchmarks, which often overlook the complexities of real-world data. The focus on improving robustness against misleading inputs and enhancing reasoning capabilities in VLMs reflects ongoing efforts to create more reliable and versatile AI systems.
— via World Pulse Now AI Editorial System

Continue Reading
GraphFusionSBR: Denoising Multi-Channel Graphs for Session-Based Recommendation
Positive · Artificial Intelligence
A new model named GraphFusionSBR has been introduced to enhance session-based recommendation systems by effectively capturing implicit user intents while addressing issues like item interaction dominance and noisy sessions. This model integrates multiple channels, including knowledge graphs and hypergraphs, to improve recommendation accuracy across various domains such as e-commerce and multimedia.
Modeling LLM Agent Reviewer Dynamics in Elo-Ranked Review System
Neutral · Artificial Intelligence
A recent study has investigated the dynamics of Large Language Model (LLM) agent reviewers within an Elo-ranked review system, utilizing real-world conference paper submissions. The research involved multiple LLM reviewers with distinct personas engaging in multi-round review interactions, moderated by an Area Chair, and highlighted the impact of Elo ratings and reviewer memory on decision-making accuracy.
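The study's exact rating variant is not described in this summary; for readers unfamiliar with Elo ranking, a minimal sketch of the standard Elo update rule (an assumption here, not necessarily the formulation used in the paper) looks like this:

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one pairwise outcome.

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if B wins.
    k controls how strongly a single result moves the ratings.
    """
    e_a = elo_expected(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two equally rated reviewers: a win shifts 16 points at k=32.
print(elo_update(1500.0, 1500.0, 1.0))  # → (1516.0, 1484.0)
```

In a review setting, a "game" would be any head-to-head comparison of reviewer judgments against a reference decision, with the Area Chair's verdict supplying the outcome signal.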
VMMU: A Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark
Neutral · Artificial Intelligence
The introduction of VMMU, a Vietnamese Multitask Multimodal Understanding and Reasoning Benchmark, aims to assess the capabilities of vision-language models (VLMs) in interpreting and reasoning over visual and textual information in Vietnamese. This benchmark includes 2.5k multimodal questions across seven diverse tasks, emphasizing genuine multimodal integration rather than text-only cues.
Subspace Alignment for Vision-Language Model Test-time Adaptation
Positive · Artificial Intelligence
A new approach called SubTTA has been proposed to enhance test-time adaptation (TTA) for Vision-Language Models (VLMs), addressing vulnerabilities to distribution shifts that can misguide adaptation through unreliable zero-shot predictions. SubTTA aligns the semantic subspaces of visual and textual modalities to improve the accuracy of predictions during adaptation.
Route, Retrieve, Reflect, Repair: Self-Improving Agentic Framework for Visual Detection and Linguistic Reasoning in Medical Imaging
Positive · Artificial Intelligence
A new framework named R^4 has been proposed to enhance medical image analysis by integrating Vision-Language Models (VLMs) into a multi-agent system that includes a Router, Retriever, Reflector, and Repairer, specifically focusing on chest X-ray analysis. This approach aims to improve reasoning, safety, and spatial grounding in medical imaging workflows.
REVNET: Rotation-Equivariant Point Cloud Completion via Vector Neuron Anchor Transformer
Positive · Artificial Intelligence
The introduction of the Rotation-Equivariant Anchor Transformer (REVNET) aims to enhance point cloud completion by addressing the limitations of existing methods that struggle with arbitrary rotations. This novel framework utilizes Vector Neuron networks to predict missing data in point clouds, which is crucial for applications relying on accurate 3D representations.
Zero-Shot Distracted Driver Detection via Vision Language Models with Double Decoupling
Positive · Artificial Intelligence
A new study has introduced a subject-decoupling framework for zero-shot distracted-driver detection using Vision Language Models (VLMs). The approach aims to improve detection accuracy by separating appearance factors from behavioral cues, addressing a significant limitation in existing VLM-based systems.
Linus Torvalds has started vibe coding, just not on Linux
Neutral · Artificial Intelligence
Linus Torvalds has initiated a new project named AudioNoise, which focuses on digital audio effects and signal processing, and is available on his GitHub. This project stems from his previous hardware experiment, GuitarPedal, where he created homemade guitar effects pedals to deepen his understanding of audio technology.