Lost in Translation and Noise: A Deep Dive into the Failure Modes of VLMs on Real-World Tables

arXiv — cs.CV · Monday, November 24, 2025 at 5:00:00 AM
  • The introduction of MirageTVQA, a new benchmark for evaluating Vision-Language Models (VLMs) on table question answering, highlights a significant gap in existing datasets, which focus primarily on monolingual and visually perfect tables. The benchmark includes nearly 60,000 QA pairs across 24 languages and incorporates realistic visual noise to better reflect real-world scenarios (a minimal sketch of this kind of degradation follows the summary below).
  • The development of MirageTVQA is crucial as it aims to bridge the gap between research and practical applications of VLMs, addressing the severe performance degradation observed in leading models when faced with visual noise and multilingual contexts.
  • This initiative underscores a broader concern within the AI community regarding the limitations of current evaluation metrics and benchmarks, which often overlook the complexities of real-world data. The focus on improving robustness against misleading inputs and enhancing reasoning capabilities in VLMs reflects ongoing efforts to create more reliable and versatile AI systems.
— via World Pulse Now AI Editorial System
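
The kind of degradation the benchmark targets can be approximated with simple image perturbations. The following is a minimal sketch, not MirageTVQA's actual pipeline: it applies Gaussian pixel noise and low-quality JPEG re-encoding to a rendered table image using Pillow and NumPy; the function name and parameter values are illustrative assumptions.

```python
# Minimal sketch (not MirageTVQA's pipeline): simulate "real-world"
# degradation of a clean table image before feeding it to a VLM.
import io
import numpy as np
from PIL import Image

def degrade_table_image(img: Image.Image,
                        noise_std: float = 12.0,
                        jpeg_quality: int = 30) -> Image.Image:
    """Add Gaussian pixel noise, then re-encode at low JPEG quality."""
    arr = np.asarray(img.convert("RGB"), dtype=np.float32)
    arr += np.random.normal(0.0, noise_std, arr.shape)      # sensor-style noise
    noisy = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

    buf = io.BytesIO()
    noisy.save(buf, format="JPEG", quality=jpeg_quality)     # compression artifacts
    buf.seek(0)
    return Image.open(buf).convert("RGB")

# Usage: compare a VLM's answers on the clean vs. degraded rendering of the same table.
# clean = Image.open("table.png")
# noisy = degrade_table_image(clean)
```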


Continue Reading
MultiPriv: Benchmarking Individual-Level Privacy Reasoning in Vision-Language Models
Neutral · Artificial Intelligence
The introduction of MultiPriv marks a significant advancement in the evaluation of individual-level privacy reasoning within Vision-Language Models (VLMs). This benchmark addresses the inadequacies of current privacy assessments, which primarily focus on privacy perception rather than the ability of VLMs to link distributed information and construct individual profiles. The framework includes a novel bilingual multimodal dataset that features synthetic individual profiles linked to sensitive attributes.
MMT-ARD: Multimodal Multi-Teacher Adversarial Distillation for Robust Vision-Language Models
Positive · Artificial Intelligence
A new framework called MMT-ARD has been proposed to enhance the robustness of Vision-Language Models (VLMs) through a Multimodal Multi-Teacher Adversarial Distillation approach. This method addresses the limitations of traditional single-teacher distillation by incorporating a dual-teacher knowledge fusion architecture, which optimizes both clean feature preservation and robust feature enhancement.
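
In broad strokes, a dual-teacher setup blends guidance from a clean teacher on unperturbed inputs with guidance from a robust teacher on adversarial inputs. The sketch below illustrates that general pattern in PyTorch; it is not the MMT-ARD implementation, and the fixed weighting and temperature are assumptions.

```python
# Illustrative dual-teacher distillation loss (general pattern, not MMT-ARD itself).
import torch
import torch.nn.functional as F

def dual_teacher_loss(student, clean_teacher, robust_teacher,
                      x_clean, x_adv, alpha=0.5, temperature=2.0):
    """Blend KL distillation from a clean teacher (on clean inputs)
    and a robust teacher (on adversarial inputs)."""
    t = temperature
    with torch.no_grad():
        p_clean = F.softmax(clean_teacher(x_clean) / t, dim=-1)
        p_robust = F.softmax(robust_teacher(x_adv) / t, dim=-1)

    log_s_clean = F.log_softmax(student(x_clean) / t, dim=-1)
    log_s_adv = F.log_softmax(student(x_adv) / t, dim=-1)

    loss_clean = F.kl_div(log_s_clean, p_clean, reduction="batchmean") * t * t
    loss_robust = F.kl_div(log_s_adv, p_robust, reduction="batchmean") * t * t
    return alpha * loss_clean + (1.0 - alpha) * loss_robust
```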
QuantFace: Efficient Quantization for Face Restoration
Positive · Artificial Intelligence
A novel low-bit quantization framework named QuantFace has been introduced to enhance face restoration models, which have been limited by heavy computational demands. This framework quantizes full-precision weights and activations from 32-bit to 4-6-bit, employing techniques like rotation-scaling channel balancing and Quantization-Distillation Low-Rank Adaptation (QD-LoRA) to optimize performance.
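
As a rough illustration of what low-bit weight quantization involves (independent of QuantFace's rotation-scaling channel balancing and QD-LoRA components), the sketch below applies per-channel symmetric fake quantization to a weight tensor; the function name and the symmetric-quantizer choice are assumptions, not the paper's method.

```python
# Minimal per-channel symmetric weight quantization sketch (not QuantFace itself).
import torch

def quantize_weights(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Fake-quantize a [out_channels, ...] weight tensor to n_bits per output channel."""
    qmax = 2 ** (n_bits - 1) - 1                      # e.g. 7 for 4-bit signed
    w_flat = w.reshape(w.shape[0], -1)
    scale = w_flat.abs().max(dim=1, keepdim=True).values / qmax
    scale = scale.clamp(min=1e-8)
    q = torch.round(w_flat / scale).clamp(-qmax - 1, qmax)
    return (q * scale).reshape_as(w)                  # dequantized ("fake quant") weights

# Usage: replace a layer's weights with their 4-bit fake-quantized version.
# layer.weight.data = quantize_weights(layer.weight.data, n_bits=4)
```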
Draft and Refine with Visual Experts
Positive · Artificial Intelligence
Recent advancements in Large Vision-Language Models (LVLMs) have led to the introduction of the Draft and Refine (DnR) framework, which enhances the models' reasoning capabilities by quantifying their reliance on visual evidence through a question-conditioned utilization metric. This approach aims to reduce ungrounded or hallucinated responses by refining initial drafts with targeted feedback from visual experts.
PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly
Neutral · Artificial Intelligence
PhyBlock has been introduced as a progressive benchmark aimed at evaluating vision-language models (VLMs) on their physical understanding and planning capabilities through robotic 3D block assembly tasks. This benchmark features a four-level cognitive hierarchy assembly task and includes 2,600 tasks to assess models on spatial reasoning and physical comprehension.
Comprehensive Evaluation of Prototype Neural Networks
Neutral · Artificial Intelligence
A comprehensive evaluation of prototype neural networks has been conducted, focusing on models such as ProtoPNet, ProtoPool, and PIPNet. The study applies a variety of metrics, including new ones proposed by the authors, to assess model interpretability across diverse datasets, including fine-grained and multi-label classification tasks. The code for these evaluations is available as an open-source library on GitHub.
MOCHA: Multi-modal Objects-aware Cross-arcHitecture Alignment
Positive · Artificial Intelligence
MOCHA, a new distillation framework, has been introduced to enhance personalized object detection by transferring multimodal knowledge from a frozen vision-language model (VLM) to a lightweight vision-only detector. This approach enables the effective recognition of user-specific instances from minimal examples without requiring modifications to the teacher model during inference.
Reproducibility Report: Test-Time Training on Nearest Neighbors for Large Language Models
Positive · Artificial Intelligence
A recent reproducibility report confirms the effectiveness of Test-Time Training on Nearest Neighbors for Large Language Models, demonstrating that fine-tuning language models like GPT-2 and GPT-Neo during inference can significantly reduce perplexity across various datasets, particularly in specialized domains such as GitHub and EuroParl.
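
The underlying idea is simple to state: at inference time, retrieve the test input's nearest neighbors from a reference corpus and take a few gradient steps on them before predicting. The sketch below shows that loop with Hugging Face Transformers; the retrieval step, optimizer, and hyperparameters are placeholders rather than the paper's exact setup.

```python
# Illustrative test-time training loop (general pattern, not the paper's exact setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def test_time_train(model, tokenizer, neighbor_texts, lr=1e-5, steps=1):
    """Briefly fine-tune on retrieved nearest-neighbor texts before evaluating."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        for text in neighbor_texts:
            batch = tokenizer(text, return_tensors="pt", truncation=True)
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    model.eval()
    return model

# model = AutoModelForCausalLM.from_pretrained("gpt2")
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# neighbors = retrieve_neighbors(test_prompt)   # hypothetical retrieval over a corpus index
# test_time_train(model, tokenizer, neighbors)
```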