MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

arXiv — cs.CV · Wednesday, November 19, 2025 at 5:00:00 AM
  • MVI-Bench is a benchmark for evaluating the robustness of Large Vision-Language Models (LVLMs) to misleading visual inputs.
  • The introduction of MVI-Bench gives researchers a standardized way to measure how these models behave when the image itself, rather than the accompanying text, is the misleading element.
  • This development reflects a broader trend in AI research toward comprehensive evaluation methods that consider both visual and textual inputs, and it underscores the ongoing challenge of keeping AI systems reliable across diverse contexts.
— via World Pulse Now AI Editorial System


Recommended Readings
DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding
Positive · Artificial Intelligence
DocSLM is a Small Vision-Language Model designed for efficient long-document understanding, addressing the limitations of Large Vision-Language Models (LVLMs) that require substantial memory. It features a Hierarchical Multimodal Compressor that encodes visual, textual, and layout information into a compact sequence, reducing memory usage while maintaining semantic integrity. Additionally, a Streaming Abstention mechanism allows for scalable processing of lengthy documents by filtering low-confidence responses.
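The summary does not spell out how the abstention step works; the sketch below is a hypothetical, minimal reading of it, in which each document segment's answer carries a confidence score and low-confidence segments are simply dropped. All names and the threshold are illustrative, not DocSLM's actual interface.

    # Hypothetical sketch of a streaming abstention filter: answers produced for
    # individual document segments are kept only if their confidence clears a
    # threshold, so low-confidence content never accumulates in memory.
    from dataclasses import dataclass
    from typing import Iterable, List

    @dataclass
    class SegmentAnswer:
        text: str
        confidence: float  # e.g. an average token probability in [0, 1]

    def stream_with_abstention(answers: Iterable[SegmentAnswer],
                               threshold: float = 0.5) -> List[SegmentAnswer]:
        """Keep only the segment-level responses the model is confident about."""
        kept = []
        for ans in answers:
            if ans.confidence >= threshold:
                kept.append(ans)
            # segments below the threshold are abstained from (discarded)
        return kept

    if __name__ == "__main__":
        demo = [SegmentAnswer("The contract term is 24 months.", 0.81),
                SegmentAnswer("Clause on page 57 is unreadable.", 0.22)]
        print([a.text for a in stream_with_abstention(demo)])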
Deep Equilibrium models for Poisson Imaging Inverse problems via Mirror Descent
Positive · Artificial Intelligence
Deep Equilibrium Models (DEQs) are implicit neural networks that have recently been applied to image regularization, particularly in Gaussian fidelity contexts. This study extends DEQs to Poisson inverse problems, utilizing the Kullback–Leibler divergence for data fidelity. A novel DEQ formulation based on Mirror Descent is introduced, adapting to the data term's structure. The research establishes sufficient conditions and convergence results using the Kurdyka–Lojasiewicz framework for subanalytic functions.
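As a rough sketch only (the specific mirror map ψ, step size τ, and learned regularizer R_θ below are assumptions, not details given in the summary), the Poisson data-fidelity term and the kind of mirror-descent fixed point a DEQ would solve can be written as:

    D_{\mathrm{KL}}(Ax;\,y) \;=\; \sum_i \big[(Ax)_i - y_i \log (Ax)_i\big] + \text{const},
    x^{k+1} \;=\; \nabla\psi^{*}\!\big(\nabla\psi(x^{k}) - \tau\,\nabla_x\big[D_{\mathrm{KL}}(Ax^{k};\,y) + R_\theta(x^{k})\big]\big),
    x^{\star} \;=\; T_\theta(x^{\star},\, y),

where ψ is a mirror map adapted to the positivity constraints of Poisson data, and the DEQ solves the last equation implicitly instead of unrolling a fixed number of iterations.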
What happens when nanochat meets DiLoCo?
Neutral · Artificial Intelligence
The article discusses the integration of the DiLoCo algorithm with nanochat, a compact ChatGPT-style implementation. The integration aims to improve training efficiency in distributed environments where communication is constrained. By applying DiLoCo as a lightweight wrapper around nanochat's training loop, the researchers can significantly reduce communication overhead by allowing multiple local training steps before synchronization. This approach is compared to a standard data-parallel setup, highlighting the potential for improved model training in resource-limited scenarios.
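A minimal single-process sketch of that loop is below, assuming a DiLoCo-style scheme in which each replica takes a number of local optimizer steps and an outer step then moves the global weights toward the averaged local weights. The optimizers, learning rates, and function names are illustrative, not taken from the article.

    # Illustrative DiLoCo-style round: H local AdamW steps per replica, then one
    # outer update from the averaged "pseudo-gradient" (initial minus local weights).
    # Assumes all parameters/buffers are floating point.
    import torch
    from torch import nn

    def diloco_round(global_model: nn.Module, replicas, data_iters,
                     local_steps: int = 50, outer_lr: float = 0.7):
        init = {k: v.detach().clone() for k, v in global_model.state_dict().items()}
        local_states = []
        for model, data in zip(replicas, data_iters):
            model.load_state_dict(init)
            inner_opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
            for _ in range(local_steps):          # communication-free local steps
                x, y = next(data)
                loss = nn.functional.cross_entropy(model(x), y)
                inner_opt.zero_grad()
                loss.backward()
                inner_opt.step()
            local_states.append({k: v.detach().clone()
                                 for k, v in model.state_dict().items()})
        new_state = {}
        for k in init:                            # single synchronization point
            avg_local = torch.stack([s[k] for s in local_states]).mean(dim=0)
            pseudo_grad = init[k] - avg_local
            new_state[k] = init[k] - outer_lr * pseudo_grad
        global_model.load_state_dict(new_state)

Relative to synchronizing after every batch, communication happens only once per round here, which is roughly where the claimed overhead reduction comes from.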
Higher-Order Transformers With Kronecker-Structured Attention
Positive · Artificial Intelligence
The paper introduces the Higher-Order Transformer (HOT), a novel attention framework designed to handle high-dimensional, multiway tensor data. Traditional Transformers struggle with such data due to computational inefficiencies and the need to flatten inputs, which disrupts tensor structures. HOT utilizes Kronecker products to represent multiway attention, efficiently capturing relationships across dimensions while maintaining tensor integrity. Experiments demonstrate HOT's competitive performance on 2D and 3D datasets, retaining the expressiveness of full high-order attention.
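The summary does not give the exact formulation, but a simplified way to picture Kronecker-structured attention on a 2D token grid is mode-wise attention: attend along each axis separately, so the combined operator factorizes like a Kronecker product of small per-mode attention matrices instead of one (H·W)×(H·W) matrix. The sketch below, with identity projections and no learned weights, only illustrates that structure.

    # Simplified mode-wise attention on a (batch, H, W, d) token grid: attention
    # within each row, then within each column, instead of over all H*W positions.
    import torch

    def mode_attention(x: torch.Tensor, dim: int) -> torch.Tensor:
        """Scaled dot-product self-attention along one tensor mode."""
        x = x.transpose(dim, -2)                  # bring the chosen mode next to features
        q = k = v = x                             # untrained sketch: identity projections
        scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
        out = torch.softmax(scores, dim=-1) @ v
        return out.transpose(dim, -2)

    def kronecker_attention_2d(x: torch.Tensor) -> torch.Tensor:
        """x: (batch, H, W, d). Attend along W (rows), then along H (columns)."""
        x = mode_attention(x, dim=2)              # attention within each row
        x = mode_attention(x, dim=1)              # attention within each column
        return x

    if __name__ == "__main__":
        print(kronecker_attention_2d(torch.randn(2, 8, 8, 16)).shape)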
Start Small, Think Big: Curriculum-based Relative Policy Optimization for Visual Grounding
Positive · Artificial Intelligence
The article presents a novel training strategy called Curriculum-based Relative Policy Optimization (CuRPO) aimed at improving Visual Grounding tasks. It highlights the limitations of Chain-of-Thought (CoT) prompting, particularly when outputs become lengthy or complex, which can degrade performance. The study also shows that simply increasing dataset size does not guarantee better results because example complexity varies. CuRPO uses CoT length and generalized Intersection over Union (gIoU) rewards to order training data progressively from simpler to more challenging examples, demonstrating its effectiveness on Visual Grounding tasks.
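A hypothetical version of the curriculum step might rank examples by a difficulty score that grows with chain-of-thought length and shrinks with the gIoU reward, then train on the easier examples first. The weights and field names below are placeholders, not the paper's definitions.

    # Illustrative curriculum ordering: longer CoT and lower gIoU reward are
    # treated as "harder", and training data is sorted from easy to hard.
    from typing import Dict, List

    def difficulty(example: Dict, w_len: float = 1.0, w_giou: float = 1.0) -> float:
        cot_tokens = len(example["cot"].split())   # longer reasoning -> harder
        giou = example["giou"]                     # in [-1, 1]; higher -> easier
        return w_len * cot_tokens - w_giou * giou

    def build_curriculum(examples: List[Dict]) -> List[Dict]:
        return sorted(examples, key=difficulty)    # simpler examples first

    if __name__ == "__main__":
        data = [{"cot": "locate the red cup left of the plate", "giou": 0.2},
                {"cot": "the cup", "giou": 0.9}]
        print([ex["cot"] for ex in build_curriculum(data)])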
CORE: Compact Object-centric REpresentations as a New Paradigm for Token Merging in LVLMs
Positive · Artificial Intelligence
The article introduces CORE (Compact Object-centric REpresentations), a novel approach to visual token compression in Large Vision-Language Models (LVLMs). Traditional token compression methods often struggle with high computational and memory costs because the number of visual tokens grows quadratically with image resolution. CORE uses an efficient segmentation decoder to produce object masks, which provide a semantic basis for merging visual tokens into compact representations. Additionally, a centroid-guided sorting mechanism maintains the spatial order of the merged tokens, enhancing the overall coherence of the compressed sequence.
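A rough sketch of the merging idea, under the assumption that tokens inside each object mask are mean-pooled into a single token and the merged tokens are then ordered by mask centroid in raster order (the function names and pooling choice are illustrative):

    # Object-centric token merging sketch: pool the visual tokens covered by each
    # object mask into one token, then sort merged tokens by mask centroid so the
    # compressed sequence keeps a raster-like spatial order.
    import torch

    def merge_tokens_by_mask(tokens: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        """tokens: (H, W, d) grid; masks: (M, H, W) boolean object masks."""
        H, W, _ = tokens.shape
        ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        merged, keys = [], []
        for m in masks:
            if m.sum() == 0:
                continue
            merged.append(tokens[m].mean(dim=0))                 # pool tokens in the mask
            cy, cx = ys[m].float().mean(), xs[m].float().mean()  # mask centroid
            keys.append((cy * W + cx).item())                    # raster-order key
        order = sorted(range(len(merged)), key=lambda i: keys[i])
        return torch.stack([merged[i] for i in order])

    if __name__ == "__main__":
        toks = torch.randn(16, 16, 64)
        msks = torch.zeros(2, 16, 16, dtype=torch.bool)
        msks[0, :8, :8] = True
        msks[1, 8:, 8:] = True
        print(merge_tokens_by_mask(toks, msks).shape)  # (2, 64)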
MoHoBench: Assessing Honesty of Multimodal Large Language Models via Unanswerable Visual Questions
Neutral · Artificial Intelligence
MoHoBench is a newly developed benchmark aimed at assessing the honesty of Multimodal Large Language Models (MLLMs) when confronted with unanswerable visual questions. Despite advancements in vision-language tasks, MLLMs often produce unreliable content. This study systematically evaluates the honesty of 28 popular MLLMs using a dataset of over 12,000 visual questions, revealing that many models struggle to provide honest responses. The findings highlight the need for improved trustworthiness in AI systems.
Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data
Positive · Artificial Intelligence
Supervised Fine-Tuning (SFT) is essential for adapting Large Language Models (LLMs) to specialized fields like medical reasoning. Current SFT methods often utilize unfiltered datasets, which can be redundant and of low quality, leading to high computational costs and poor performance. This study introduces a new data selection strategy called Difficulty-Influence Quadrant (DIQ), which aims to optimize sample selection based on both difficulty and optimization utility, enhancing the efficiency of medical reasoning applications.
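The summary leaves the selection rule abstract; one plausible, simplified reading of a "quadrant" strategy is to score every sample on difficulty and on estimated optimization utility, then keep the quadrant that is high on both. The scoring inputs below are placeholders rather than the paper's actual estimators.

    # Illustrative quadrant-style data selection: split samples by the medians of
    # a difficulty score and an influence (optimization-utility) score, and keep
    # the high-difficulty / high-influence quadrant for fine-tuning.
    import statistics
    from typing import Dict, List

    def select_quadrant(samples: List[Dict]) -> List[Dict]:
        d_med = statistics.median(s["difficulty"] for s in samples)
        i_med = statistics.median(s["influence"] for s in samples)
        return [s for s in samples
                if s["difficulty"] >= d_med and s["influence"] >= i_med]

    if __name__ == "__main__":
        pool = [{"id": k, "difficulty": k % 5, "influence": (k * 7) % 5}
                for k in range(20)]
        print(len(select_quadrant(pool)))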