Verb Mirage: Unveiling and Assessing Verb Concept Hallucinations in Multimodal Large Language Models

arXiv — cs.CV · Thursday, November 20, 2025 at 5:00:00 AM
  • The study highlights the issue of verb hallucination in Multimodal Large Language Models (MLLMs), marking a significant step in understanding their limitations.
  • Addressing verb hallucination is crucial for improving the reliability of MLLMs, as verbs are essential for interpreting human actions and interactions.
  • This research contributes to ongoing discussions about the robustness of MLLMs, emphasizing the need for comprehensive evaluation methods that address both object and verb hallucinations.
— via World Pulse Now AI Editorial System


Recommended Readings
Physics-Based Benchmarking Metrics for Multimodal Synthetic Images
Neutral · Artificial Intelligence
The paper presents a new metric called Physics-Constrained Multimodal Data Evaluation (PCMDE) aimed at improving the evaluation of multimodal synthetic images. Current metrics such as BLEU and CIDEr often fail to capture semantic and structural correctness, particularly in specialized domains. PCMDE integrates large language models with reasoning and vision-language models to enhance feature extraction, validation, and physics-guided reasoning.
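To make the stated limitation concrete, here is a minimal sketch (plain Python, not part of the PCMDE paper) of the clipped n-gram precision used inside BLEU: a physically implausible caption that reuses the reference's words scores as high as, or higher than, a correct one, which is the gap a physics-constrained metric targets. The example captions are invented for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision, the quantity BLEU averages over n."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

reference   = "a glass of water resting on a wooden table"
plausible   = "a glass of water sitting on a wooden table"
implausible = "a wooden table resting on a glass of water"  # physically wrong scene

for name, cand in [("plausible", plausible), ("implausible", implausible)]:
    p1 = modified_precision(cand, reference, 1)
    p2 = modified_precision(cand, reference, 2)
    print(f"{name:11s}  unigram={p1:.2f}  bigram={p2:.2f}")
# The implausible caption reuses the same words, so n-gram overlap stays high
# even though the described scene violates basic physics.
```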
Context Cascade Compression: Exploring the Upper Limits of Text Compression
Positive · Artificial Intelligence
The research introduces Context Cascade Compression (C3), a method designed to tackle the challenges posed by million-level token inputs in long-context tasks for Large Language Models (LLMs). C3 utilizes two LLMs of varying sizes, where a smaller model compresses text into latent tokens, followed by a larger model that decodes this compressed context. Preliminary experiments indicate that C3 achieves a 20x compression ratio with 98% decoding accuracy.
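A minimal sketch of the two-stage idea described above: a small model compresses a long context into far fewer latent tokens, and a larger model decodes while attending only to those latents. The toy PyTorch modules, pooling-based compressor, and 20:1 ratio below are illustrative assumptions, not C3's implementation.

```python
import torch
import torch.nn as nn

class LatentCompressor(nn.Module):
    """Toy stand-in for the small LLM: maps a long hidden-state sequence
    to a much shorter sequence of latent tokens (here via mean pooling)."""
    def __init__(self, d_model=256, ratio=20):
        super().__init__()
        self.ratio = ratio
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, hidden):                    # hidden: (batch, seq_len, d_model)
        b, t, d = hidden.shape
        t_trim = (t // self.ratio) * self.ratio
        pooled = hidden[:, :t_trim].reshape(b, -1, self.ratio, d).mean(dim=2)
        return self.proj(pooled)                  # (batch, seq_len // ratio, d_model)

class LatentDecoder(nn.Module):
    """Toy stand-in for the large LLM: cross-attends over the latent tokens
    instead of over the full original context."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, queries, latents):
        out, _ = self.attn(queries, latents, latents)
        return out

# Usage: a 2,000-token context becomes 100 latent tokens (20x compression).
hidden  = torch.randn(1, 2000, 256)
latents = LatentCompressor()(hidden)              # (1, 100, 256)
answer  = LatentDecoder()(torch.randn(1, 16, 256), latents)
print(latents.shape, answer.shape)
```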
CompAgent: An Agentic Framework for Visual Compliance Verification
Positive · Artificial Intelligence
CompAgent is a newly proposed framework aimed at enhancing visual compliance verification in computer vision, particularly within media and advertising sectors. It addresses the limitations of existing methods that rely on deep learning models trained on manually labeled datasets. By integrating Multimodal Large Language Models (MLLMs) with various visual tools, CompAgent aims to improve the reasoning and application of compliance rules in visual content.
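An illustrative agentic loop in this spirit, not CompAgent's actual code: an MLLM is prompted with a compliance rule, picks a visual tool, and then issues a verdict grounded in the tool's output. The tool registry, the `call_mllm` stub, and all function names here are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical visual tools; in practice these would wrap detectors, OCR, etc.
def detect_logos(image_path: str) -> str:
    return "no third-party logos detected"

def read_on_image_text(image_path: str) -> str:
    return "text: 'limited time offer'"

TOOLS: Dict[str, Callable[[str], str]] = {
    "logo_detector": detect_logos,
    "ocr": read_on_image_text,
}

@dataclass
class ComplianceVerdict:
    rule: str
    compliant: bool
    evidence: List[str]

def call_mllm(prompt: str) -> str:
    """Stub for the MLLM call (assumption: any chat-style multimodal API).
    A real agent would parse tool choices and judgements from model replies."""
    return "yes" if "Compliant?" in prompt else "ocr"

def verify(image_path: str, rule: str) -> ComplianceVerdict:
    evidence = []
    # 1. Ask the MLLM which tool it needs for this rule.
    tool_name = call_mllm(f"Rule: {rule}\nWhich tool do you need? {list(TOOLS)}")
    # 2. Run the chosen tool and collect its output as evidence.
    evidence.append(TOOLS[tool_name](image_path))
    # 3. Ask the MLLM for a final judgement grounded in the evidence.
    verdict = call_mllm(f"Rule: {rule}\nEvidence: {evidence}\nCompliant? yes/no")
    return ComplianceVerdict(rule, verdict.strip().lower().startswith("yes"), evidence)

print(verify("ad.png", "Ads must not show unlicensed third-party logos"))
```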
FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR Evaluation
Positive · Artificial Intelligence
FinCriticalED (Financial Critical Error Detection) is introduced as a visual benchmark for evaluating OCR and vision language models specifically on financial documents at the fact level. This benchmark addresses the challenges posed by the visually dense layouts of financial documents, where minor OCR errors can lead to significant misinterpretations. It provides 500 image-HTML pairs with expert-annotated financial facts, marking a shift from traditional metrics to a focus on factual correctness.
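Fact-level scoring of the kind described above can be pictured as matching extracted (fact, value) pairs against expert-annotated gold facts, so that a single flipped digit in a key figure counts as a critical error even if most characters are transcribed correctly. The sketch below is a plausible shape for such a check; the field names and tolerance are assumptions, not the benchmark's actual protocol.

```python
def critical_error_rate(predicted: dict, gold: dict, numeric_tol: float = 0.0) -> float:
    """Fraction of annotated gold facts that the OCR/VLM output gets wrong.

    `predicted` and `gold` map a fact name (e.g. "net_revenue_2023") to its value.
    Only facts present in `gold` are scored, mirroring fact-level evaluation.
    """
    errors = 0
    for fact, gold_value in gold.items():
        pred_value = predicted.get(fact)
        if isinstance(gold_value, (int, float)) and isinstance(pred_value, (int, float)):
            ok = abs(pred_value - gold_value) <= numeric_tol
        else:
            ok = pred_value == gold_value
        errors += 0 if ok else 1
    return errors / max(len(gold), 1)

gold = {"net_revenue_2023": 4512.0, "fiscal_year_end": "Dec 31, 2023"}
pred = {"net_revenue_2023": 4512.0, "fiscal_year_end": "Dec 31, 2028"}  # one OCR slip
print(critical_error_rate(pred, gold))  # 0.5 -- a single wrong digit flips a key fact
```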
A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models
Positive · Artificial Intelligence
A comprehensive study on visual token redundancy in discrete diffusion-based multimodal large language models (dMLLMs) has been conducted, revealing significant computational overhead during inference due to full-sequence attention. The research highlights that visual redundancy primarily occurs in from-scratch dMLLMs when addressing long-answer tasks and examines the impact of visual token pruning on model efficiency and responses.
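Visual token pruning, as examined in the study, is typically implemented by ranking visual tokens with an importance score and keeping only the top-k before the expensive attention layers. The sketch below uses attention mass from text tokens as that score; the shapes and keep ratio are illustrative, not the study's settings.

```python
import torch

def prune_visual_tokens(visual_tokens: torch.Tensor,
                        attn_from_text: torch.Tensor,
                        keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep the visual tokens that receive the most attention from text tokens.

    visual_tokens:  (batch, num_visual, dim)
    attn_from_text: (batch, num_text, num_visual) attention weights
    """
    importance = attn_from_text.mean(dim=1)                 # (batch, num_visual)
    k = max(1, int(visual_tokens.size(1) * keep_ratio))
    top_idx = importance.topk(k, dim=-1).indices            # (batch, k)
    top_idx = top_idx.sort(dim=-1).values                   # preserve spatial order
    gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1))
    return visual_tokens.gather(1, gather_idx)              # (batch, k, dim)

visual = torch.randn(2, 576, 1024)                # e.g. 24x24 patch tokens per image
attn   = torch.rand(2, 32, 576).softmax(dim=-1)   # attention from 32 text tokens
print(prune_visual_tokens(visual, attn).shape)    # torch.Size([2, 144, 1024])
```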
Evaluating Multimodal Large Language Models on Vertically Written Japanese Text
Neutral · Artificial Intelligence
This study evaluates the performance of Multimodal Large Language Models (MLLMs) on vertically written Japanese text, an area that has seen limited research. The authors generated a synthetic Japanese OCR dataset that includes both horizontal and vertical writing for model fine-tuning and evaluation. The findings aim to enhance the understanding of document images in Japanese, particularly those with vertical text formats.
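Synthetic vertical-text (tategaki) images of the kind described can be produced by rendering characters top-to-bottom in right-to-left columns. This is a minimal Pillow sketch of that idea, not the paper's generation pipeline; the font path is an assumption (any CJK-capable TrueType font will do).

```python
from PIL import Image, ImageDraw, ImageFont

def render_vertical(text: str, font_path: str, size: int = 32,
                    line_chars: int = 10) -> Image.Image:
    """Render Japanese text vertically: columns run right-to-left,
    characters top-to-bottom, as in traditional tategaki layout."""
    font = ImageFont.truetype(font_path, size)           # font_path is an assumption
    columns = [text[i:i + line_chars] for i in range(0, len(text), line_chars)]
    w, h = size * (len(columns) + 1), size * (line_chars + 1)
    img = Image.new("RGB", (w, h), "white")
    draw = ImageDraw.Draw(img)
    for col, chunk in enumerate(columns):
        x = w - size * (col + 1) - size // 2              # rightmost column first
        for row, ch in enumerate(chunk):
            draw.text((x, size // 2 + row * size), ch, font=font, fill="black")
    return img

# Usage (paired with the source string, this yields one OCR training example):
# img = render_vertical("吾輩は猫である。名前はまだ無い。", "NotoSansJP-Regular.ttf")
# img.save("sample_tategaki.png")
```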
ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning
Positive · Artificial Intelligence
The paper presents ANTS, an innovative method for enhancing Out-of-Distribution (OOD) detection by utilizing Adaptive Negative Textual Space. By leveraging multimodal large language models (MLLMs), the approach generates expressive negative sentences that accurately characterize OOD distributions. This method addresses the limitations of existing techniques, particularly in near-OOD detection, by caching images likely to be OOD samples and prompting MLLMs for detailed descriptions.
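Stripped of the adaptive, test-time parts, the underlying scoring idea can be sketched as follows: embed the image, the in-distribution class prompts, and the generated negative sentences in a shared space (CLIP-style), and flag the image as OOD when it lies closer to the negative textual space. The checkpoint name, example prompts, and threshold below are assumptions, and this is a simplification rather than ANTS itself.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# "openai/clip-vit-base-patch32" is just a convenient public checkpoint (assumption).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

id_prompts  = ["a photo of a dog", "a photo of a cat"]                   # in-distribution
neg_prompts = ["a photo of industrial machinery on a factory floor",     # negative textual
               "a photo of a city street at night with neon signs"]      # space (MLLM-written)

def is_ood(image: Image.Image, threshold: float = 0.0) -> bool:
    inputs = processor(text=id_prompts + neg_prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    sims = (img @ txt.T).squeeze(0)                     # cosine similarity per prompt
    id_score  = sims[:len(id_prompts)].max()
    neg_score = sims[len(id_prompts):].max()
    # OOD if the best negative sentence beats the best in-distribution prompt.
    return (neg_score - id_score).item() > threshold
```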
MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
Positive · Artificial Intelligence
The paper introduces MoDES, a novel framework designed to enhance the efficiency of Mixture-of-Experts (MoE) Multimodal Large Language Models (MLLMs) by implementing dynamic expert skipping. Traditional expert skipping methods, originally intended for unimodal models, lead to performance degradation in MLLMs due to their unique characteristics. MoDES aims to address these inefficiencies without requiring additional training, utilizing a globally-modulated local gating mechanism for improved inference.
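A simplified, training-free illustration of the idea: router scores in an MoE layer are modulated by a global (e.g. modality-level) factor, and experts whose modulated score falls below a threshold are simply not evaluated for that token. The gating form, threshold, and module sizes are illustrative assumptions, not MoDES's exact globally-modulated local gating.

```python
import torch
import torch.nn as nn

class SkippingMoELayer(nn.Module):
    def __init__(self, d_model=64, n_experts=4, skip_threshold=0.15):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.skip_threshold = skip_threshold

    def forward(self, x, global_weight):
        """x: (tokens, d_model); global_weight: (n_experts,) modulation factor,
        e.g. derived from modality statistics rather than per-token routing."""
        local = self.router(x).softmax(dim=-1)              # (tokens, n_experts)
        score = local * global_weight                       # globally modulated gate
        out = torch.zeros_like(x)
        skipped = 0
        for e, expert in enumerate(self.experts):
            keep = score[:, e] >= self.skip_threshold       # tokens that use expert e
            skipped += (~keep).sum().item()
            if keep.any():                                  # otherwise skip expert entirely
                out[keep] += score[keep, e:e + 1] * expert(x[keep])
        return out, skipped

layer = SkippingMoELayer()
tokens = torch.randn(8, 64)
out, skipped = layer(tokens, global_weight=torch.tensor([1.0, 1.0, 0.5, 0.25]))
print(out.shape, "token-expert pairs skipped:", skipped)
```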