Verb Mirage: Unveiling and Assessing Verb Concept Hallucinations in Multimodal Large Language Models

arXiv — cs.CV · Thursday, November 20, 2025 at 5:00:00 AM
  • The study highlights the issue of verb hallucination in Multimodal Large Language Models (MLLMs), marking a significant step in understanding their limitations.
  • Addressing verb hallucination is crucial for improving the reliability of MLLMs, as verbs are essential for interpreting human actions and interactions.
  • This research contributes to ongoing discussions about the robustness of MLLMs, emphasizing the need for comprehensive evaluation methods that address both object and verb hallucinations.
— via World Pulse Now AI Editorial System


Recommended Readings
Physics-Based Benchmarking Metrics for Multimodal Synthetic Images
Neutral · Artificial Intelligence
The paper presents a new metric called Physics-Constrained Multimodal Data Evaluation (PCMDE) aimed at improving the evaluation of multimodal synthetic images. Current metrics such as BLEU and CIDEr often fail to capture semantic and structural correctness, particularly in specialized domains. PCMDE integrates large language models with reasoning and vision-language models to enhance feature extraction, validation, and physics-guided reasoning.
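To make the stated limitation concrete, here is a minimal sketch (plain Python, not part of the PCMDE paper) of the clipped n-gram precision used inside BLEU: a physically implausible caption that reuses the reference's words scores as high as, or higher than, a correct one, which is the gap a physics-constrained metric targets. The example captions are invented for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision, the quantity BLEU averages over n."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(cand.values()), 1)

reference   = "a glass of water resting on a wooden table"
plausible   = "a glass of water sitting on a wooden table"
implausible = "a wooden table resting on a glass of water"  # physically wrong scene

for name, cand in [("plausible", plausible), ("implausible", implausible)]:
    p1 = modified_precision(cand, reference, 1)
    p2 = modified_precision(cand, reference, 2)
    print(f"{name:11s}  unigram={p1:.2f}  bigram={p2:.2f}")
# The implausible caption reuses the same words, so n-gram overlap stays high
# even though the described scene violates basic physics.
```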
Context Cascade Compression: Exploring the Upper Limits of Text Compression
Positive · Artificial Intelligence
The research introduces Context Cascade Compression (C3), a method designed to tackle the challenges posed by million-level token inputs in long-context tasks for Large Language Models (LLMs). C3 utilizes two LLMs of varying sizes, where a smaller model compresses text into latent tokens, followed by a larger model that decodes this compressed context. Preliminary experiments indicate that C3 achieves a 20x compression ratio with 98% decoding accuracy.
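A minimal sketch of the two-stage idea described above: a small model compresses a long context into far fewer latent tokens, and a larger model decodes while attending only to those latents. The toy PyTorch modules, pooling-based compressor, and 20:1 ratio below are illustrative assumptions, not C3's implementation.

```python
import torch
import torch.nn as nn

class LatentCompressor(nn.Module):
    """Toy stand-in for the small LLM: maps a long hidden-state sequence
    to a much shorter sequence of latent tokens (here via mean pooling)."""
    def __init__(self, d_model=256, ratio=20):
        super().__init__()
        self.ratio = ratio
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, hidden):                    # hidden: (batch, seq_len, d_model)
        b, t, d = hidden.shape
        t_trim = (t // self.ratio) * self.ratio
        pooled = hidden[:, :t_trim].reshape(b, -1, self.ratio, d).mean(dim=2)
        return self.proj(pooled)                  # (batch, seq_len // ratio, d_model)

class LatentDecoder(nn.Module):
    """Toy stand-in for the large LLM: cross-attends over the latent tokens
    instead of over the full original context."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, queries, latents):
        out, _ = self.attn(queries, latents, latents)
        return out

# Usage: a 2,000-token context becomes 100 latent tokens (20x compression).
hidden  = torch.randn(1, 2000, 256)
latents = LatentCompressor()(hidden)              # (1, 100, 256)
answer  = LatentDecoder()(torch.randn(1, 16, 256), latents)
print(latents.shape, answer.shape)
```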
CompAgent: An Agentic Framework for Visual Compliance Verification
Positive · Artificial Intelligence
CompAgent is a newly proposed framework aimed at enhancing visual compliance verification in computer vision, particularly within media and advertising sectors. It addresses the limitations of existing methods that rely on deep learning models trained on manually labeled datasets. By integrating Multimodal Large Language Models (MLLMs) with various visual tools, CompAgent aims to improve the reasoning and application of compliance rules in visual content.
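An illustrative agentic loop in this spirit, not CompAgent's actual code: an MLLM is prompted with a compliance rule, picks a visual tool, and then issues a verdict grounded in the tool's output. The tool registry, the `call_mllm` stub, and all function names here are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical visual tools; in practice these would wrap detectors, OCR, etc.
def detect_logos(image_path: str) -> str:
    return "no third-party logos detected"

def read_on_image_text(image_path: str) -> str:
    return "text: 'limited time offer'"

TOOLS: Dict[str, Callable[[str], str]] = {
    "logo_detector": detect_logos,
    "ocr": read_on_image_text,
}

@dataclass
class ComplianceVerdict:
    rule: str
    compliant: bool
    evidence: List[str]

def call_mllm(prompt: str) -> str:
    """Stub for the MLLM call (assumption: any chat-style multimodal API).
    A real agent would parse tool choices and judgements from model replies."""
    return "yes" if "Compliant?" in prompt else "ocr"

def verify(image_path: str, rule: str) -> ComplianceVerdict:
    evidence = []
    # 1. Ask the MLLM which tool it needs for this rule.
    tool_name = call_mllm(f"Rule: {rule}\nWhich tool do you need? {list(TOOLS)}")
    # 2. Run the chosen tool and collect its output as evidence.
    evidence.append(TOOLS[tool_name](image_path))
    # 3. Ask the MLLM for a final judgement grounded in the evidence.
    verdict = call_mllm(f"Rule: {rule}\nEvidence: {evidence}\nCompliant? yes/no")
    return ComplianceVerdict(rule, verdict.strip().lower().startswith("yes"), evidence)

print(verify("ad.png", "Ads must not show unlicensed third-party logos"))
```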
FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR Evaluation
Positive · Artificial Intelligence
FinCriticalED (Financial Critical Error Detection) is introduced as a visual benchmark for evaluating OCR and vision language models specifically on financial documents at the fact level. This benchmark addresses the challenges posed by the visually dense layouts of financial documents, where minor OCR errors can lead to significant misinterpretations. It provides 500 image-HTML pairs with expert-annotated financial facts, marking a shift from traditional metrics to a focus on factual correctness.
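Fact-level scoring of the kind described above can be pictured as matching extracted (fact, value) pairs against expert-annotated gold facts, so that a single flipped digit in a key figure counts as a critical error even if most characters are transcribed correctly. The sketch below is a plausible shape for such a check; the field names and tolerance are assumptions, not the benchmark's actual protocol.

```python
def critical_error_rate(predicted: dict, gold: dict, numeric_tol: float = 0.0) -> float:
    """Fraction of annotated gold facts that the OCR/VLM output gets wrong.

    `predicted` and `gold` map a fact name (e.g. "net_revenue_2023") to its value.
    Only facts present in `gold` are scored, mirroring fact-level evaluation.
    """
    errors = 0
    for fact, gold_value in gold.items():
        pred_value = predicted.get(fact)
        if isinstance(gold_value, (int, float)) and isinstance(pred_value, (int, float)):
            ok = abs(pred_value - gold_value) <= numeric_tol
        else:
            ok = pred_value == gold_value
        errors += 0 if ok else 1
    return errors / max(len(gold), 1)

gold = {"net_revenue_2023": 4512.0, "fiscal_year_end": "Dec 31, 2023"}
pred = {"net_revenue_2023": 4512.0, "fiscal_year_end": "Dec 31, 2028"}  # one OCR slip
print(critical_error_rate(pred, gold))  # 0.5 -- a single wrong digit flips a key fact
```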
A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models
Positive · Artificial Intelligence
A comprehensive study on visual token redundancy in discrete diffusion-based multimodal large language models (dMLLMs) has been conducted, revealing significant computational overhead during inference due to full-sequence attention. The research highlights that visual redundancy primarily occurs in from-scratch dMLLMs when addressing long-answer tasks and examines the impact of visual token pruning on model efficiency and responses.
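Visual token pruning, as examined in the study, is typically implemented by ranking visual tokens with an importance score and keeping only the top-k before the expensive attention layers. The sketch below uses attention mass from text tokens as that score; the shapes and keep ratio are illustrative, not the study's settings.

```python
import torch

def prune_visual_tokens(visual_tokens: torch.Tensor,
                        attn_from_text: torch.Tensor,
                        keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep the visual tokens that receive the most attention from text tokens.

    visual_tokens:  (batch, num_visual, dim)
    attn_from_text: (batch, num_text, num_visual) attention weights
    """
    importance = attn_from_text.mean(dim=1)                 # (batch, num_visual)
    k = max(1, int(visual_tokens.size(1) * keep_ratio))
    top_idx = importance.topk(k, dim=-1).indices            # (batch, k)
    top_idx = top_idx.sort(dim=-1).values                   # preserve spatial order
    gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, visual_tokens.size(-1))
    return visual_tokens.gather(1, gather_idx)              # (batch, k, dim)

visual = torch.randn(2, 576, 1024)                # e.g. 24x24 patch tokens per image
attn   = torch.rand(2, 32, 576).softmax(dim=-1)   # attention from 32 text tokens
print(prune_visual_tokens(visual, attn).shape)    # torch.Size([2, 144, 1024])
```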
Evaluating Multimodal Large Language Models on Vertically Written Japanese Text
Neutral · Artificial Intelligence
This study evaluates the performance of Multimodal Large Language Models (MLLMs) on vertically written Japanese text, an area that has seen limited research. The authors generated a synthetic Japanese OCR dataset that includes both horizontal and vertical writing for model fine-tuning and evaluation. The findings aim to enhance the understanding of document images in Japanese, particularly those with vertical text formats.
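Synthetic vertical-text (tategaki) images of the kind described can be produced by rendering characters top-to-bottom in right-to-left columns. This is a minimal Pillow sketch of that idea, not the paper's generation pipeline; the font path is an assumption (any CJK-capable TrueType font will do).

```python
from PIL import Image, ImageDraw, ImageFont

def render_vertical(text: str, font_path: str, size: int = 32,
                    line_chars: int = 10) -> Image.Image:
    """Render Japanese text vertically: columns run right-to-left,
    characters top-to-bottom, as in traditional tategaki layout."""
    font = ImageFont.truetype(font_path, size)           # font_path is an assumption
    columns = [text[i:i + line_chars] for i in range(0, len(text), line_chars)]
    w, h = size * (len(columns) + 1), size * (line_chars + 1)
    img = Image.new("RGB", (w, h), "white")
    draw = ImageDraw.Draw(img)
    for col, chunk in enumerate(columns):
        x = w - size * (col + 1) - size // 2              # rightmost column first
        for row, ch in enumerate(chunk):
            draw.text((x, size // 2 + row * size), ch, font=font, fill="black")
    return img

# Usage (paired with the source string, this yields one OCR training example):
# img = render_vertical("吾輩は猫である。名前はまだ無い。", "NotoSansJP-Regular.ttf")
# img.save("sample_tategaki.png")
```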
ANTS: Adaptive Negative Textual Space Shaping for OOD Detection via Test-Time MLLM Understanding and Reasoning
Positive · Artificial Intelligence
The paper presents ANTS, an innovative method for enhancing Out-of-Distribution (OOD) detection by utilizing Adaptive Negative Textual Space. By leveraging multimodal large language models (MLLMs), the approach generates expressive negative sentences that accurately characterize OOD distributions. This method addresses the limitations of existing techniques, particularly in near-OOD detection, by caching images likely to be OOD samples and prompting MLLMs for detailed descriptions.
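Stripped of the adaptive, test-time parts, the underlying scoring idea can be sketched as follows: embed the image, the in-distribution class prompts, and the generated negative sentences in a shared space (CLIP-style), and flag the image as OOD when it lies closer to the negative textual space. The checkpoint name, example prompts, and threshold below are assumptions, and this is a simplification rather than ANTS itself.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# "openai/clip-vit-base-patch32" is just a convenient public checkpoint (assumption).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

id_prompts  = ["a photo of a dog", "a photo of a cat"]                   # in-distribution
neg_prompts = ["a photo of industrial machinery on a factory floor",     # negative textual
               "a photo of a city street at night with neon signs"]      # space (MLLM-written)

def is_ood(image: Image.Image, threshold: float = 0.0) -> bool:
    inputs = processor(text=id_prompts + neg_prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    sims = (img @ txt.T).squeeze(0)                     # cosine similarity per prompt
    id_score  = sims[:len(id_prompts)].max()
    neg_score = sims[len(id_prompts):].max()
    # OOD if the best negative sentence beats the best in-distribution prompt.
    return (neg_score - id_score).item() > threshold
```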
MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
Positive · Artificial Intelligence
The paper introduces MoDES, a novel framework designed to enhance the efficiency of Mixture-of-Experts (MoE) Multimodal Large Language Models (MLLMs) by implementing dynamic expert skipping. Traditional expert skipping methods, originally intended for unimodal models, lead to performance degradation in MLLMs due to their unique characteristics. MoDES aims to address these inefficiencies without requiring additional training, utilizing a globally-modulated local gating mechanism for improved inference.
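A simplified, training-free illustration of the idea: router scores in an MoE layer are modulated by a global (e.g. modality-level) factor, and experts whose modulated score falls below a threshold are simply not evaluated for that token. The gating form, threshold, and module sizes are illustrative assumptions, not MoDES's exact globally-modulated local gating.

```python
import torch
import torch.nn as nn

class SkippingMoELayer(nn.Module):
    def __init__(self, d_model=64, n_experts=4, skip_threshold=0.15):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.skip_threshold = skip_threshold

    def forward(self, x, global_weight):
        """x: (tokens, d_model); global_weight: (n_experts,) modulation factor,
        e.g. derived from modality statistics rather than per-token routing."""
        local = self.router(x).softmax(dim=-1)              # (tokens, n_experts)
        score = local * global_weight                       # globally modulated gate
        out = torch.zeros_like(x)
        skipped = 0
        for e, expert in enumerate(self.experts):
            keep = score[:, e] >= self.skip_threshold       # tokens that use expert e
            skipped += (~keep).sum().item()
            if keep.any():                                  # otherwise skip expert entirely
                out[keep] += score[keep, e:e + 1] * expert(x[keep])
        return out, skipped

layer = SkippingMoELayer()
tokens = torch.randn(8, 64)
out, skipped = layer(tokens, global_weight=torch.tensor([1.0, 1.0, 0.5, 0.25]))
print(out.shape, "token-expert pairs skipped:", skipped)
```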