Multimodal LLMs Do Not Compose Skills Optimally Across Modalities

arXiv — cs.CL · Thursday, November 13, 2025 at 5:00:00 AM
This study investigates how well Multimodal Large Language Models (MLLMs) compose skills across modalities. The researchers designed three evaluation tasks, each requiring the sequential combination of two modality-dependent skills, and assessed several MLLMs using both direct prompting and a two-step cascaded inference approach. The findings reveal a significant cross-modality skill composition gap in every evaluated model. Strategies such as chain-of-thought prompting and targeted fine-tuning recipes improved performance but still fell short of closing the gap, underscoring the need for continued research into the skill composition abilities of MLLMs, a capability important for practical AI applications.
— via World Pulse Now AI Editorial System
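To make the contrast concrete, the sketch below shows the general shape of direct prompting versus two-step cascaded inference on a toy composite task (read a value from an image, then reason over it). It is a minimal illustration, not the paper's exact evaluation protocol; the `query_mllm` wrapper and the prompts are hypothetical placeholders for whatever MLLM API is used.

```python
# Minimal sketch contrasting direct prompting with two-step cascaded inference
# on a composite task (visual extraction followed by numerical reasoning).
# `query_mllm` is a hypothetical wrapper around a chat-style MLLM API.

def query_mllm(prompt: str, image_path: str | None = None) -> str:
    """Placeholder for a call to a multimodal LLM; wire this to your provider."""
    raise NotImplementedError

def direct_prompting(image_path: str) -> str:
    # One call: the model must compose both skills (perception + reasoning) internally.
    prompt = ("Look at the receipt in the image, find the total amount, "
              "and report what it would be after adding a 15% tip.")
    return query_mllm(prompt, image_path=image_path)

def cascaded_inference(image_path: str) -> str:
    # Step 1: apply the modality-dependent skill in isolation (visual extraction).
    total = query_mllm("What is the total amount printed on this receipt? "
                       "Answer with the number only.", image_path=image_path)
    # Step 2: apply the second skill to the intermediate text output (numerical reasoning).
    return query_mllm(f"The total on a receipt is {total}. "
                      "What is the amount after adding a 15% tip?")
```

The cascaded variant serves as an upper-bound reference: if a model succeeds when the two skills are invoked separately but fails under direct prompting, the shortfall is attributable to skill composition rather than to either skill alone.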


Recommended Readings
HiEAG: Evidence-Augmented Generation for Out-of-Context Misinformation Detection
Positive · Artificial Intelligence
Recent advancements in out-of-context (OOC) misinformation detection have highlighted the need for improved consistency checks between image-text pairs and external evidence. The proposed HiEAG framework aims to enhance this process by utilizing multimodal large language models (MLLMs) to refine external consistency checking. This approach includes a comprehensive pipeline that integrates evidence reranking and rewriting, addressing the limitations of current methods that focus primarily on internal consistency.
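As a rough illustration of the rerank-then-rewrite idea described above, the following sketch strings together evidence reranking, rewriting, and an external consistency check with MLLM calls. All helpers and prompts are hypothetical stand-ins, assumed for illustration only; this is not HiEAG's actual implementation.

```python
# Illustrative rerank-and-rewrite evidence pipeline for out-of-context (OOC) detection.
# Every helper below is a hypothetical placeholder, not HiEAG's API.

def query_mllm(prompt: str, image_path: str | None = None) -> str:
    """Placeholder call to a multimodal LLM; wire to the provider of your choice."""
    raise NotImplementedError

def score_evidence(image_path: str, caption: str, doc: str) -> float:
    # Ask the model for a 0-10 relevance score of one evidence document.
    answer = query_mllm(f"Caption: {caption}\nEvidence: {doc}\n"
                        "On a scale of 0 to 10, how relevant is this evidence "
                        "to verifying the caption against the image? Number only.",
                        image_path=image_path)
    return float(answer)

def detect_out_of_context(image_path: str, caption: str, evidence: list[str]) -> str:
    # 1. Rerank retrieved evidence by relevance to the image-caption pair.
    ranked = sorted(evidence,
                    key=lambda doc: score_evidence(image_path, caption, doc),
                    reverse=True)
    # 2. Rewrite the top evidence into a concise, claim-focused summary.
    summary = query_mllm("Summarize the facts in these documents that bear on the caption "
                         f"'{caption}':\n" + "\n".join(ranked[:3]))
    # 3. External consistency check between the pair and the rewritten evidence.
    return query_mllm(f"Caption: {caption}\nEvidence summary: {summary}\n"
                      "Is the caption consistent with the evidence and the image? "
                      "Answer 'pristine' or 'out-of-context'.", image_path=image_path)
```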
Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models
Neutral · Artificial Intelligence
The paper titled 'Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models' explores the underutilized potential of Multi-modal Large Language Models (MLLMs) in Document Image Quality Assessment (DIQA). It introduces a three-tiered evaluation framework that assesses MLLMs' capabilities at coarse, middle, and fine granularity levels. The study reveals that while MLLMs show early DIQA abilities, they face significant limitations, including inconsistent scoring and distortion misidentification.
Unifying Segment Anything in Microscopy with Vision-Language Knowledge
Positive · Artificial Intelligence
The paper titled 'Unifying Segment Anything in Microscopy with Vision-Language Knowledge' discusses the importance of accurate segmentation in biomedical images. It highlights the limitations of existing models in handling unseen domain data due to a lack of vision-language knowledge. The authors propose a new framework, uLLSAM, which utilizes Multimodal Large Language Models (MLLMs) to enhance segmentation performance. This approach aims to improve generalization capabilities across cross-domain datasets, achieving notable performance improvements.
MicroVQA++: High-Quality Microscopy Reasoning Dataset with Weakly Supervised Graphs for Multimodal Large Language Model
Positive · Artificial Intelligence
MicroVQA++ is a newly introduced high-quality microscopy reasoning dataset designed for multimodal large language models (MLLMs). Derived from the BIOMEDICA archive, it is built through a three-stage process: expert-validated figure-caption pairs, a novel heterogeneous graph for filtering inconsistent samples, and human-checked multiple-choice questions. The dataset aims to enhance scientific reasoning in biomedical imaging, addressing limitations caused by the lack of large-scale training data.
Sat2RealCity: Geometry-Aware and Appearance-Controllable 3D Urban Generation from Satellite Imagery
Positive · Artificial Intelligence
The paper titled 'Sat2RealCity: Geometry-Aware and Appearance-Controllable 3D Urban Generation from Satellite Imagery' presents a novel framework for generating 3D urban environments using real-world satellite images. This approach addresses significant challenges in existing methods, such as the need for extensive 3D city assets and the limitations of semantic or height maps. By focusing on individual building entities, Sat2RealCity enhances realism and generalizability in urban modeling.