Multimodal LLMs Do Not Compose Skills Optimally Across Modalities

arXiv — cs.CL · Thursday, November 13, 2025 at 5:00:00 AM
This study investigates how well Multimodal Large Language Models (MLLMs) compose skills across modalities. The researchers designed three evaluation tasks, each requiring the sequential combination of two modality-dependent skills, and assessed several MLLMs using both direct prompting and a two-step cascaded inference approach. The findings reveal a significant cross-modality skill composition gap in every evaluated model. Strategies such as chain-of-thought prompting and targeted fine-tuning recipes improved performance but still fell short of closing the gap, underscoring the need for continued research into the skill composition abilities of MLLMs, a capability important for practical AI applications.
— via World Pulse Now AI Editorial System
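To make the contrast concrete, the sketch below shows the general shape of direct prompting versus two-step cascaded inference on a toy composite task (read a value from an image, then reason over it). It is a minimal illustration, not the paper's exact evaluation protocol; the `query_mllm` wrapper and the prompts are hypothetical placeholders for whatever MLLM API is used.

```python
# Minimal sketch contrasting direct prompting with two-step cascaded inference
# on a composite task (visual extraction followed by numerical reasoning).
# `query_mllm` is a hypothetical wrapper around a chat-style MLLM API.

def query_mllm(prompt: str, image_path: str | None = None) -> str:
    """Placeholder for a call to a multimodal LLM; wire this to your provider."""
    raise NotImplementedError

def direct_prompting(image_path: str) -> str:
    # One call: the model must compose both skills (perception + reasoning) internally.
    prompt = ("Look at the receipt in the image, find the total amount, "
              "and report what it would be after adding a 15% tip.")
    return query_mllm(prompt, image_path=image_path)

def cascaded_inference(image_path: str) -> str:
    # Step 1: apply the modality-dependent skill in isolation (visual extraction).
    total = query_mllm("What is the total amount printed on this receipt? "
                       "Answer with the number only.", image_path=image_path)
    # Step 2: apply the second skill to the intermediate text output (numerical reasoning).
    return query_mllm(f"The total on a receipt is {total}. "
                      "What is the amount after adding a 15% tip?")
```

The cascaded variant serves as an upper-bound reference: if a model succeeds when the two skills are invoked separately but fails under direct prompting, the shortfall is attributable to skill composition rather than to either skill alone.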


Recommended Readings
HiEAG: Evidence-Augmented Generation for Out-of-Context Misinformation Detection
Positive · Artificial Intelligence
Recent advancements in out-of-context (OOC) misinformation detection have highlighted the need for improved consistency checks between image-text pairs and external evidence. The proposed HiEAG framework aims to enhance this process by utilizing multimodal large language models (MLLMs) to refine external consistency checking. This approach includes a comprehensive pipeline that integrates evidence reranking and rewriting, addressing the limitations of current methods that focus primarily on internal consistency.
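As a rough illustration of the rerank-then-rewrite idea described above, the following sketch strings together evidence reranking, rewriting, and an external consistency check with MLLM calls. All helpers and prompts are hypothetical stand-ins, assumed for illustration only; this is not HiEAG's actual implementation.

```python
# Illustrative rerank-and-rewrite evidence pipeline for out-of-context (OOC) detection.
# Every helper below is a hypothetical placeholder, not HiEAG's API.

def query_mllm(prompt: str, image_path: str | None = None) -> str:
    """Placeholder call to a multimodal LLM; wire to the provider of your choice."""
    raise NotImplementedError

def score_evidence(image_path: str, caption: str, doc: str) -> float:
    # Ask the model for a 0-10 relevance score of one evidence document.
    answer = query_mllm(f"Caption: {caption}\nEvidence: {doc}\n"
                        "On a scale of 0 to 10, how relevant is this evidence "
                        "to verifying the caption against the image? Number only.",
                        image_path=image_path)
    return float(answer)

def detect_out_of_context(image_path: str, caption: str, evidence: list[str]) -> str:
    # 1. Rerank retrieved evidence by relevance to the image-caption pair.
    ranked = sorted(evidence,
                    key=lambda doc: score_evidence(image_path, caption, doc),
                    reverse=True)
    # 2. Rewrite the top evidence into a concise, claim-focused summary.
    summary = query_mllm("Summarize the facts in these documents that bear on the caption "
                         f"'{caption}':\n" + "\n".join(ranked[:3]))
    # 3. External consistency check between the pair and the rewritten evidence.
    return query_mllm(f"Caption: {caption}\nEvidence summary: {summary}\n"
                      "Is the caption consistent with the evidence and the image? "
                      "Answer 'pristine' or 'out-of-context'.", image_path=image_path)
```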
Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models
Neutral · Artificial Intelligence
The paper titled 'Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models' explores the underutilized potential of Multi-modal Large Language Models (MLLMs) in Document Image Quality Assessment (DIQA). It introduces a three-tiered evaluation framework that assesses MLLMs' capabilities at coarse, middle, and fine granularity levels. The study reveals that while MLLMs show early DIQA abilities, they face significant limitations, including inconsistent scoring and distortion misidentification.
Unifying Segment Anything in Microscopy with Vision-Language Knowledge
Positive · Artificial Intelligence
The paper titled 'Unifying Segment Anything in Microscopy with Vision-Language Knowledge' discusses the importance of accurate segmentation in biomedical images. It highlights the limitations of existing models in handling unseen domain data due to a lack of vision-language knowledge. The authors propose a new framework, uLLSAM, which utilizes Multimodal Large Language Models (MLLMs) to enhance segmentation performance. This approach aims to improve generalization capabilities across cross-domain datasets, achieving notable performance improvements.
MicroVQA++: High-Quality Microscopy Reasoning Dataset with Weakly Supervised Graphs for Multimodal Large Language Model
Positive · Artificial Intelligence
MicroVQA++ is a newly introduced high-quality microscopy reasoning dataset designed for multimodal large language models (MLLMs). Derived from the BIOMEDICA archive, it is built through a three-stage process: expert-validated figure-caption pairs, a novel heterogeneous graph for filtering inconsistent samples, and human-checked multiple-choice questions. The dataset aims to enhance scientific reasoning in biomedical imaging, addressing limitations caused by the lack of large-scale training data.
Sat2RealCity: Geometry-Aware and Appearance-Controllable 3D Urban Generation from Satellite Imagery
Positive · Artificial Intelligence
The paper titled 'Sat2RealCity: Geometry-Aware and Appearance-Controllable 3D Urban Generation from Satellite Imagery' presents a novel framework for generating 3D urban environments using real-world satellite images. This approach addresses significant challenges in existing methods, such as the need for extensive 3D city assets and the limitations of semantic or height maps. By focusing on individual building entities, Sat2RealCity enhances realism and generalizability in urban modeling.