Multimodal LLMs Do Not Compose Skills Optimally Across Modalities

arXiv — cs.CL · Thursday, November 13, 2025 at 5:00:00 AM
This study investigates how well Multimodal Large Language Models (MLLMs) compose skills across modalities. The researchers designed three evaluation tasks, each requiring the sequential combination of two modality-dependent skills, and assessed several MLLMs under both direct prompting and a two-step cascaded inference approach, in which the output of the first skill is passed to a second inference step. Across all evaluated models, they find a significant cross-modality skill composition gap: models that handle each skill in isolation fall short when asked to chain them. Strategies such as chain-of-thought prompting and targeted fine-tuning recipes narrow the gap but do not close it, underscoring the need for continued research on skill composition in MLLMs, a capability that matters for future AI applications.
— via World Pulse Now AI Editorial System
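
To make the comparison concrete, below is a minimal Python sketch of the two inference modes the summary describes. The `query_mllm` helper and the chart-reading task are hypothetical placeholders rather than the paper's actual setup: direct prompting asks one question that requires both skills at once, while cascaded inference first extracts the visual information and then reasons over it in a separate, text-only call.

```python
# Minimal sketch (not from the paper) of the two inference modes mentioned
# above. `query_mllm` is a hypothetical placeholder for any multimodal LLM
# call; the chart-reading task wording is also illustrative, not the paper's.
from typing import Optional


def query_mllm(prompt: str, image_path: Optional[str] = None) -> str:
    """Placeholder: swap in a real multimodal model client here."""
    raise NotImplementedError


def direct_prompting(image_path: str) -> str:
    # A single prompt asks the model to compose both skills at once:
    # perceive values in the image (skill 1) and reason over them (skill 2).
    prompt = (
        "Look at the chart, read the values for 2022 and 2023, "
        "and compute the year-over-year growth."
    )
    return query_mllm(prompt, image_path=image_path)


def cascaded_inference(image_path: str) -> str:
    # Step 1: exercise only the visual-perception skill.
    extracted = query_mllm(
        "Report the values shown for 2022 and 2023, nothing else.",
        image_path=image_path,
    )
    # Step 2: exercise only the reasoning skill, now on text alone.
    return query_mllm(
        f"Given these values: {extracted}\n"
        "Compute the year-over-year growth from 2022 to 2023."
    )
```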
