Multimodal LLMs Do Not Compose Skills Optimally Across Modalities
Neutral · Artificial Intelligence
This study investigates how well Multimodal Large Language Models (MLLMs) compose skills across modalities. The researchers designed three evaluation tasks, each requiring the sequential combination of two skills that depend on different modalities, and assessed several MLLMs using both direct prompting and a two-step cascaded inference approach in which each skill is exercised in a separate model call. All evaluated models exhibit a significant cross-modality skill composition gap: they perform worse on the composed tasks than their proficiency on each component skill would predict. Strategies such as chain-of-thought prompting and specific fine-tuning recipes were explored to improve performance, but they still fell short of closing the gap, underscoring the need for further research into the skill composition abilities of MLLMs, a capability important for future AI applications.
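As a rough, hypothetical sketch (not taken from the paper), the two inference strategies can be contrasted as follows; the `call_mllm` helper and the chart-reading task are placeholders for whatever model API and composed task are actually being evaluated.

```python
from typing import Optional

def call_mllm(prompt: str, image_path: Optional[str] = None) -> str:
    """Hypothetical wrapper around a multimodal LLM API call (placeholder)."""
    raise NotImplementedError("plug in the model client under evaluation")

def direct_prompting(image_path: str, question: str) -> str:
    # One call: the model must both perceive the image and reason over it.
    return call_mllm(question, image_path=image_path)

def cascaded_inference(image_path: str, extraction_prompt: str,
                       reasoning_template: str) -> str:
    # Step 1: exercise the modality-dependent skill (e.g., read the chart).
    extracted = call_mllm(extraction_prompt, image_path=image_path)
    # Step 2: exercise the second skill on the extracted text alone.
    return call_mllm(reasoning_template.format(extracted=extracted))

# Illustrative usage (the task itself is hypothetical):
# answer = cascaded_inference(
#     "chart.png",
#     "List the value shown for each bar in this chart.",
#     "Given these values: {extracted}\nWhich year had the largest increase?",
# )
```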
— via World Pulse Now AI Editorial System
