arXiv:2511.17397v1 Announce Type: new 
Abstract: Multimodal Action Quality Assessment (AQA) has recently emerged as a promising paradigm. By leveraging complementary information across shared contextual cues, it enhances the discriminative evaluation of subtle intra-class variations in highly similar action sequences. However, partial modalities are frequently unavailable at the inference stage in reality. The absence of any modality often renders existing multimodal models inoperable. Furthermore, it triggers catastrophic performance degradation due to interruptions in cross-modal interactions. To address this issue, we propose a novel Missing Completion Framework with Mixture of Experts (MCMoE) that unifies unimodal and joint representation learning in single-stage training. Specifically, we propose an adaptive gated modality generator that dynamically fuses available information to reconstruct missing modalities. We then design modality experts to learn unimodal knowledge and dynamically mix the knowledge of all experts to extract cross-modal joint representations. With a mixture of experts, missing modalities are further refined and complemented. Finally, in the training phase, we mine the complete multimodal features and unimodal expert knowledge to guide modality generation and generation-based joint representation extraction. Extensive experiments demonstrate that our MCMoE achieves state-of-the-art results in both complete and incomplete multimodal learning on three public AQA benchmarks. Code is available at https://github.com/XuHuangbiao/MCMoE.

A new framework called MCMoE has been proposed to address the challenges of Multimodal Action Quality Assessment (AQA), particularly when certain modalities are missing during inference. This framework integrates unimodal and joint representation learning through a single-stage training process, utilizing an adaptive gated modality generator to reconstruct absent modalities.
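The two ideas in this summary, a gate that fuses available modalities to reconstruct a missing one and a mixture of modality experts that produces a joint representation, can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that). The modality names (`rgb`, `flow`, `audio`), the softmax-based gate and router, and all function and parameter names below are assumptions chosen only for illustration.

```python
import math

# Hypothetical modality names; the summary above does not specify them.
MODALITIES = ("audio", "flow", "rgb")


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Element-wise weighted sum of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def gated_generate(available, gate_vec):
    """Reconstruct a missing feature as a gated mix of the available ones.

    `gate_vec` stands in for learned gate parameters (an assumption).
    """
    names = sorted(available)
    gates = softmax([dot(gate_vec, available[n]) for n in names])
    return weighted_sum(gates, [available[n] for n in names])


def mixture_of_experts(features, experts, router_vecs):
    """Run one expert per modality, then mix the outputs with router weights."""
    names = sorted(features)
    outputs = [experts[n](features[n]) for n in names]
    weights = softmax([dot(router_vecs[n], features[n]) for n in names])
    return weighted_sum(weights, outputs), weights


def complete_and_fuse(observed, gate_vecs, experts, router_vecs):
    """Fill in any missing modality, then extract a joint representation."""
    features = dict(observed)
    for name in MODALITIES:
        if name not in features:
            # Generate only from modalities that were actually observed.
            features[name] = gated_generate(observed, gate_vecs[name])
    return mixture_of_experts(features, experts, router_vecs)


if __name__ == "__main__":
    observed = {"rgb": [1.0, 0.0], "flow": [0.0, 1.0]}  # audio is missing
    zeros = {m: [0.0, 0.0] for m in MODALITIES}  # untrained, uniform gates
    identity = {m: (lambda x: list(x)) for m in MODALITIES}
    joint, weights = complete_and_fuse(observed, zeros, identity, zeros)
    print(joint, weights)
```

In a trained model the gate and router parameters would be learned, and the experts would be neural networks rather than identity functions. The sketch only shows the data flow: complete the missing modality first, then mix expert outputs.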

MCMoE: Completing Missing Modalities with Mixture of Experts for Incomplete Multimodal Action Quality Assessment

arXiv:2506.10016v3 Announce Type: replace-cross 
Abstract: Multimodal Generative Models (MGMs) have rapidly evolved beyond text generation, now spanning diverse output modalities including images, music, video, human motion, and 3D objects, by integrating language with other sensory modalities under unified architectures. This survey categorises six primary generative modalities and examines how foundational techniques, namely Self-Supervised Learning (SSL), Mixture of Experts (MoE), Reinforcement Learning from Human Feedback (RLHF), and Chain-of-Thought (CoT) prompting, enable cross-modal capabilities. We analyze key models, architectural trends, and emergent cross-modal synergies, while highlighting transferable techniques and unresolved challenges. Building on a common taxonomy of models and training recipes, we propose a unified evaluation framework centred on faithfulness, compositionality, and robustness, and synthesise evidence from benchmarks and human studies across modalities. We further analyse trustworthiness, safety, and ethical risks, including multimodal bias, privacy leakage, and the misuse of high-fidelity media generation for deepfakes, disinformation, and copyright infringement in music and 3D assets, together with emerging mitigation strategies. Finally, we discuss how architectural trends, evaluation protocols, and governance mechanisms can be co-designed to close current capability and safety gaps, outlining critical paths toward more general-purpose, controllable, and accountable multimodal generative systems.

A comprehensive survey on Multimodal Generative Models (MGMs) has been published, detailing their evolution from text generation to various output modalities such as images, music, and video. The study categorizes six primary generative modalities and discusses foundational techniques like Self-Supervised Learning and Chain-of-Thought prompting that enable cross-modal capabilities.

