FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning

arXiv — cs.LG · Tuesday, November 25, 2025 at 5:00:00 AM
  • FastMMoE has been introduced as a training-free acceleration framework for multimodal large language models (MLLMs) built on mixture-of-experts (MoE) architectures. It targets the inference latency caused by high-resolution visual inputs, which expand into long sequences of visual tokens, and combines two mechanisms, expert activation reduction and routing-aware token pruning, to cut inference cost without compromising task performance (a minimal sketch of both ideas follows the summary below).
  • The development of FastMMoE is significant as it enables the deployment of MLLMs in resource-constrained and latency-sensitive environments, potentially enhancing the accessibility and usability of advanced AI technologies across various applications.
  • This advancement reflects a growing trend in AI research toward optimizing model performance while managing computational resources, as seen in related studies exploring the trade-off between visual reasoning quality and computational efficiency in large vision-language models.
— via World Pulse Now AI Editorial System
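
To make the two mechanisms concrete, here is a minimal sketch of routing-aware token pruning plus reduced expert activation for an MoE layer, written in PyTorch. It assumes a standard softmax top-k router; the scoring rule (router confidence), the function names, and the keep_ratio parameter are illustrative assumptions and do not come from the FastMMoE paper itself.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def routing_aware_prune(hidden, router_logits, visual_mask, keep_ratio=0.5):
    """Score visual tokens by router confidence (max expert probability)
    and keep only the top fraction; text tokens are always kept.

    hidden:        (seq, dim) token hidden states
    router_logits: (seq, num_experts) MoE router logits per token
    visual_mask:   (seq,) bool, True where the token is a visual token
    """
    probs = F.softmax(router_logits, dim=-1)           # (seq, E)
    score = probs.max(dim=-1).values                   # router confidence
    # Force text tokens to the top so pruning only affects visual tokens.
    score = torch.where(visual_mask, score, torch.full_like(score, float("inf")))
    n_keep = int(keep_ratio * int(visual_mask.sum())) + int((~visual_mask).sum())
    keep_idx = score.topk(n_keep).indices.sort().values  # preserve sequence order
    return hidden[keep_idx], router_logits[keep_idx]

def reduced_topk_dispatch(router_logits, k=1):
    """Activate only the top-k experts per token (smaller k means fewer
    expert FFN calls); returns renormalized weights and expert indices."""
    probs = F.softmax(router_logits, dim=-1)
    w, idx = probs.topk(k, dim=-1)
    return w / w.sum(dim=-1, keepdim=True), idx

# Toy usage: 16 tokens (12 visual), 8 experts, hidden dim 32.
hidden = torch.randn(16, 32)
logits = torch.randn(16, 8)
vis = torch.zeros(16, dtype=torch.bool)
vis[:12] = True
h, l = routing_aware_prune(hidden, logits, vis, keep_ratio=0.5)
w, idx = reduced_topk_dispatch(l, k=1)
print(h.shape, idx.shape)  # torch.Size([10, 32]) torch.Size([10, 1])
```

Scoring visual tokens by their maximum router probability is one plausible reading of "routing-aware": tokens the router is indifferent about contribute little to any single expert and are cheaper to drop.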

Continue Reading
Fairness in Multi-modal Medical Diagnosis with Demonstration Selection
Positive · Artificial Intelligence
Recent advancements in multimodal large language models (MLLMs) highlight the importance of fairness in medical image reasoning, as demonstrated by the introduction of Fairness-Aware Demonstration Selection (FADS). This method aims to mitigate demographic imbalances in model training by utilizing clustering-based sampling to create balanced and relevant demonstrations.
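
For intuition about how clustering-based demonstration selection can enforce balance, here is a minimal sketch in Python. The helper name balanced_demo_selection, the use of k-means over demonstration embeddings, and the even per-cluster sampling rule are assumptions for illustration rather than the published FADS procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def balanced_demo_selection(embeddings, n_demos=8, n_clusters=4, seed=0):
    """Cluster candidate demonstrations in embedding space, then sample
    evenly from each cluster so no single (e.g. demographic) group
    dominates the in-context examples."""
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(embeddings)
    rng = np.random.default_rng(seed)
    per_cluster = n_demos // n_clusters
    picks = []
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        picks.extend(rng.choice(members, size=min(per_cluster, len(members)),
                                replace=False))
    return sorted(picks)

# Toy usage: 100 candidate demonstrations with 64-dim embeddings.
emb = np.random.default_rng(0).normal(size=(100, 64))
print(balanced_demo_selection(emb, n_demos=8, n_clusters=4))
```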