arXiv:2511.11505v1 Announce Type: new 
Abstract: Blocking communication presents a major hurdle in running MoEs efficiently in distributed settings. To address this, we present FarSkip-Collective, which modifies the architecture of modern models to enable overlapping of their computation with communication. Our approach modifies the architecture to skip connections in the model, and it is unclear a priori whether the modified architecture can remain as capable, especially for large state-of-the-art models and when all of the model's layers are modified. We answer this question in the affirmative and fully convert a series of state-of-the-art models, ranging from 16B to 109B parameters, to enable overlapping of their communication while achieving accuracy on par with their original open-source releases. For example, we convert Llama 4 Scout (109B) via self-distillation and achieve accuracy within 1% of its instruction-tuned release, averaged across a wide range of downstream evaluations. In addition to demonstrating that the large modified models retain their accuracy, we realize the benefits of FarSkip-Collective through optimized implementations that explicitly overlap communication with computation, accelerating both training and inference in existing frameworks.

The article introduces FarSkip-Collective, a new approach aimed at overcoming communication bottlenecks in Mixture of Experts (MoE) models. By modifying the architecture so that computation can overlap with communication, the method has been applied to a series of state-of-the-art models ranging from 16B to 109B parameters, including Llama 4 Scout (109B). The results indicate that the modified models maintain accuracy comparable to their original releases; the converted Llama 4 Scout, for example, reaches average accuracy within 1% of its instruction-tuned release.
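
As a rough illustration of what overlapping computation with communication means, the toy sketch below contrasts a block whose computation must wait for a collective with one whose computation reads an input that is already available. It is not the paper's implementation: `fake_collective`, `local_compute`, `blocking_block`, and `overlapped_block` are hypothetical names, the collective is simulated with a sleep on a background thread, and the way the modified block combines its inputs is an assumption made only for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_collective(x, delay=0.01):
    """Hypothetical stand-in for a collective (e.g. an all-to-all)."""
    time.sleep(delay)  # simulated network latency
    return [2 * v for v in x]


def local_compute(x, delay=0.01):
    """Hypothetical stand-in for a layer's local computation."""
    time.sleep(delay)  # simulated compute time
    return [v + 1 for v in x]


def blocking_block(x):
    # The computation consumes the collective's result, so it has to wait
    # for the communication to finish before it can start.
    received = fake_collective(x)
    return local_compute(received)


def overlapped_block(x):
    # Assumed toy modification: the computation reads x, which is already
    # available locally, so it can run while the collective is in flight.
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fake_collective, x)  # start communication
        local = local_compute(x)                   # overlaps with it
        received = pending.result()                # wait only at the end
    return [r + l for r, l in zip(received, local)]
```

The two blocks compute different functions of `x`, which mirrors the abstract's point: changing what a layer is allowed to read changes the model, so it has to be shown that the modified architecture stays as capable as the original.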

FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models
