Dense Backpropagation Improves Training for Sparse Mixture-of-Experts

arXiv — cs.LG · Wednesday, November 5, 2025 at 5:00:00 AM
A recent study published on arXiv introduces a training method for Mixture-of-Experts (MoE) models that supplies dense gradient updates through dense backpropagation. It targets a key challenge in MoE pretraining: because only a few experts are activated per token, gradient updates are sparse, which can hinder training stability and final performance. By letting gradient signal flow for every expert during training, the method improves MoE training while preserving sparse expert activation, the property MoE architectures rely on for efficiency. The result is relevant to the robustness and efficiency of large-scale models built on MoE architectures, and it fits into ongoing efforts to optimize transformer-based models and their training dynamics. While further validation is needed, the proposed dense-backpropagation technique is a notable step toward overcoming the limitations of sparse updates in MoE training.
— via World Pulse Now AI Editorial System
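
The summary above does not spell out how the dense gradients are obtained, so the sketch below is only one plausible realization, not necessarily the paper's method: a standard top-k router in which the experts that are not selected contribute a detached running average of their past outputs, so every routing probability enters the loss and receives a gradient while expert compute stays sparse. The class name `DenseGradRouterMoE`, the `ema_decay` parameter, and the EMA placeholder itself are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseGradRouterMoE(nn.Module):
    """Top-k MoE layer whose router receives a dense gradient (illustrative sketch)."""

    def __init__(self, d_model: int, n_experts: int, top_k: int = 2, ema_decay: float = 0.99):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
            )
            for _ in range(n_experts)
        ])
        self.top_k = top_k
        self.ema_decay = ema_decay
        # Running ("default") estimate of each expert's output; a buffer, not a parameter.
        self.register_buffer("default_out", torch.zeros(n_experts, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)       # (tokens, n_experts)
        _, topk_idx = probs.topk(self.top_k, dim=-1)    # sparse routing decision

        # Snapshot the placeholders so the in-place EMA update below
        # does not clash with tensors autograd saves for backward.
        defaults = self.default_out.detach().clone()

        # Dense placeholder mixture: every expert contributes its EMA output,
        # weighted by its routing probability, so every routing weight appears
        # in the loss and the router gets a gradient for all experts.
        out = probs @ defaults                          # (tokens, d_model)

        for e, expert in enumerate(self.experts):
            mask = (topk_idx == e).any(dim=-1)          # tokens routed to expert e
            if not mask.any():
                continue
            idx = mask.nonzero(as_tuple=True)[0]
            y = expert(x[idx])                          # real output, sparse compute
            p_e = probs[idx, e].unsqueeze(-1)
            # Swap the placeholder for the true output where this expert actually ran.
            out = out.index_add(0, idx, p_e * (y - defaults[e]))
            # Update the EMA placeholder outside the autograd graph.
            with torch.no_grad():
                self.default_out[e].lerp_(y.mean(dim=0), 1.0 - self.ema_decay)
        return out
```

For example, `DenseGradRouterMoE(16, 4)(torch.randn(8, 16)).sum().backward()` runs end to end; using a detached EMA keeps the extra cost to one matrix product per token instead of running every expert densely.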


Continue Reading
Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
Positive · Artificial Intelligence
The introduction of softpick, a drop-in replacement for softmax in transformer attention mechanisms, addresses attention sink and massive activations, achieving a consistent 0% sink rate in experiments with large models while producing hidden states with lower kurtosis and sparser attention maps.
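
The teaser does not reproduce the softpick formula, so the sketch below assumes one way a "rectified softmax" could look: rectify exp(x) − 1 in the numerator and normalize by the sum of absolute values, so entries with non-positive scores map to exactly zero and rows need not sum to one. The function name `rectified_softmax`, the `eps` constant, and the exact formulation are assumptions for illustration, not the paper's definition.

```python
import torch


def rectified_softmax(x: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    """Illustrative rectified softmax: relu(exp(x) - 1) / (sum |exp(x) - 1| + eps).

    Scores <= 0 map to exactly zero and rows need not sum to one, so the model
    is not forced to park leftover probability mass on a "sink" token.
    """
    # Scale numerator and denominator by exp(-max(x)); the ratio is unchanged
    # (up to the eps term) and exp(x - m) cannot overflow.
    m = x.amax(dim=dim, keepdim=True)
    z = torch.exp(x - m) - torch.exp(-m)          # = exp(-m) * (exp(x) - 1)
    num = torch.relu(z)
    den = z.abs().sum(dim=dim, keepdim=True) + eps
    return num / den


# Drop-in for the attention weights: softmax(scores) -> rectified_softmax(scores)
scores = torch.randn(2, 4, 8, 8)                  # (batch, heads, query, key)
weights = rectified_softmax(scores, dim=-1)       # rows may contain exact zeros
```

Because rows can sum to less than one, some queries can attend to almost nothing, which is consistent with the sparser attention maps the summary mentions.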
