arXiv:2512.16928v1 Announce Type: new 
Abstract: The Muon optimizer enjoys strong empirical performance and theoretical grounding. However, the super-linear cost of its orthonormalization step introduces increasing overhead with scale. To alleviate this cost, several works have attempted to reduce the size of the matrix entering the orthonormalization step. We introduce Dion2, a much simpler method for shrinking the matrix involved in Muon's computation compared to prior approaches. At a high level, Dion2 selects a fraction of rows or columns at each iteration and orthonormalizes only those. This sampling procedure makes the update sparse, reducing both computation and communication costs, which in turn improves the scalability of Muon.

Dion2 is a simpler method for shrinking the matrix that enters the Muon optimizer's orthonormalization step, which reduces that step's computational overhead. At each iteration it selects a fraction of the rows or columns and orthonormalizes only those, producing sparse updates that improve scalability.
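The row selection described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the function names, the uniform random sampling rule, and the use of an exact SVD for orthonormalization are all assumptions made for clarity (the abstract does not specify how rows are chosen or how the orthonormalization is computed).

```python
import numpy as np

def orthonormalize(mat):
    """Return U @ Vt from the reduced SVD, the nearest semi-orthogonal matrix.

    Assumption: an exact SVD is used here for clarity; it stands in for
    whatever orthonormalization routine the optimizer actually uses.
    """
    u, _, vt = np.linalg.svd(mat, full_matrices=False)
    return u @ vt

def sparse_orthonormal_update(grad, fraction, rng):
    """Orthonormalize only a sampled subset of rows (hypothetical helper).

    Rows that are not selected receive a zero update, so the result is sparse.
    Assumption: rows are sampled uniformly at random without replacement.
    """
    num_rows = grad.shape[0]
    k = max(1, int(round(fraction * num_rows)))
    rows = np.sort(rng.choice(num_rows, size=k, replace=False))
    update = np.zeros_like(grad)
    # The orthonormalization only sees a (k x n) matrix instead of (m x n).
    update[rows] = orthonormalize(grad[rows])
    return update, rows

rng = np.random.default_rng(0)
grad = rng.standard_normal((8, 16))  # stand-in for a gradient or momentum matrix
update, rows = sparse_orthonormal_update(grad, fraction=0.25, rng=rng)
print("selected rows:", rows)
print("nonzero rows in update:", np.flatnonzero(np.abs(update).sum(axis=1)))
```

Because only `k` of the `m` rows enter the orthonormalization, the expensive step runs on a smaller matrix, and only those rows of the update need to be communicated. Selecting columns instead would apply the same idea to `grad.T`.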

Dion2: A Simple Method to Shrink Matrix in Muon

arXiv:2601.08393v1 Announce Type: new 
Abstract: Scaling large models requires optimization strategies that ensure rapid convergence grounded in stability. Maximal Update Parametrization (μP) provides a theoretical safeguard for width-invariant Θ(1) activation control, whereas emerging optimizers like Muon are only "half-aligned" with these constraints: they control updates but allow weights to drift. To address this limitation, we introduce the Spectral Sphere Optimizer (SSO), which enforces strict module-wise spectral constraints on both weights and their updates. By deriving the steepest descent direction on the spectral sphere, SSO realizes a fully μP-aligned optimization process. To enable large-scale training, we implement SSO as an efficient parallel algorithm within Megatron. Through extensive pretraining on diverse architectures, including Dense 1.7B, MoE 8B-A1B, and 200-layer DeepNet models, SSO consistently outperforms AdamW and Muon. Furthermore, we observe significant practical stability benefits, including improved MoE router load balancing, suppressed outliers, and strictly bounded activations.

The Spectral Sphere Optimizer (SSO) is a new optimizer for training large language models (LLMs). It enforces strict module-wise spectral constraints on both the weights and their updates, addressing a limitation of existing optimizers such as Muon, which control updates but allow weights to drift.
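To make "spectral constraints on both weights and updates" concrete, the sketch below keeps a weight matrix at a fixed spectral norm after each step. It is a deliberately simplified illustration and not SSO itself: the paper derives a steepest descent direction on the spectral sphere, whereas this sketch just takes an orthonormalized step and rescales the result. The function names, the `radius` and `lr` values, and the rescaling rule are all assumptions.

```python
import numpy as np

def spectral_norm(mat):
    """Largest singular value of a 2-D matrix."""
    return np.linalg.norm(mat, ord=2)

def rescale_to_spectral_sphere(weight, radius):
    """Rescale the weight so its spectral norm equals `radius` (hypothetical helper)."""
    return weight * (radius / spectral_norm(weight))

def constrained_step(weight, grad, lr, radius):
    """One simplified step: bounded update, then pull the weight back to the sphere.

    Assumption: this rescaling is a stand-in for the paper's derived
    steepest descent direction, which is not reproduced here.
    """
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    update = u @ vt  # orthonormalized update, spectral norm 1
    return rescale_to_spectral_sphere(weight - lr * update, radius)

rng = np.random.default_rng(0)
weight = rescale_to_spectral_sphere(rng.standard_normal((6, 4)), radius=1.0)
grad = rng.standard_normal((6, 4))  # stand-in for a real gradient
new_weight = constrained_step(weight, grad, lr=0.1, radius=1.0)
print("spectral norm before:", spectral_norm(weight))
print("spectral norm after: ", spectral_norm(new_weight))
```

Without the rescaling, only the update would be controlled and the weight norm could drift over many steps, which is the "half-aligned" behaviour the abstract attributes to Muon.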
