Controlled LLM Training on Spectral Sphere
Positive · Artificial Intelligence
- A new optimization strategy called the Spectral Sphere Optimizer (SSO) has been introduced to improve the training of large language models (LLMs) by enforcing strict spectral constraints on both the weights and their updates, addressing limitations of existing optimizers such as Muon (see the illustrative sketch after this list).
- This development is significant because SSO is reported to improve the stability and convergence speed of LLM training, which could translate into more efficient training across architectures, including a dense 1.7B model and an 8B-A1B mixture-of-experts (MoE) model.
- The introduction of SSO reflects a growing trend in AI optimization research toward methods with built-in stability and efficiency guarantees, alongside parallel advancements such as AuON and ROOT, which target similar challenges in model training.
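
The summary does not specify SSO's exact update rule, but the core idea of spectrally constraining both the update and the weights can be illustrated with a minimal PyTorch sketch. It assumes a Muon-style Newton-Schulz orthogonalization to bound the update's spectral norm, plus a hypothetical projection that rescales the weights so their largest singular value sits on a fixed radius; the names `sso_step`, `project_to_spectral_sphere`, and `radius` are illustrative assumptions, not the published algorithm.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Approximately orthogonalize G (as in Muon) so its singular values
    # are pushed toward 1, bounding the spectral norm of the update.
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients used by Muon
    X = G / (G.norm() + eps)            # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def project_to_spectral_sphere(W, radius=1.0, iters=20):
    # Hypothetical projection (assumption, not SSO's published step):
    # rescale W so its leading singular value equals `radius`, keeping the
    # weights on a fixed "spectral sphere". Spectral norm is estimated by
    # power iteration.
    v = torch.randn(W.shape[1])
    for _ in range(iters):
        u = W @ v
        u = u / (u.norm() + 1e-12)
        v = W.T @ u
        v = v / (v.norm() + 1e-12)
    sigma = torch.dot(u, W @ v)         # estimate of the top singular value
    return W * (radius / (sigma + 1e-12))

def sso_step(W, grad, lr=0.02, radius=1.0):
    # One illustrative step: constrain the update's spectrum via
    # orthogonalization, apply it, then project the weights back onto
    # the spectral sphere (constraint on the weights themselves).
    update = newton_schulz_orthogonalize(grad)
    W = W - lr * update
    return project_to_spectral_sphere(W, radius)
```

Under this reading, constraining the weights as well as the updates is what would distinguish an SSO-like method from Muon, which orthogonalizes only the update; the actual SSO construction may differ.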
— via World Pulse Now AI Editorial System