Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales

arXiv — cs.LG · Monday, December 8, 2025 at 5:00:00 AM
  • Recent research has explored the scaling of matrix-preconditioned optimizers, such as Shampoo, SOAP, and Muon, through hyperparameter transfer, aiming to extend their gains beyond small-scale experiments. The study indicates that scaling the learning rate according to μP (maximal update parametrization) principles improves transferability, although finite-width deviations still shift the optimal learning rate; a minimal sketch of the generic scaling rule follows this list.
  • This development is significant because it addresses the inconsistent gains reported for advanced optimizers relative to the widely used AdamW. By refining how hyperparameters are scaled across model sizes, researchers aim to realize the full potential of these optimizers, which could yield faster convergence and better training outcomes in deep learning applications.
  • The ongoing evolution of optimization techniques reflects a broader trend in artificial intelligence, where researchers are continuously seeking methods to enhance model training efficiency and stability. Innovations like ROOT and ThermoLion, alongside advancements in existing optimizers, highlight the competitive landscape of AI optimization, emphasizing the importance of robust methodologies in tackling complex machine learning tasks.
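The learning-rate transfer rule alluded to in the first item can be made concrete. The sketch below shows the generic μP-style rescaling for hidden (matrix-shaped) parameters, where a learning rate tuned at a small proxy width is shrunk as width grows; the base width, base learning rate, and the 1/width exponent are illustrative assumptions, not values from the paper, whose point is precisely that finite-width deviations can pull the true optimum away from such a rule for matrix-preconditioned optimizers.

    # Minimal sketch of muP-style learning-rate transfer across widths.
    # The base width, base learning rate, and the 1/width exponent used
    # here are illustrative assumptions, not values from the paper.

    def transferred_lr(base_lr: float, width: int, base_width: int = 256) -> float:
        """Rescale a hidden-layer learning rate tuned at base_width.

        Under muP, hidden (matrix-like) parameters trained with an
        Adam-style update commonly use a learning rate proportional to
        base_width / width, so an optimum found on a small proxy model
        approximately transfers to wider models.
        """
        return base_lr * base_width / width

    if __name__ == "__main__":
        base_lr = 3e-3  # hypothetical value tuned at the proxy width
        for width in (256, 1024, 4096):
            print(f"width={width:5d}  lr={transferred_lr(base_lr, width):.2e}")

Under this convention the prediction at large width is only a starting point; the paper's reported finite-width deviations mean the actual optimal learning rate can drift from the transferred value.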
— via World Pulse Now AI Editorial System


Continue Reading
Correction of Decoupled Weight Decay
Neutral · Artificial Intelligence
A recent study challenges the conventional approach to decoupled weight decay in optimization algorithms, questioning the long-held assumption that the decay term should be proportional to the learning rate. Based on steady-state orthogonality arguments, the research suggests that proportionality to the square of the learning rate may be more appropriate. However, the findings indicate minimal impact on training dynamics when the perpendicular component of updates is removed.
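As a concrete illustration of the two conventions discussed above, the sketch below contrasts a decoupled weight-decay step whose decay term scales with the learning rate (the common AdamW convention) against one whose decay term scales with the square of the learning rate (the scaling the study argues for). The function names, the scalar-parameter formulation, and the example numbers are assumptions made for illustration, not the study's code.

    # Hedged sketch contrasting the two weight-decay conventions above,
    # applied to a single scalar parameter for simplicity. Names and
    # numbers are illustrative assumptions, not the study's code.

    def step_decay_prop_lr(w: float, grad_update: float, lr: float, wd: float) -> float:
        """Common AdamW convention: decoupled decay term scales with lr."""
        return w - lr * grad_update - lr * wd * w

    def step_decay_prop_lr_squared(w: float, grad_update: float, lr: float, wd: float) -> float:
        """Scaling the study argues for: decay term scales with lr**2."""
        return w - lr * grad_update - (lr ** 2) * wd * w

    if __name__ == "__main__":
        w, update, wd = 1.0, 0.1, 0.1
        for lr in (1e-1, 1e-2, 1e-3):
            print(f"lr={lr:.0e}  prop_lr={step_decay_prop_lr(w, update, lr, wd):.6f}  "
                  f"prop_lr_sq={step_decay_prop_lr_squared(w, update, lr, wd):.6f}")

The practical difference is how the effective decay strength behaves under a learning-rate schedule: tied linearly to the step size in the first convention, and shrinking much faster as the learning rate is annealed in the second.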