Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs
Positive · Artificial Intelligence
- A large-scale empirical study has demonstrated that a linear decay-to-zero (D2Z) learning rate schedule consistently outperforms the more common cosine decay schedule when training large language models (LLMs). The advantage is most pronounced at compute-optimal dataset sizes, and it grows as the dataset size increases (a minimal sketch of both schedules appears after this list).
- Adopting D2Z could make LLM training more efficient, reducing computational costs and improving model quality across applications, and it underscores how much headroom remains in optimizing training methodology itself.
- The research underscores ongoing challenges in LLM training, including issues like context drift in multi-turn interactions and the impact of quantization on model performance. As LLMs become integral to numerous applications, ensuring their reliability and efficiency remains a critical focus, prompting further exploration of innovative training techniques and frameworks.
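
To make the comparison concrete, below is a minimal sketch of the two schedule shapes discussed above: linear decay-to-zero versus cosine decay with a nonzero floor. The function names, the warmup handling, and the 10% floor (`min_lr_ratio=0.1`) for cosine decay are illustrative assumptions, not details taken from the study.

```python
import math


def linear_d2z_lr(step: int, max_steps: int, peak_lr: float, warmup_steps: int = 0) -> float:
    """Linear decay-to-zero (D2Z): warm up to peak_lr, then decay linearly to exactly 0."""
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(max_steps - warmup_steps, 1)
    return peak_lr * max(0.0, 1.0 - progress)


def cosine_lr(step: int, max_steps: int, peak_lr: float,
              warmup_steps: int = 0, min_lr_ratio: float = 0.1) -> float:
    """Cosine decay as commonly configured: warm up, then decay to a nonzero floor
    (here assumed to be 10% of the peak rate)."""
    if step < warmup_steps:
        return peak_lr * step / max(warmup_steps, 1)
    progress = min((step - warmup_steps) / max(max_steps - warmup_steps, 1), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)


if __name__ == "__main__":
    # Compare the final learning rates: D2Z reaches 0, cosine stops at its floor.
    steps, peak = 10_000, 3e-4
    print("D2Z final LR:   ", linear_d2z_lr(steps, steps, peak, warmup_steps=500))
    print("Cosine final LR:", cosine_lr(steps, steps, peak, warmup_steps=500))
```

The key structural difference is simply where the schedule ends: D2Z drives the learning rate all the way to zero by the last step, whereas the common cosine recipe leaves it at a fraction of the peak.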
— via World Pulse Now AI Editorial System