How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining
Neutral · Artificial Intelligence
- Recent research highlights an inefficiency of learning rate decay in curriculum-based pretraining of large language models (LLMs): when a curriculum saves the highest-quality data for late in training, that data is processed precisely when the decayed learning rate is smallest, so much of its value is wasted. The study finds that curriculum ordering is beneficial under a constant learning rate, but that its advantages diminish under standard decay schedules (see the sketch after this list).
- This finding is significant because it suggests that matching the learning rate schedule to the data ordering could make LLM training more effective, improving performance and making fuller use of the available high-quality data.
- The implications extend to broader discussions of LLM training methodology, including how data quality is managed during pretraining and whether alternative schedules or training frameworks can avoid the limitations of traditional learning rate decay.
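To make the interaction concrete, here is a minimal Python sketch, not taken from the study: the schedule parameters, the quality-ascending curriculum model, and all function names are hypothetical. It compares the average learning rate that a standard cosine decay schedule versus a constant schedule applies to the highest-quality slice of the data when quality rises over the course of training.

```python
# Illustrative sketch (hypothetical parameters, not the paper's setup):
# how a cosine decay schedule interacts with a quality-ascending curriculum.
import math

TOTAL_STEPS = 10_000   # assumed training length
PEAK_LR = 3e-4         # assumed peak learning rate
MIN_LR = 3e-5          # assumed final learning rate

def cosine_decay_lr(step: int) -> float:
    """Standard cosine decay from PEAK_LR down to MIN_LR over TOTAL_STEPS."""
    progress = step / TOTAL_STEPS
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

def constant_lr(step: int) -> float:
    """Constant schedule: every batch receives the same learning rate."""
    return PEAK_LR

def data_quality(step: int) -> float:
    """Hypothetical quality-ascending curriculum: quality rises from 0.1 to 1.0."""
    return 0.1 + 0.9 * (step / TOTAL_STEPS)

def avg_lr_on_best_data(schedule, top_fraction: float = 0.1) -> float:
    """Average learning rate applied to the top `top_fraction` of steps by data quality."""
    ranked = sorted(range(TOTAL_STEPS), key=data_quality, reverse=True)
    top = ranked[: int(top_fraction * TOTAL_STEPS)]
    return sum(schedule(s) for s in top) / len(top)

if __name__ == "__main__":
    print(f"avg LR on best 10% of data, cosine decay: {avg_lr_on_best_data(cosine_decay_lr):.2e}")
    print(f"avg LR on best 10% of data, constant LR:  {avg_lr_on_best_data(constant_lr):.2e}")
    # Because the curriculum defers its best data to the final steps, the decayed
    # schedule updates on that data at near-minimum learning rates, while the
    # constant schedule weights it as heavily as everything else.
```

Under these assumed settings, the decayed schedule applies roughly an order of magnitude less learning rate to the best data than the constant schedule does, which matches the intuition behind the headline finding that decay "wastes" the best data in a curriculum.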
— via World Pulse Now AI Editorial System