RLAX: Large-Scale, Distributed Reinforcement Learning for Large Language Models on TPUs
Positive · Artificial Intelligence
- RLAX is a scalable reinforcement-learning framework for training large language models (LLMs) on TPUs, built to enhance their reasoning capabilities. It uses a parameter-server architecture to manage model weights and generate new rollouts efficiently (see the sketch after this list), improving QwQ-32B's pass@8 accuracy by 12.8% within a short training period while remaining robust to preemptions.
- The result is significant because it demonstrates that RLAX can improve LLM performance at scale, making it a useful tool for researchers and developers working to strengthen AI reasoning across a range of applications.
- RLAX's development aligns with broader efforts in the AI community to refine LLMs, including work on curbing overthinking in reasoning processes and on privacy risks from information leakage. Together, these themes underscore the need to balance model performance with efficiency and ethical considerations in AI deployment.
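RLAX's actual interfaces are not given in this summary, so the following is only a minimal in-process sketch of the parameter-server pattern described above, written in JAX/Python. Every name here (`ParameterServer`, `push`, `pull`, `toy_policy`, `generate_rollout`) is a hypothetical illustration rather than RLAX's API: a trainer publishes versioned weight snapshots, and rollout workers pull the newest snapshot before sampling, which is one way such a design can tolerate preemptions.

```python
# Minimal in-process sketch of a parameter-server training loop.
# All names are hypothetical illustrations, not RLAX's actual API;
# a real deployment would replace the in-memory store with networked TPU hosts.
import threading
from typing import Any

import jax
import jax.numpy as jnp


class ParameterServer:
    """Holds versioned weight snapshots; workers always read the newest one."""

    def __init__(self, params: Any):
        self._lock = threading.Lock()
        self._params = params
        self._version = 0

    def push(self, params: Any) -> None:
        # Trainer publishes an updated snapshot after each optimizer step.
        with self._lock:
            self._params = params
            self._version += 1

    def pull(self) -> tuple[int, Any]:
        # Rollout workers fetch the latest weights; after a preemption a
        # worker simply calls pull() again and resumes from the newest version.
        with self._lock:
            return self._version, self._params


def toy_policy(params: jnp.ndarray, obs: jnp.ndarray) -> jnp.ndarray:
    # Stand-in for the LLM forward pass: a single linear layer.
    return obs @ params


def generate_rollout(server: ParameterServer, key: jax.Array) -> tuple[int, jnp.ndarray]:
    # Pull the freshest weights, then sample a small batch of actions.
    version, params = server.pull()
    obs = jax.random.normal(key, (4, params.shape[0]))
    actions = toy_policy(params, obs)
    return version, actions


if __name__ == "__main__":
    key = jax.random.PRNGKey(0)
    server = ParameterServer(jnp.zeros((8, 2)))
    for step in range(3):
        # Trainer step: pretend-gradient update, then publish new weights.
        key, sub = jax.random.split(key)
        _, params = server.pull()
        server.push(params + 0.01 * jax.random.normal(sub, params.shape))
        # Worker step: pull the freshest weights and generate a rollout.
        key, sub = jax.random.split(key)
        version, actions = generate_rollout(server, sub)
        print(f"rollout used weight version {version}, actions shape {actions.shape}")
```

The versioned pull/push contract is the point of the sketch: because workers are stateless between rollouts, a preempted worker can restart and resume by fetching the latest snapshot, which is consistent with the robustness to preemptions claimed above.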
— via World Pulse Now AI Editorial System
