Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards

arXiv — cs.LG · Monday, November 24, 2025 at 5:00:00 AM
  • Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards (MR-RLVR), introduced in a recent arXiv submission, aims to strengthen the mathematical reasoning of large language models (LLMs) by deriving process-level self-supervised rewards from intermediate reasoning steps. It targets limitations of existing RLVR training in handling intermediate reasoning and in verifying final answers, limitations that are particularly acute in theorem proving (a schematic sketch of the masking-and-reordering idea follows the summary below).
  • MR-RLVR matters because it marks a shift toward training methodologies that reward the reasoning process itself rather than only the final answer, which could improve LLM performance on complex reasoning tasks. By supervising intermediate steps, it aims to reduce reliance on rote memorization and to help models produce coherent, logically connected responses.
  • This advancement reflects a broader trend in artificial intelligence research, where enhancing reasoning capabilities is increasingly prioritized. The interplay between self-supervised learning and reinforcement learning is becoming a focal point, as researchers aim to mitigate issues such as overthinking and redundant reasoning steps, which can hinder efficiency and effectiveness in LLMs.
— via World Pulse Now AI Editorial System
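
The paper's exact training recipe is not described here, but the name suggests two self-supervised signals computed over a model's own solution trace: fill in a masked step, and recover the original order of shuffled steps. The Python sketch below illustrates that idea only; the function names, the 50/50 weighting, and the toy token-overlap scorer are assumptions, and a real setup would query the policy model rather than the oracle stand-ins used here so the example runs.

```python
import random
from typing import Callable, List

def make_masked_task(steps: List[str], mask_idx: int, mask_token: str = "<MASK>") -> List[str]:
    """Replace one intermediate step with a mask token; the model must fill it back in."""
    masked = list(steps)
    masked[mask_idx] = mask_token
    return masked

def make_reordered_task(steps: List[str], rng: random.Random) -> List[str]:
    """Shuffle the steps; the model must recover the original order."""
    shuffled = list(steps)
    rng.shuffle(shuffled)
    return shuffled

def token_overlap(pred: str, target: str) -> float:
    """Toy similarity score used as a stand-in for a real reconstruction metric."""
    p, t = set(pred.lower().split()), set(target.lower().split())
    return len(p & t) / max(len(t), 1)

def process_reward(
    steps: List[str],
    fill_in: Callable[[List[str]], str],        # hypothetical: model completes the masked step
    reorder: Callable[[List[str]], List[str]],  # hypothetical: model proposes an ordering
    seed: int = 0,
) -> float:
    """Combine mask-filling and reordering scores into a process-level reward in [0, 1]."""
    rng = random.Random(seed)
    mask_idx = rng.randrange(len(steps))
    masked = make_masked_task(steps, mask_idx)
    fill_score = token_overlap(fill_in(masked), steps[mask_idx])

    shuffled = make_reordered_task(steps, rng)
    proposed = reorder(shuffled)
    order_score = sum(a == b for a, b in zip(proposed, steps)) / len(steps)

    return 0.5 * fill_score + 0.5 * order_score  # assumed equal weighting

if __name__ == "__main__":
    trace = [
        "Let n be an even integer, so n = 2k for some integer k.",
        "Then n^2 = 4k^2 = 2(2k^2).",
        "Since 2k^2 is an integer, n^2 is even.",
    ]
    # Oracle stand-ins so the sketch runs; a real setup would query the policy model.
    oracle_fill = lambda masked: trace[masked.index("<MASK>")]
    oracle_reorder = lambda shuffled: list(trace)
    print(f"process-level reward: {process_reward(trace, oracle_fill, oracle_reorder):.2f}")
```

In an RLVR loop, a process-level score of this kind would presumably be combined with the verifiable final-answer reward rather than replace it.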


Continue Reading
Incentivizing Multi-Tenant Split Federated Learning for Foundation Models at the Network Edge
Positive · Artificial Intelligence
A novel Price-Incentive Mechanism (PRINCE) has been proposed to enhance Multi-Tenant Split Federated Learning (SFL) for Foundation Models (FMs) like GPT-4, enabling efficient fine-tuning on resource-constrained devices while maintaining privacy. This mechanism addresses the coordination challenges faced by multiple SFL tenants with diverse fine-tuning needs.
PRPO: Aligning Process Reward with Outcome Reward in Policy Optimization
Positive · Artificial Intelligence
Process Relative Policy Optimization (PRPO) aims to improve policy optimization for large language models (LLMs) by aligning process rewards with outcome rewards, addressing limitations of existing critic-free methods such as GRPO. PRPO segments reasoning sequences and normalizes the feedback each segment receives, which improves the accuracy of models such as Qwen2.5-Math-1.5B on benchmarks like MATH500 (a rough sketch of segment-level normalization appears after this list).
Generating Text from Uniform Meaning Representation
Neutral · Artificial Intelligence
Recent advancements in Uniform Meaning Representation (UMR) have led to the exploration of methods for generating text from multilingual UMR graphs, enhancing the capabilities of semantic representation in natural language processing. This research aims to develop a technological ecosystem around UMR, building on the existing frameworks of Abstract Meaning Representation (AMR).
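
The PRPO summary above mentions segmenting reasoning sequences and normalizing feedback. The sketch below shows one plausible reading of that idea, GRPO-style normalization across a group of sampled responses with a per-segment adjustment; the blending weight, the normalization scheme, and the segment-level shift are assumptions for illustration, not the paper's actual formulation.

```python
import statistics
from typing import List

def normalized_segment_advantages(
    segment_rewards: List[List[float]],  # per-segment process rewards, one list per sampled response
    outcome_rewards: List[float],        # verifiable outcome reward per sampled response
    blend: float = 0.5,                  # assumed mixing weight between process and outcome signal
) -> List[List[float]]:
    """Blend per-segment process rewards with the response-level outcome reward,
    then normalize across the group of sampled responses (GRPO-style)."""
    # Response-level score = mean segment reward blended with the outcome reward.
    blended = [
        blend * (sum(segs) / len(segs)) + (1.0 - blend) * out
        for segs, out in zip(segment_rewards, outcome_rewards)
    ]
    mu = statistics.mean(blended)
    sigma = statistics.pstdev(blended) or 1.0  # avoid division by zero
    # Each segment inherits its response's normalized advantage, shifted by how the
    # segment compares to the other segments of the same response.
    advantages = []
    for segs, b in zip(segment_rewards, blended):
        seg_mu = sum(segs) / len(segs)
        advantages.append([(b - mu) / sigma + (s - seg_mu) for s in segs])
    return advantages

if __name__ == "__main__":
    seg_rewards = [[0.2, 0.8, 0.9], [0.1, 0.3, 0.2]]  # toy process scores for two sampled responses
    outcomes = [1.0, 0.0]                              # verifiable final-answer rewards
    for adv in normalized_segment_advantages(seg_rewards, outcomes):
        print([round(a, 2) for a in adv])
```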
