Efficient Reasoning for Large Reasoning Language Models via Certainty-Guided Reflection Suppression

arXiv — cs.LG · Tuesday, November 18, 2025 at 5:00:00 AM
  • A novel method called Certainty-Guided Reflection Suppression (CGRS) is introduced to make large reasoning language models (LRLMs) reason more efficiently by suppressing redundant reflection steps.
  • The introduction of CGRS is significant because it enables more efficient use of LRLMs, reducing inference costs and improving practical utility; this could broaden their adoption across fields. A minimal sketch of the idea follows below.
— via World Pulse Now AI Editorial System
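
The summary above gives only the high-level idea, so here is a minimal sketch of one plausible reading of certainty-guided suppression: during decoding, block reflection-trigger tokens (e.g. "Wait", "Hmm") whenever the model's next-token certainty is already high. The trigger ids, the top-1-probability certainty proxy, and the threshold are illustrative assumptions, not the paper's exact formulation.

```python
import math

# Hypothetical reflection-trigger token ids (e.g. for "Wait", "Hmm");
# the paper's actual trigger set, certainty measure, and threshold are
# not given in this summary, so everything below is illustrative.
REFLECTION_TRIGGER_IDS = {1734, 5821}
CERTAINTY_THRESHOLD = 0.9

def suppress_reflection(logits, trigger_ids=REFLECTION_TRIGGER_IDS,
                        threshold=CERTAINTY_THRESHOLD):
    """Mask reflection-trigger tokens when next-token certainty is high.

    `logits` is a list of raw scores over the vocabulary. Certainty is
    approximated here as the top-1 softmax probability.
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    top1 = max(exps) / sum(exps)          # proxy for model certainty
    if top1 >= threshold:
        # High certainty: block redundant reflection and keep decoding.
        return [(-math.inf) if i in trigger_ids else x
                for i, x in enumerate(logits)]
    return logits
```

Applied as a per-step filter inside a decoding loop, this leaves the distribution untouched while the model is uncertain and only prunes reflection once it has effectively settled on an answer.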

Continue Reading
The Evolution of Thought: Tracking LLM Overthinking via Reasoning Dynamics Analysis
Neutral · Artificial Intelligence
A recent study titled 'The Evolution of Thought: Tracking LLM Overthinking via Reasoning Dynamics Analysis' explores the performance of large language models (LLMs) during test-time scaling, revealing that explicit reasoning trajectories can enhance performance but may also lead to overthinking. The research introduces two analytical lenses: Reasoning Length Dynamics and Reasoning Semantic Dynamics, which help identify a Reasoning Completion Point (RCP) for optimizing computational efficiency.
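
As a toy illustration of the idea, one way to operationalize a Reasoning Completion Point is to track the semantic dynamics of successive reasoning steps and stop once they plateau. The cosine-similarity measure, plateau threshold, and patience window below are assumptions made for illustration, not the study's actual criterion.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def reasoning_completion_point(step_embeddings, plateau=0.98, patience=2):
    """Return the step index after which reasoning looks semantically stable.

    `step_embeddings` holds one vector per reasoning step (e.g. sentence
    embeddings of chain-of-thought segments). A run of `patience`
    consecutive near-identical steps is treated as the RCP; both the
    plateau threshold and the patience window are assumptions here.
    """
    stable = 0
    for i in range(1, len(step_embeddings)):
        if cosine(step_embeddings[i - 1], step_embeddings[i]) >= plateau:
            stable += 1
            if stable >= patience:
                return i        # candidate Reasoning Completion Point
        else:
            stable = 0
    return len(step_embeddings) - 1   # no plateau found; use the last step
```

Truncating generation at this index is how such a detector would trade a small amount of reasoning for a large cut in compute when the model keeps rephrasing an answer it already has.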
PRPO: Aligning Process Reward with Outcome Reward in Policy Optimization
Positive · Artificial Intelligence
The introduction of Process Relative Policy Optimization (PRPO) aims to enhance policy optimization for large language models (LLMs) by aligning process rewards with outcome rewards, addressing the limitations of existing critic-free methods like GRPO. PRPO provides a more nuanced approach by segmenting reasoning sequences and normalizing feedback, which improves the accuracy of models such as Qwen2.5-Math-1.5B on tasks like MATH500.
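
A hedged sketch of the segment-and-normalize idea described above: split per-step process rewards into segments, normalize within each segment for local credit assignment, then blend in the final outcome reward so the two signals point the same way. The fixed segment length and the mixing weight are illustrative assumptions, not PRPO's published rule.

```python
def prpo_style_advantages(step_rewards, segment_len, outcome_reward,
                          mix=0.5):
    """Toy blend of segment-normalized process rewards with the outcome.

    Per-step rewards are split into fixed-length segments and normalized
    to zero mean within each segment, then mixed with the final outcome
    reward. `segment_len` and `mix` are illustrative, not PRPO's rule.
    """
    advantages = []
    for start in range(0, len(step_rewards), segment_len):
        seg = step_rewards[start:start + segment_len]
        mean = sum(seg) / len(seg)
        # Zero-mean within each segment: steps compete locally.
        advantages.extend(r - mean for r in seg)
    # The outcome reward anchors every step to the final result.
    return [(1 - mix) * a + mix * outcome_reward for a in advantages]

# Example: six reasoning steps, segments of three, correct final answer.
print(prpo_style_advantages([0.2, 0.5, 0.8, 0.1, 0.1, 0.4],
                            segment_len=3, outcome_reward=1.0))
```

The point of the blend is the alignment the summary describes: a step can only receive a strongly positive signal when it both stands out within its segment and belongs to a trajectory that reached the right outcome.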
