Can Confidence Estimates Decide When Chain-of-Thought Is Necessary for LLMs?

arXiv — cs.CL · Tuesday, October 28, 2025 at 4:00:00 AM
A recent study examines chain-of-thought (CoT) prompting in large language models (LLMs) such as GPT-OSS and Qwen3. While CoT can improve reasoning and accuracy on complex tasks, it often spends tokens unnecessarily, which limits its usefulness in practice. The research asks whether confidence estimates can decide when CoT is truly needed, aiming to balance reasoning depth against efficiency. If such estimates are reliable, models could skip extended reasoning on easy inputs and reserve it for hard ones, making LLMs cheaper and faster to use without sacrificing accuracy.
— via World Pulse Now AI Editorial System
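As a rough illustration of the general idea (not the paper's actual method), the sketch below gates CoT on a confidence proxy computed from the direct answer's token log-probabilities. The `query_model` stub, the mean log-probability confidence measure, and the 0.85 threshold are all hypothetical placeholders; a real setup would call an LLM API that returns per-token log-probabilities.

```python
import math
from typing import List, Tuple

# Hypothetical stand-in for an LLM call; a real implementation would return the
# model's answer string plus per-token log-probabilities from its API.
def query_model(prompt: str) -> Tuple[str, List[float]]:
    return "42", [-0.05, -0.10, -0.02]  # placeholder output

def mean_logprob_confidence(logprobs: List[float]) -> float:
    """Average token probability of the direct answer, used as a confidence proxy."""
    return math.exp(sum(logprobs) / len(logprobs))

def answer_with_optional_cot(question: str, threshold: float = 0.85) -> str:
    """Answer directly; fall back to chain-of-thought only when confidence is low."""
    direct_prompt = f"Answer concisely: {question}"
    answer, logprobs = query_model(direct_prompt)
    if mean_logprob_confidence(logprobs) >= threshold:
        return answer  # confident enough: skip the extra reasoning tokens
    cot_prompt = f"Think step by step, then answer: {question}"
    cot_answer, _ = query_model(cot_prompt)
    return cot_answer

if __name__ == "__main__":
    print(answer_with_optional_cot("What is 6 * 7?"))
```

The appeal of this kind of routing is that the cost of CoT is only paid when the cheap direct pass looks unreliable; the open question the study raises is how well such confidence signals actually track when CoT is needed.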


Continue Reading
Learning to Reason: Training LLMs with GPT-OSS or DeepSeek R1 Reasoning Traces
Positive · Artificial Intelligence
Recent advances in large language models (LLMs) have introduced test-time scaling techniques that enhance reasoning, as demonstrated by models like DeepSeek-R1 and OpenAI's gpt-oss. These models generate intermediate reasoning traces that improve accuracy on complex problems, and those traces can then be used to post-train smaller models without extensive human input.
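For illustration only, the sketch below shows one way reasoning traces from a larger model might be packaged into supervised fine-tuning data for a smaller one. The example trace, field names, and output file are hypothetical and not taken from the paper.

```python
import json

# Hypothetical example trace; in practice these would be generated by a larger
# reasoning model (e.g. DeepSeek-R1 or gpt-oss) and filtered for correct answers.
traces = [
    {
        "question": "If a train travels 60 km in 45 minutes, what is its speed in km/h?",
        "reasoning": "45 minutes is 0.75 hours, so speed = 60 / 0.75 = 80 km/h.",
        "answer": "80 km/h",
    },
]

def to_sft_example(record: dict) -> dict:
    """Format one trace as a prompt/completion pair for supervised fine-tuning."""
    prompt = f"Question: {record['question']}\nThink step by step."
    completion = f"{record['reasoning']}\nFinal answer: {record['answer']}"
    return {"prompt": prompt, "completion": completion}

# Write JSONL that a standard fine-tuning pipeline could consume.
with open("sft_traces.jsonl", "w") as f:
    for record in traces:
        f.write(json.dumps(to_sft_example(record)) + "\n")
```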