Stabilizing Reinforcement Learning for Honesty Alignment in Language Models on Deductive Reasoning

arXiv — cs.CL • Thursday, November 13, 2025 at 5:00:00 AM

A recent study on stabilizing reinforcement learning for honesty alignment examines why language models are hard to train on deductive reasoning tasks. Reinforcement learning with verifiable rewards (RLVR) shows promise for aligning language models with complex reasoning objectives, but existing methods often fail when negative rewards dominate early training. The study introduces two multi-step deductive reasoning datasets, covering linear algebra and logical inference, and finds that GRPO struggles on them. To address this, it proposes the Anchor method, which injects ground-truth trajectories into rollouts, and shows that it stabilizes training. The study also indicates that curriculum learning can help but requires carefully designed datasets. This work is crucial as it not only advances the understanding of honesty alignment in AI but also lays the gro…

— via World Pulse Now AI Editorial System
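
The failure mode described in the summary, where negative rewards dominate early training, can be illustrated with the group-normalized advantages that GRPO-style training uses. The sketch below is a toy under stated assumptions, not the paper's implementation: the reward values (-1 for a failed rollout, +1 for a ground-truth trajectory), the group size, and the helper name `group_advantages` are hypothetical, and the summary does not say how Anchor actually injects trajectories.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Assumed early-training group: every sampled rollout fails and gets reward -1.
failed_rollouts = [-1.0, -1.0, -1.0, -1.0]
no_signal = group_advantages(failed_rollouts)

# Hypothetical anchor injection: swap one rollout for a ground-truth
# trajectory that earns reward +1.
anchored_group = failed_rollouts[:-1] + [1.0]
with_signal = group_advantages(anchored_group)

print(no_signal)
print(with_signal)
```

When every rollout fails, the rewards are identical, so every advantage is zero and the update carries no learning signal. With one anchored trajectory the group has nonzero spread, so the ground-truth trajectory gets a positive advantage and the failed rollouts get negative ones.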

Recommended Readings

arXiv — cs.LG • a day ago

On the Entropy Calibration of Language Models

Neutral • Artificial Intelligence

The paper examines entropy calibration in language models, asking whether a model's entropy matches its log loss on human text. Previous studies indicated that as generated text gets longer, entropy increases while text quality declines, highlighting a fundamental issue in autoregressive models. The authors investigate whether miscalibration improves with scale and whether calibration without tradeoffs is theoretically feasible, analyzing the scaling behavior with respect to dataset size and power-law exponents.
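
The comparison between entropy and log loss can be made concrete with a toy calculation. This is a minimal sketch under simplifying assumptions: the distributions and token indices are made up, the helper names are hypothetical, and entropy is measured per step on fixed prefixes, whereas the paper may measure it differently (for example, over the model's own generations).

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def log_loss(probs, observed):
    """Negative log-probability the model assigns to the observed token."""
    return -math.log(probs[observed])

def calibration_gap(step_probs, human_tokens):
    """Mean entropy minus mean log loss; zero means the two agree on this data."""
    n = len(human_tokens)
    mean_entropy = sum(entropy(p) for p in step_probs) / n
    mean_loss = sum(log_loss(p, t) for p, t in zip(step_probs, human_tokens)) / n
    return mean_entropy - mean_loss

# Hypothetical toy data: three steps over a 4-token vocabulary.
uniform = [[0.25, 0.25, 0.25, 0.25]] * 3
gap_uniform = calibration_gap(uniform, [0, 2, 3])

# The human text always picks the model's top token, yet the model still
# spreads mass over other tokens, so entropy exceeds log loss.
peaked = [[0.7, 0.1, 0.1, 0.1]] * 3
gap_peaked = calibration_gap(peaked, [0, 0, 0])

print(gap_uniform, gap_peaked)
```

A gap of zero corresponds to the calibrated case; a positive gap means the model is more uncertain than its loss on the reference text would suggest.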

arXiv — cs.LG • 2 days ago

Potent but Stealthy: Rethink Profile Pollution against Sequential Recommendation via Bi-level Constrained Reinforcement Paradigm

Positive • Artificial Intelligence

The paper addresses the vulnerability of sequential recommenders to adversarial attacks. It highlights the Profile Pollution Attack (PPA), which subtly contaminates user interactions to induce mispredictions. The authors propose a method called CREAT, which combines bi-level optimization with reinforcement learning to make such attacks both stealthier and more effective, overcoming limitations of previous methods.

arXiv — cs.CL • 2 days ago

LDC: Learning to Generate Research Idea with Dynamic Control

Positive • Artificial Intelligence

Recent advancements in large language models (LLMs) highlight their potential in automating scientific research ideation. Current methods often produce ideas that do not meet expert standards of novelty, feasibility, and effectiveness. To address these issues, a new framework is proposed that combines Supervised Fine-Tuning (SFT) and controllable Reinforcement Learning (RL) to enhance the quality of generated research ideas through a two-stage approach.

arXiv — cs.LG • 2 days ago

Behaviour Policy Optimization: Provably Lower Variance Return Estimates for Off-Policy Reinforcement Learning

Positive • Artificial Intelligence

The paper addresses the high variance of return estimates in reinforcement learning algorithms. It highlights that a well-designed behavior policy can collect off-policy data that yields lower-variance return estimates. This finding implies that on-policy data collection is not optimal for variance, and the authors extend the insight to online reinforcement learning, where policy evaluation and improvement occur simultaneously.
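
The claim that on-policy sampling is not variance-optimal can be checked in a textbook setting. The sketch below is not the paper's algorithm: it assumes a hypothetical one-step problem with two actions and deterministic rewards, and computes the exact variance of a single-sample importance-sampling estimate. The function name and all numbers are illustrative.

```python
def is_variance(target, behavior, rewards):
    """Exact variance of the one-sample importance-sampling estimate of the
    target policy's expected reward, when actions are drawn from `behavior`.
    Every behavior probability must be strictly positive."""
    expected = sum(p * r for p, r in zip(target, rewards))
    # E_behavior[(p * r / b)^2] simplifies to sum((p * r)^2 / b).
    second_moment = sum((p * r) ** 2 / b for p, b, r in zip(target, behavior, rewards))
    return second_moment - expected ** 2

target_policy = [0.5, 0.5]
rewards = [1.0, 3.0]

# On-policy: sample from the target policy itself.
on_policy_var = is_variance(target_policy, target_policy, rewards)

# Off-policy: sample each action in proportion to target probability times reward.
tuned_behavior = [0.25, 0.75]
off_policy_var = is_variance(target_policy, tuned_behavior, rewards)

print(on_policy_var, off_policy_var)
```

In this toy case the tuned behavior policy makes every weighted sample equal to the true expected reward, so the estimate has lower variance than on-policy sampling. Real problems have multi-step returns and unknown rewards, which is where a learned behavior policy would come in.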

arXiv — cs.LG • 2 days ago

Mining-Gym: A Configurable RL Benchmarking Environment for Truck Dispatch Scheduling

Positive • Artificial Intelligence

Mining-Gym is introduced as a configurable, open-source benchmarking environment aimed at optimizing truck dispatch scheduling in mining operations. The dynamic and stochastic nature of mining environments, characterized by uncertainties such as equipment failures and variable haul cycle times, poses challenges to traditional optimization methods. By leveraging Reinforcement Learning (RL), Mining-Gym provides a platform for training, testing, and evaluating RL algorithms, enhancing the efficiency and adaptability of decision-making in mining logistics.

arXiv — cs.LG • 2 days ago

Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients

Positive • Artificial Intelligence

The article discusses the reconciliation of two distinct approaches to policy gradient optimization for the Pass@K objective in reinforcement learning, specifically direct REINFORCE-style methods and advantage-shaping techniques that modify GRPO. It reveals that these methods are two sides of the same coin and interprets hard-example up-weighting modifications as reward-level regularization. Additionally, it provides a recipe for deriving both existing and new advantage-shaping methods, offering insights into RLVR policy gradient optimization beyond the initial focus on Pass@K.
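
For readers unfamiliar with the Pass@K objective, the sketch below computes the standard combinatorial estimate of pass@k from n sampled attempts of which c are correct. It only illustrates the metric being optimized; it is not taken from the article and does not implement either family of policy gradient methods. The function name and sample counts are illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k from n samples with c correct: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # C(n - c, k) / C(n, k) is the chance that k samples drawn without
    # replacement from the n attempts are all incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 sampled attempts, 3 of them correct.
single_try = pass_at_k(n=10, c=3, k=1)
five_tries = pass_at_k(n=10, c=3, k=5)

print(single_try, five_tries)
```

Because the metric rewards having at least one correct attempt among k, a problem solved only occasionally still contributes substantially at larger k, which gives some intuition for why hard examples matter for this objective.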

arXiv — cs.CL • 2 days ago

From Efficiency to Adaptivity: A Deeper Look at Adaptive Reasoning in Large Language Models

Neutral • Artificial Intelligence

Recent advancements in large language models (LLMs) have made reasoning a focal benchmark for evaluating intelligence. This article critiques the uniform reasoning strategies of current LLMs, which often generate lengthy reasoning for simple tasks while struggling with complex ones. It introduces the concept of adaptive reasoning, in which models adjust their reasoning effort to task difficulty and uncertainty, and outlines three key contributions to understanding this approach.

arXiv — cs.LG • 2 days ago

DiAReL: Reinforcement Learning with Disturbance Awareness for Robust Sim2Real Policy Transfer in Robot Control

Positive • Artificial Intelligence

The paper introduces a disturbance-augmented Markov decision process (DAMDP) to enhance reinforcement learning in robotic control. It addresses the challenges of sim2real transfer, where models trained in simulation often perform poorly in the real world because of discrepancies in system dynamics. The proposed approach aims to improve the robustness and stabilization of control responses in robotic systems.
