MURPHY: Multi-Turn GRPO for Self Correcting Code Generation

arXiv — cs.LG · Wednesday, November 12, 2025 at 5:00:00 AM
Murphy advances the reasoning capabilities of large language models through a multi-turn reflective optimization framework. Building on Group Relative Policy Optimization (GRPO), Murphy adds iterative self-correction, letting a model progressively refine its own outputs across turns, a capability that matters for complex decision-making tasks where single-shot GRPO has struggled. On code generation benchmarks with model families such as Qwen and OLMo, Murphy consistently outperforms GRPO, achieving up to an 8% relative gain in pass@1. Beyond demonstrating Murphy's effectiveness, the results point to further headroom for reinforcement learning frameworks that use verifiable rewards to improve model performance.
— via World Pulse Now AI Editorial System
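
To make the idea concrete, here is a minimal sketch of what a multi-turn, group-relative self-correction loop could look like. This is not the paper's algorithm: the rollout logic, the feedback format, and the helper names (generate, run_unit_tests, multi_turn_rollout) are assumptions introduced for illustration; only the GRPO-style group-relative advantage and the verifiable pass/fail-style reward come from the summary above.

```python
# Sketch of a multi-turn self-correction rollout with GRPO-style advantages.
# All helpers are hypothetical placeholders, not the MURPHY implementation.

import random
from dataclasses import dataclass


@dataclass
class Attempt:
    turn: int
    code: str
    reward: float  # verifiable reward, e.g. fraction of unit tests passed


def generate(prompt: str, feedback: str | None = None) -> str:
    """Placeholder for an LLM call; returns a candidate program."""
    return f"# candidate for {prompt!r} (feedback: {feedback!r})"


def run_unit_tests(code: str) -> float:
    """Placeholder verifiable reward: fraction of hidden tests passed."""
    return random.random()


def multi_turn_rollout(prompt: str, max_turns: int = 3) -> Attempt:
    """Let the model iteratively revise its answer using test feedback."""
    feedback = None
    best = Attempt(turn=0, code="", reward=0.0)
    for turn in range(1, max_turns + 1):
        code = generate(prompt, feedback)
        reward = run_unit_tests(code)
        if reward > best.reward:
            best = Attempt(turn=turn, code=code, reward=reward)
        if reward == 1.0:  # all tests pass: stop early
            break
        feedback = f"only {reward:.0%} of tests passed; revise"
    return best


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: reward centred and scaled within the sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    prompt = "write a function that reverses a string"
    group = [multi_turn_rollout(prompt) for _ in range(8)]  # G rollouts per prompt
    advantages = group_relative_advantages([a.reward for a in group])
    for attempt, adv in zip(group, advantages):
        print(f"turn={attempt.turn} reward={attempt.reward:.2f} advantage={adv:+.2f}")
    # A real trainer would feed these advantages into a clipped policy-gradient update.
```

In a real setup the rollouts would be genuine model samples scored by hidden unit tests, and the advantages would weight the token-level policy-gradient loss; the sketch only shows how multi-turn revision and group-relative scoring fit together.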


Recommended Readings
Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients
Positive · Artificial Intelligence
The article reconciles two distinct approaches to policy gradient optimization for the Pass@K objective in reinforcement learning: direct REINFORCE-style methods and advantage-shaping techniques that modify GRPO. It shows that the two are two sides of the same coin and interprets hard-example up-weighting modifications as reward-level regularization. It also provides a recipe for deriving both existing and new advantage-shaping methods, offering insights into RLVR (reinforcement learning with verifiable rewards) policy gradient optimization beyond the initial focus on Pass@K.
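
As a rough illustration of the Pass@K framing, the sketch below computes the standard unbiased pass@k estimator over a group of sampled solutions and applies a toy advantage-shaping rule with hard-example up-weighting. The shaping rule is illustrative only and is not taken from the paper; binary verifiable rewards are assumed.

```python
# Toy Pass@K estimator and illustrative advantage shaping (not the paper's rules).

from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct, k kept."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def shaped_advantages(rewards: list[int], k: int) -> list[float]:
    """Illustrative reward-level shaping: centre rewards on the group's pass@k
    baseline and up-weight hard prompts (groups with few successes)."""
    n, c = len(rewards), sum(rewards)
    baseline = pass_at_k(n, c, k)   # group-level pass@k baseline
    weight = 1.0 - c / n            # hard-example up-weighting (toy choice)
    return [weight * (r - baseline) for r in rewards]


if __name__ == "__main__":
    rewards = [1, 0, 0, 0, 0, 0, 0, 0]  # 1 of 8 samples passes the tests
    print(f"pass@4 estimate: {pass_at_k(8, 1, 4):.3f}")
    print("shaped advantages:", [round(a, 3) for a in shaped_advantages(rewards, 4)])
```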