PRPO: Aligning Process Reward with Outcome Reward in Policy Optimization
Positive | Artificial Intelligence
- Process Relative Policy Optimization (PRPO) aims to enhance policy optimization for large language models (LLMs) by aligning process rewards with outcome rewards, addressing the limitations of existing critic-free methods such as GRPO. PRPO takes a more fine-grained approach, segmenting reasoning sequences and normalizing the process-level feedback, which improves the accuracy of models such as Qwen2.5-Math-1.5B on benchmarks like MATH500 (a rough code sketch of this mechanism follows the list below).
- This development is significant because it demonstrates a shift toward reinforcement learning techniques that perform better on multi-step reasoning tasks, where sparse reward signals have limited existing methods. The improvement from 61.2% to 64.4% accuracy with minimal rollouts highlights the potential for more efficient training methodologies in AI.
- The evolution of policy optimization methods reflects a broader trend in AI research towards enhancing model stability and performance through innovative frameworks. As various approaches like DVPO, GTPO, and GAPO emerge, they collectively address issues such as reward distribution and training effectiveness, indicating a concerted effort to refine reinforcement learning strategies for LLMs.
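The summary above gives no implementation details, so the Python sketch below is only a guess at the mechanism it describes: grouping rollouts per prompt, normalizing segment-level process rewards, and anchoring them to a GRPO-style group-normalized outcome advantage. The `Rollout` class, the `compute_prpo_advantages` function, and the 0.5 blending weight are illustrative assumptions, not details taken from the PRPO paper.

```python
# Hedged sketch of segment-level advantages aligned with an outcome advantage.
# All names and constants here are illustrative assumptions, not the PRPO paper's.
from dataclasses import dataclass
from typing import List
import statistics


@dataclass
class Rollout:
    """One sampled reasoning trace for a prompt."""
    segment_rewards: List[float]  # process reward per reasoning segment
    outcome_reward: float         # e.g. 1.0 if the final answer is correct, else 0.0


def compute_prpo_advantages(group: List[Rollout], eps: float = 1e-8) -> List[List[float]]:
    """Illustrative advantage computation for a group of rollouts of one prompt.

    1. Normalize outcome rewards across the group (critic-free, GRPO-style).
    2. Normalize process rewards across all segments in the group.
    3. Blend the two so segment-level feedback stays anchored to the outcome
       signal while still differentiating between segments.
    """
    outcomes = [r.outcome_reward for r in group]
    o_mean = statistics.fmean(outcomes)
    o_std = statistics.pstdev(outcomes) or eps
    outcome_adv = [(o - o_mean) / o_std for o in outcomes]

    all_segments = [s for r in group for s in r.segment_rewards]
    s_mean = statistics.fmean(all_segments)
    s_std = statistics.pstdev(all_segments) or eps

    advantages = []
    for rollout, o_adv in zip(group, outcome_adv):
        seg_adv = [
            # Outcome advantage anchors the sign; the normalized process reward
            # redistributes credit across segments (0.5 weight is arbitrary).
            o_adv + 0.5 * (s - s_mean) / s_std
            for s in rollout.segment_rewards
        ]
        advantages.append(seg_adv)
    return advantages


if __name__ == "__main__":
    group = [
        Rollout(segment_rewards=[0.8, 0.6, 0.9], outcome_reward=1.0),  # correct trace
        Rollout(segment_rewards=[0.7, 0.2, 0.1], outcome_reward=0.0),  # incorrect trace
    ]
    for trace_adv in compute_prpo_advantages(group):
        print([round(a, 3) for a in trace_adv])
```

Normalizing everything within the sampled group keeps the procedure critic-free, which is the property the bullets above attribute to both GRPO and PRPO.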
— via World Pulse Now AI Editorial System
