SSPO: Subsentence-level Policy Optimization
Positive · Artificial Intelligence
A recent study on post-training techniques for Large Language Models (LLMs) highlights the advances driven by Reinforcement Learning from Verifiable Rewards (RLVR), an approach that has markedly improved the reasoning capabilities of LLMs. The work also identifies shortcomings in existing RLVR algorithms such as GRPO and GSPO, which suffer from unstable policy updates and inefficient use of training data. Understanding these issues matters because it points toward more effective training methods, ultimately improving AI performance across applications.
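For context on the family of algorithms named above, GRPO's central idea is to score each sampled response relative to the other responses drawn for the same prompt, normalizing rewards within the group. The sketch below is an illustrative, minimal rendition of that group-relative advantage step only (function name and epsilon are our own; it is not the paper's implementation):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage estimate: normalize each sampled response's
    reward by the mean and standard deviation of its sampling group.
    `rewards` holds one scalar verifiable reward per sampled response."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 responses to one prompt, scored by a verifiable reward
# (1.0 = answer checked correct, 0.0 = incorrect).
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses receive positive advantage and incorrect ones negative, so the policy gradient pushes probability mass toward the verified answers without needing a learned value function.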
— via World Pulse Now AI Editorial System
