ST-PPO: Stabilized Off-Policy Proximal Policy Optimization for Multi-Turn Agents Training

arXiv — cs.LG · Thursday, November 27, 2025
  • ST-PPO, a stabilized variant of Proximal Policy Optimization (PPO), aims to improve the training of multi-turn dialogue and reasoning agents by addressing performance instability. The approach incorporates turn-level importance sampling and a clipping-bias correction to make training updates more reliable and reduce variance in gradient estimates.
  • This development is significant as it seeks to optimize the training process for large language models (LLMs), which are increasingly utilized in complex dialogue systems and reasoning tasks. By stabilizing the training process, ST-PPO could lead to more effective and reliable AI systems in various applications, including medical question answering and multi-turn interactions.
  • The challenges of training reinforcement learning models, particularly in multi-turn environments, highlight ongoing issues in AI development, such as the need for better alignment between model training and real-world applications. This reflects a broader trend in AI research focusing on improving model generalizability and performance across diverse tasks, as seen in recent advancements in benchmarking tools and hybrid frameworks that combine different learning methodologies.
— via World Pulse Now AI Editorial System
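To make the idea of turn-level importance sampling concrete, the sketch below shows one plausible construction: tokens belonging to the same dialogue turn share a single importance ratio, which is then clipped in the usual PPO fashion. This is an illustrative assumption about how such an objective could be formed, not the paper's actual implementation; the function name, argument layout, and per-turn advantage averaging are all hypothetical.

```python
import numpy as np

def turn_level_ppo_loss(logp_new, logp_old, advantages, turn_ids, clip_eps=0.2):
    """Sketch of a turn-level clipped PPO objective (hypothetical).

    Rather than a per-token importance ratio, every token in a turn
    shares one ratio computed from the turn's summed log-probabilities,
    which can reduce ratio variance over long multi-turn trajectories.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    turn_ids = np.asarray(turn_ids)

    losses = []
    for t in np.unique(turn_ids):
        mask = turn_ids == t
        # One importance ratio per turn: exp of the summed log-prob difference.
        ratio = np.exp(logp_new[mask].sum() - logp_old[mask].sum())
        adv = advantages[mask].mean()
        unclipped = ratio * adv
        clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
        # Standard PPO pessimistic bound, applied at turn granularity.
        losses.append(-min(unclipped, clipped))
    return float(np.mean(losses))
```

When the new and old policies agree, every turn's ratio is 1 and the loss reduces to the negative mean of the per-turn advantages, as in vanilla PPO.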
