arXiv:2601.08427v1 Announce Type: cross 
Abstract: Group Relative Policy Optimization (GRPO) significantly enhances the reasoning performance of Large Language Models (LLMs). However, this success heavily relies on expensive external verifiers or human rules. Such dependency not only leads to significant computational costs and training latency, but also yields sparse rewards that hinder optimization efficiency. To address these challenges, we propose Latent-GRPO, a framework that derives intrinsic rewards directly from latent space geometry. Crucially, our empirical analysis reveals a compelling geometric property: terminal token representations of correct reasoning trajectories form dense clusters with high intra-class similarity, whereas incorrect trajectories remain scattered as outliers. In light of this discovery, we introduce the Iterative Robust Centroid Estimation (IRCE) algorithm, which generates dense, continuous rewards by mitigating magnitude fluctuations via spherical projection and estimating a robust "truth centroid" through iterative aggregation. Experimental results on multiple datasets show that our method maintains model performance while achieving a training speedup of over 2x compared to baselines. Furthermore, extensive results demonstrate strong generalization ability and robustness. The code will be released soon.
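
The abstract names three ingredients for IRCE: spherical projection, iterative aggregation toward a robust centroid, and dense continuous rewards. The sketch below is one possible reading of that description, not the authors' implementation, which had not been released at the time of the announcement. The function names, the `keep_fraction` trimming rule, the fixed iteration count, and the use of cosine similarity as the reward are all assumptions made for illustration.

```python
import math


def normalize(v):
    """Project a vector onto the unit sphere (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Inner product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_direction(vectors):
    """Normalized sum of the given vectors, i.e. their mean direction on the sphere."""
    dim = len(vectors[0])
    total = [sum(v[i] for v in vectors) for i in range(dim)]
    return normalize(total)


def latent_rewards(hidden_states, keep_fraction=0.5, iterations=5):
    """Hypothetical IRCE-style reward: similarity of each sample to a trimmed centroid.

    hidden_states: one terminal-token vector per sampled trajectory in a group.
    keep_fraction: assumed share of samples treated as the dense cluster.
    """
    # Spherical projection removes magnitude differences between representations.
    unit = [normalize(h) for h in hidden_states]
    centroid = mean_direction(unit)
    keep = max(1, math.ceil(keep_fraction * len(unit)))
    for _ in range(iterations):
        # Re-estimate the centroid from the samples currently closest to it,
        # so scattered outliers stop pulling the estimate toward themselves.
        sims = [dot(u, centroid) for u in unit]
        ranked = sorted(range(len(unit)), key=lambda i: sims[i], reverse=True)
        centroid = mean_direction([unit[i] for i in ranked[:keep]])
    # Dense, continuous reward in [-1, 1] for every trajectory in the group.
    return [dot(u, centroid) for u in unit]
```

As a usage example, calling `latent_rewards([[1.0, 0.1], [2.0, 0.0], [0.9, -0.1], [-1.0, 3.0]])` gives the three vectors pointing in roughly the same direction higher rewards than the fourth, which plays the role of a scattered outlier. Because of the projection step, rescaling any input vector leaves the rewards unchanged.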

A new framework called Latent-GRPO has been introduced to enhance the reasoning performance of Large Language Models (LLMs) by deriving intrinsic rewards from latent space geometry, addressing the limitations of traditional Group Relative Policy Optimization (GRPO) that relies on external verifiers.

Silence the Judge: Reinforcement Learning with Self-Verifier via Latent Geometric Clustering
