arXiv:2511.08325v1 Announce Type: cross 
Abstract: Despite rapid development, large language models (LLMs) still encounter challenges in multi-turn decision-making tasks (i.e., agent tasks) like web shopping and browser navigation, which require making a sequence of intelligent decisions based on environmental feedback. Previous work on LLM agents typically relies on elaborate prompt engineering or fine-tuning with expert trajectories to improve performance. In this work, we take a different perspective: we explore constructing process reward models (PRMs) to evaluate each decision and guide the agent's decision-making process. Unlike LLM reasoning, where each step is scored based on correctness, actions in agent tasks do not have clear-cut correctness. Instead, they should be evaluated based on their proximity to the goal and the progress they have made. Building on this insight, we propose a redefined PRM for agent tasks, named AgentPRM, to capture both the interdependence between sequential decisions and their contribution to the final goal. This enables better progress tracking and exploration-exploitation balance. To scalably obtain labeled data for training AgentPRM, we employ a Temporal Difference-based (TD-based) estimation method combined with Generalized Advantage Estimation (GAE), which proves more sample-efficient than prior methods. Extensive experiments across different agentic tasks show that AgentPRM is over $8\times$ more compute-efficient than baselines, and it demonstrates robust improvement when scaling up test-time compute. Moreover, we perform detailed analyses to show how our method works and offer more insights, e.g., applying AgentPRM to the reinforcement learning of LLM agents.
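
The abstract names TD-based estimation combined with GAE as the source of step-level training labels but does not spell out the formulas. As a rough illustration only, the sketch below implements textbook GAE over a single trajectory; it is not taken from the paper. The function names (`gae_advantages`, `step_targets`), the sparse terminal reward, and the default `gamma` and `lam` values are all assumptions made for this example.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Textbook GAE for one trajectory (hypothetical helper, not the paper's code).

    rewards[t] is the reward received after step t.
    values must have len(rewards) + 1 entries; the last entry is the
    bootstrap value of the final state (0.0 if the episode terminated).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must contain one more entry than rewards")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the accumulated future TD errors.
    for t in reversed(range(len(rewards))):
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + (gamma * lam) * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def step_targets(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Per-step value targets A_t + V(s_t) (an assumed labeling scheme)."""
    adv = gae_advantages(rewards, values, gamma, lam)
    return [a + v for a, v in zip(adv, values[:-1])]
```

In this sketch, setting `gamma = lam = 1` makes every target equal to the Monte Carlo return of the trajectory, while `lam = 0` reduces each target to the one-step TD target `r_t + gamma * V(s_{t+1})`. Intermediate values of `lam` trade bias against variance between those two extremes.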

The paper 'AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress' introduces process reward models (PRMs) that score each decision an LLM agent makes in multi-turn tasks such as web shopping and browser navigation. Rather than judging a step as correct or incorrect, AgentPRM evaluates it by its proximity to the goal and the progress it makes. The authors report that the method is over eight times more compute-efficient than baselines and that it improves robustly as test-time compute scales up.
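
One simple way a step-level scorer could guide an agent at test time is best-of-N reranking: sample several candidate actions, score each one, and execute the highest-scoring candidate. The sketch below illustrates that idea only. The summary does not say which search procedure the paper uses, and `select_action` and `toy_score` are hypothetical names, with `toy_score` standing in for a trained reward model.

```python
from typing import Callable, Sequence


def select_action(
    history: Sequence[str],
    candidates: Sequence[str],
    score: Callable[[Sequence[str], str], float],
) -> str:
    """Return the candidate action with the highest score (best-of-N)."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    # max() keeps the first candidate when several share the top score.
    return max(candidates, key=lambda action: score(history, action))


def toy_score(history: Sequence[str], action: str) -> float:
    """Stand-in scorer: counts words the action shares with the goal.

    Assumes history[0] holds the task goal. A real system would call a
    trained process reward model here instead.
    """
    goal_words = set(history[0].lower().split())
    return float(sum(word in goal_words for word in action.lower().split()))


history = ["buy red running shoes"]
candidates = ["click blue sandals", "search red running shoes", "go back"]
best = select_action(history, candidates, toy_score)
```

Sampling more candidates per step is one way to spend additional test-time compute, which is the axis along which the abstract reports scaling behavior.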
