CuES: A Curiosity-driven and Environment-grounded Synthesis Framework for Agentic RL

arXiv — cs.LG · Thursday, December 4, 2025 at 5:00:00 AM
  • A new framework called CuES has been introduced to enhance agentic reinforcement learning (RL) by autonomously generating diverse and meaningful tasks in environments lacking predefined tasks. This addresses the challenge of task scarcity, which has hindered the scalability of RL in complex settings where tool semantics are initially unknown.
  • The development of CuES is significant as it empowers large language model agents to operate more effectively in dynamic environments, potentially leading to improved decision-making and interaction efficiency in various applications, including those in e-commerce and AI-driven platforms.
  • This innovation aligns with a growing trend in AI research focused on optimizing training processes and enhancing agent performance through self-generated tasks, as seen in other frameworks like DreamGym and AgentEvolver, which also aim to reduce costs and improve efficiency in RL training.
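The summary above does not detail CuES's algorithm, but the core idea it describes — autonomously proposing tasks in an environment with none predefined, guided by curiosity — can be illustrated with a minimal, hypothetical sketch. Here "curiosity" is approximated by a simple token-novelty score (an assumption for illustration, not the paper's actual method): candidate tasks that look least like what the agent has already attempted are selected for training.

```python
# Hypothetical curiosity-driven task-synthesis loop (illustrative only;
# CuES's real scoring and LLM-based task generation are not given here).
from collections import Counter

def novelty(task: str, history: Counter) -> float:
    """Higher when the task's tokens have rarely been seen before."""
    tokens = task.split()
    seen = sum(history[t] for t in tokens)
    return 1.0 / (1.0 + seen / max(len(tokens), 1))

def synthesize_tasks(candidates, history, k=2):
    """Pick the k most novel candidate tasks for the agent to train on."""
    ranked = sorted(candidates, key=lambda t: novelty(t, history), reverse=True)
    return ranked[:k]

# Toy e-commerce-style environment: the agent has only tried one task so far.
history = Counter("search product catalog".split())
candidates = [
    "search product catalog",        # already familiar -> low novelty
    "compare prices across stores",  # unseen tokens -> high novelty
    "track an order shipment",       # unseen tokens -> high novelty
]
chosen = synthesize_tasks(candidates, history)
for t in chosen:
    history.update(t.split())  # selected tasks become "familiar" next round
print(chosen)
```

In this toy run the two unfamiliar tasks are selected and the already-practiced one is skipped, mirroring how a curiosity signal keeps self-generated tasks diverse rather than repetitive.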
— via World Pulse Now AI Editorial System


Continue Reading
Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning
Positive · Artificial Intelligence
A new study introduces SPEAR, a self-imitation learning approach designed to enhance the exploration-exploitation balance in reinforcement learning for large language models (LLMs). This method aims to improve the stability of RL training by utilizing the agent's own experiences to guide policy entropy adjustments, addressing challenges associated with traditional exploration techniques.
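SPEAR's exact procedure is not spelled out in this summary, but the general self-imitation idea it builds on can be sketched: the agent keeps its own past episodes and imitates only those whose return beat the current value baseline, so successful experience is reinforced without external demonstrations. The following is a minimal sketch under that assumption; the function and data layout are hypothetical, not SPEAR's API.

```python
# Illustrative self-imitation filter: re-train only on the agent's own
# better-than-expected episodes (positive advantage), ignoring the rest.
def self_imitation_batch(episodes, value_estimate):
    """episodes: list of (states, actions, episode_return) triples.
    Returns (state, action, advantage) samples from winning episodes."""
    batch = []
    for states, actions, ret in episodes:
        advantage = ret - value_estimate
        if advantage > 0:  # "trust the wins": imitate only successes
            batch.extend((s, a, advantage) for s, a in zip(states, actions))
    return batch

# Toy rollout data: one episode beats the baseline, one does not.
episodes = [
    (["s0", "s1"], ["a0", "a1"], 5.0),  # return above baseline -> imitated
    (["s0"], ["a2"], 1.0),              # below baseline -> contributes nothing
]
batch = self_imitation_batch(episodes, value_estimate=2.0)
print(len(batch))  # only the winning episode's two steps survive
```

In a full method, these filtered samples would weight an imitation loss alongside the usual RL objective, which is what stabilizes the exploration-exploitation balance the study targets.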