Beyond math and coding: New RL framework helps train LLM agents for complex, real-world tasks

VentureBeat · Friday, November 28, 2025 at 4:00:00 AM
Positive · Technology
  • Researchers at the University of Science and Technology of China have introduced Agent-R1, a new reinforcement learning framework designed to train large language models (LLMs) on complex tasks beyond traditional math and coding. The framework strengthens reasoning by interleaving multiple retrieval stages and tool interactions, better reflecting the dynamic nature of real-world applications; a minimal sketch of this kind of multi-turn, tool-using rollout appears after the summary below.
  • The development of Agent-R1 is significant as it represents a shift in how LLMs can be trained to handle agentic tasks in enterprise settings, potentially leading to more effective AI applications that can adapt to evolving environments and imperfect information.
  • This innovation comes amid discussions about the security risks associated with AI tools like DeepSeek-R1, which has raised concerns among experts regarding its handling of sensitive topics. The contrasting advancements in AI frameworks highlight the ongoing challenges of balancing performance improvements with ethical considerations and security in AI development.
— via World Pulse Now AI Editorial System
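
To make the training setup concrete, here is a minimal, hypothetical sketch of the kind of multi-turn, tool-using rollout that agentic RL training operates on: the policy alternates between generating text and issuing retrieval calls, tool observations are appended back into the context, and an outcome reward scores the final answer. All names here (`rollout`, `toy_policy`, `toy_tool`, the `SEARCH:` convention) are illustrative assumptions, not Agent-R1's published API.

```python
# Hypothetical sketch of a multi-turn, tool-using rollout of the sort that
# agentic RL training (e.g., Agent-R1-style setups) collects trajectories from.
# Names are placeholders, not the framework's actual API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Turn:
    action: str        # text emitted by the policy (may contain a tool call)
    observation: str   # tool output fed back into the context, "" if none

@dataclass
class Trajectory:
    question: str
    turns: list[Turn] = field(default_factory=list)
    reward: float = 0.0

def rollout(policy: Callable[[str], str],
            tool: Callable[[str], str],
            question: str,
            max_turns: int = 4) -> Trajectory:
    """Alternate policy generation and tool calls until a final answer is given."""
    traj = Trajectory(question=question)
    context = question
    for _ in range(max_turns):
        action = policy(context)
        if action.startswith("SEARCH:"):            # policy requests retrieval
            obs = tool(action[len("SEARCH:"):].strip())
            traj.turns.append(Turn(action, obs))
            context += f"\n{action}\n{obs}"          # observation extends context
        else:                                        # final answer ends the episode
            traj.turns.append(Turn(action, ""))
            break
    # An outcome reward (e.g., answer correctness) would drive the policy-gradient
    # update; here it is just a stub check on the final action.
    traj.reward = 1.0 if "42" in traj.turns[-1].action else 0.0
    return traj

# Toy usage: a scripted "policy" that searches once, then answers.
def toy_policy(context: str) -> str:
    return "SEARCH: answer to everything" if "OBS" not in context else "The answer is 42."

def toy_tool(query: str) -> str:
    return f"OBS: top result for '{query}' mentions 42."

print(rollout(toy_policy, toy_tool, "What is the answer to everything?"))
```

A real trainer would batch many such trajectories and apply a policy-gradient update over the model-generated tokens, with the tool observations treated as environment input rather than as actions to be optimized.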


Continue Reading
GAM takes aim at “context rot”: A dual-agent memory architecture that outperforms long-context LLMs
Positive · Technology
A research team from China and Hong Kong has introduced a new memory architecture called General Agentic Memory (GAM) aimed at addressing the issue of 'context rot' in AI models, which leads to the loss of information during lengthy interactions. This dual-agent system separates memory functions to enhance information retention and retrieval, potentially improving the performance of AI assistants in complex tasks.
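
As a rough illustration of the dual-agent split described above, the sketch below separates a "memorizer" that compresses interactions into stored notes from a "researcher" that retrieves only the notes relevant to the current query, so the working context stays short instead of growing without bound. Class and method names (`MemoryStore`, `Memorizer`, `Researcher`) are hypothetical and do not reflect GAM's actual implementation.

```python
# Hypothetical dual-agent memory split in the spirit of GAM: one component
# writes compressed notes, another retrieves only what the current query needs.
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    notes: list[str] = field(default_factory=list)

class Memorizer:
    """Compresses each interaction into a short note and appends it to the store."""
    def write(self, store: MemoryStore, interaction: str) -> None:
        note = interaction[:120]          # stand-in for an LLM-generated summary
        store.notes.append(note)

class Researcher:
    """Scores stored notes against the query and returns only the best matches."""
    def retrieve(self, store: MemoryStore, query: str, k: int = 3) -> list[str]:
        words = set(query.lower().split())
        scored = sorted(store.notes,
                        key=lambda n: len(words & set(n.lower().split())),
                        reverse=True)
        return scored[:k]

# Usage: history is written once, then queried without re-reading everything.
store = MemoryStore()
memorizer, researcher = Memorizer(), Researcher()
memorizer.write(store, "User prefers responses in French for customer emails.")
memorizer.write(store, "Deployment target is Kubernetes; staging cluster is eu-west-1.")
print(researcher.retrieve(store, "Which cluster do we deploy staging to?"))
```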