arXiv:2510.26575v1 Announce Type: new 
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a promising approach for enhancing agentic deep search. However, its application is often hindered by low Reward Density in deep search scenarios, where agents incur significant exploration costs for infrequent and often null final rewards. In this paper, we formalize this challenge as the Reward Density Optimization problem, which aims to improve the reward obtained per unit of exploration cost. We introduce InfoFlow, a systematic framework that tackles this problem from three aspects. 1) Subproblem decomposition: breaking down long-range tasks to assign process rewards, thereby providing denser learning signals. 2) Failure-guided hints: injecting corrective guidance into stalled trajectories to increase the probability of successful outcomes. 3) Dual-agent refinement: employing a dual-agent architecture to offload the cognitive burden of deep exploration. A refiner agent synthesizes the search history, which effectively compresses the researcher's perceived trajectory, thereby reducing exploration cost and increasing the overall reward density. We evaluate InfoFlow on multiple agentic search benchmarks, where it significantly outperforms strong baselines, enabling lightweight LLMs to achieve performance comparable to advanced proprietary LLMs.
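
To make "reward obtained per unit of exploration cost" concrete, here is a minimal Python sketch. It is not InfoFlow's implementation: the `Trajectory` fields, the unit step costs, and the 0.5 process reward are all assumptions chosen for illustration. It only shows how adding process rewards raises the numerator and how a compressed search history lowers the denominator.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trajectory:
    # All fields are illustrative assumptions, not quantities defined in the abstract.
    step_costs: List[float]   # cost of each exploration step (e.g. one tool call)
    final_reward: float       # verifiable outcome reward, often 0.0 in deep search
    process_rewards: List[float] = field(default_factory=list)  # solved subproblems


def reward_density(trajectories: List[Trajectory]) -> float:
    """Total reward collected divided by total exploration cost."""
    total_reward = sum(t.final_reward + sum(t.process_rewards) for t in trajectories)
    total_cost = sum(sum(t.step_costs) for t in trajectories)
    return total_reward / total_cost if total_cost > 0 else 0.0


# Baseline: two 10-step rollouts, only one earns the final reward.
sparse = [Trajectory([1.0] * 10, 0.0), Trajectory([1.0] * 10, 1.0)]

# Subproblem decomposition: each rollout also earns a partial process reward.
decomposed = [Trajectory([1.0] * 10, 0.0, [0.5]), Trajectory([1.0] * 10, 1.0, [0.5])]

# Refinement: same rewards, but a compressed history halves the perceived cost.
refined = [Trajectory([1.0] * 5, 0.0, [0.5]), Trajectory([1.0] * 5, 1.0, [0.5])]

for name, batch in [("sparse", sparse), ("decomposed", decomposed), ("refined", refined)]:
    print(name, reward_density(batch))
```

In this toy setting the density rises from the sparse batch to the decomposed batch and again to the refined batch, which mirrors the direction of the abstract's argument without saying anything about the size of the effect in practice.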

A recent paper introduces InfoFlow, a framework that strengthens deep search agents through Reward Density Optimization. It targets a common problem in reinforcement learning: agents pay high exploration costs for infrequent rewards. According to the abstract, the approach lets lightweight LLMs reach performance comparable to advanced proprietary LLMs on agentic search benchmarks.

InfoFlow: Reinforcing Search Agent Via Reward Density Optimization

arXiv:2411.00365v2 Announce Type: replace 
Abstract: In the paradigm of decentralized learning, a group of agents collaborate to learn a global model using a distributed dataset without a central server; nevertheless, it is severely challenged by the heterogeneity of the data distribution across the agents. For example, the data may not be independent and identically distributed (non-IID), and may even be noisy or poisoned. To address these data challenges, we propose ROSS, a novel robust decentralized stochastic learning algorithm based on Shapley values. Specifically, in each round, each agent aggregates the cross-gradient information from its neighbors, i.e., the derivatives of its local model with respect to the datasets of its neighbors, to update its local model in a momentum-like manner, while we innovate in weighting the derivatives according to their contributions measured by Shapley values. We perform solid theoretical analysis to reveal the linear convergence speedup of our ROSS algorithm. We also verify the efficacy of our algorithm through extensive experiments on public datasets. Our results demonstrate that, in the face of the above variety of data challenges, our ROSS algorithm has significant advantages over existing state-of-the-art proposals in terms of both convergence and prediction accuracy.
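
The idea of weighting neighbor gradients by Shapley values can be sketched in Python as follows. This is not the ROSS update rule: the utility function (alignment of a coalition's mean gradient with a trusted local gradient), the clipping of negative Shapley values to zero, and the toy gradients are all assumptions made for illustration. Only the exact Shapley formula itself is standard.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, utility):
    """Exact Shapley value of each player for a set-valued utility function."""
    n = len(players)
    values = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            # Standard Shapley weight for coalitions of size k that exclude p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                values[p] += weight * (utility(s | {p}) - utility(s))
    return values


def shapley_weighted_gradient(grads, utility):
    """Combine neighbor gradients with weights proportional to Shapley values."""
    phi = shapley_values(list(grads), utility)
    # Assumption: neighbors with negative contribution get zero weight.
    clipped = {p: max(v, 0.0) for p, v in phi.items()}
    total = sum(clipped.values())
    if total > 0:
        weights = {p: v / total for p, v in clipped.items()}
    else:
        weights = {p: 1.0 / len(grads) for p in grads}  # fall back to uniform
    dim = len(next(iter(grads.values())))
    combined = [sum(weights[p] * grads[p][i] for p in grads) for i in range(dim)]
    return combined, weights


# Toy data (hypothetical): neighbor "c" sends a poisoned, reversed gradient.
local_grad = [1.0, 0.0]
neighbor_grads = {"a": [1.0, 0.0], "b": [0.8, 0.2], "c": [-1.0, 0.0]}


def alignment_utility(coalition):
    """Hypothetical utility: dot product of the coalition's mean gradient
    with the agent's own local gradient; the empty coalition scores 0."""
    if not coalition:
        return 0.0
    dim = len(local_grad)
    mean = [sum(neighbor_grads[p][i] for p in coalition) / len(coalition) for i in range(dim)]
    return sum(m * g for m, g in zip(mean, local_grad))


combined, weights = shapley_weighted_gradient(neighbor_grads, alignment_utility)
print(weights)
print(combined)
```

Exact Shapley computation enumerates every coalition, so its cost grows exponentially with the number of neighbors. That is tolerable for the three neighbors here but would need an approximation for larger neighborhoods.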

A new decentralized learning algorithm named ROSS uses Shapley values to make stochastic learning among agents more robust. It addresses the challenges posed by heterogeneous data distributions, allowing agents to collaboratively learn a global model without a central server. Each agent updates its model by aggregating cross-gradient information from neighboring agents, weighted by their contributions.

ROSS: RObust decentralized Stochastic learning based on Shapley values

One More Thing in AI – Your Shortcut to AI Mastery
