arXiv:2510.24126v1 Announce Type: new 
Abstract: Large Language Model (LLM) agents can leverage multiple turns and tools to solve complex tasks, with prompt-based approaches achieving strong performance. This work demonstrates that Reinforcement Learning (RL) can push capabilities significantly further by learning from experience. Through experiments on a legal document search benchmark, we show that our RL-trained 14 Billion parameter model outperforms frontier class models (85% vs 78% accuracy). In addition, we explore turn-restricted regimes, during training and at test-time, that show these agents achieve better results if allowed to operate over longer multi-turn horizons.

A recent study applies Reinforcement Learning (RL) to long-horizon, multi-turn search agents, evaluated on a legal document search benchmark. The authors report that their RL-trained 14-billion-parameter model reaches 85% accuracy, compared with 78% for frontier-class models. They also find that the agents achieve better results when allowed to operate over more turns, both during training and at test time.
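To make the idea of a turn-restricted regime concrete, here is a minimal sketch of a multi-turn search episode with a turn budget. Everything in it is an assumption for illustration: the toy corpus, the `search` function, the scripted `policy` (standing in for the LLM agent), and the `max_turns` parameter are hypothetical and not taken from the paper.

```python
# Toy corpus (hypothetical): document id -> text.
CORPUS = {
    "doc1": "lease agreement termination notice",
    "doc2": "lease agreement renewal terms",
    "doc3": "lease agreement termination penalty clause",
}


def search(query, top_k=1):
    """Return up to top_k ids of documents containing every query word."""
    words = query.split()
    hits = [doc_id for doc_id, text in sorted(CORPUS.items())
            if all(w in text.split() for w in words)]
    return hits[:top_k]


# Scripted query refinements; a real agent would generate these from history.
SCRIPTED_QUERIES = [
    "lease agreement",
    "lease agreement termination",
    "lease termination penalty",
]


def policy(history):
    """Pick the next query based on how many turns have been taken so far."""
    index = min(len(history), len(SCRIPTED_QUERIES) - 1)
    return SCRIPTED_QUERIES[index]


def run_episode(target_id, max_turns):
    """Run one episode under a turn budget; return (success, turns_used)."""
    history = []
    for turn in range(1, max_turns + 1):
        query = policy(history)
        results = search(query)
        history.append((query, results))
        if target_id in results:
            return True, turn
    return False, max_turns


if __name__ == "__main__":
    for budget in (2, 3):
        print(budget, run_episode("doc3", max_turns=budget))
```

In this toy setup the target is only reachable on the third refinement, so a budget of two turns fails while a budget of three succeeds. In an RL setting, the success flag would be one natural candidate for the episode reward, though the paper's actual reward design is not described in the abstract.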

Reinforcement Learning for Long-Horizon Multi-Turn Search Agents

arXiv:2601.08247v1 Announce Type: new 
Abstract: Financial markets are influenced by human behavior that deviates from rationality due to cognitive biases. Traditional reinforcement learning (RL) models for financial decision-making assume rational agents, potentially overlooking the impact of psychological factors. This study integrates cognitive biases into RL frameworks for financial trading, hypothesizing that such models can exhibit human-like trading behavior and achieve better risk-adjusted returns than standard RL agents. We introduce biases, such as overconfidence and loss aversion, into reward structures and decision-making processes and evaluate their performance in simulated and real-world trading environments. Despite its inconclusive or negative results, this study provides insights into the challenges of incorporating human-like biases into RL, offering valuable lessons for developing robust financial AI systems.

A recent study published on arXiv integrates cognitive biases into reinforcement learning (RL) frameworks for financial trading. The authors introduce biases such as overconfidence and loss aversion into reward structures and decision-making processes, hypothesizing that the resulting agents would trade in a more human-like way and achieve better risk-adjusted returns than standard RL agents that assume rationality. The results were inconclusive or negative, so the study's contribution is a set of lessons on the challenges of building human-like biases into RL for financial AI systems.
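As a rough illustration of what introducing loss aversion into a reward structure could look like, here is a minimal sketch. The piecewise-linear shaping and the default coefficient of 2.0 are assumptions chosen for illustration; the abstract does not specify the form the authors used.

```python
def loss_averse_reward(pnl, loss_aversion=2.0):
    """Return a shaped reward: gains are unchanged, losses are amplified.

    pnl is the raw profit-and-loss for one step; loss_aversion is a
    hypothetical coefficient (>= 1) that scales losses relative to gains.
    """
    if loss_aversion < 1.0:
        raise ValueError("loss_aversion should be >= 1")
    return pnl if pnl >= 0.0 else loss_aversion * pnl


def shaped_episode_return(pnls, loss_aversion=2.0):
    """Sum the shaped per-step rewards over one episode."""
    return sum(loss_averse_reward(p, loss_aversion) for p in pnls)


if __name__ == "__main__":
    pnls = [1.0, -1.0, 0.5, -0.5]
    print("raw return:", sum(pnls))
    print("shaped return:", shaped_episode_return(pnls))
```

With this shaping, an episode whose gains and losses cancel out in raw terms still receives a negative shaped return, which pushes an agent trained on it toward avoiding losses more strongly than a standard RL agent would.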

Incorporating Cognitive Biases into Reinforcement Learning for Financial Decision-Making

arXiv:2510.21060v2 Announce Type: replace 
Abstract: Policy optimization (PO) is a cornerstone of modern reinforcement learning (RL), with diverse applications spanning robotics, healthcare, and large language model training. The increasing deployment of PO in sensitive domains, however, raises significant privacy concerns. In this paper, we initiate a theoretical study of differentially private policy optimization, focusing explicitly on its sample complexity. We first formalize an appropriate definition of differential privacy (DP) tailored to PO, addressing the inherent challenges arising from on-policy learning dynamics and the subtlety involved in defining the unit of privacy. We then systematically analyze the sample complexity of widely-used PO algorithms, including policy gradient (PG), natural policy gradient (NPG) and more, under DP constraints and various settings, via a unified framework. Our theoretical results demonstrate that privacy costs can often manifest as lower-order terms in the sample complexity, while also highlighting subtle yet important observations in private PO settings. These offer valuable practical insights for privacy-preserving PO algorithms.

A recent study initiates a theoretical analysis of differentially private policy optimization, focusing on the sample complexity of policy optimization (PO) in reinforcement learning (RL). It addresses privacy concerns in sensitive applications such as robotics and healthcare by formalizing a definition of differential privacy (DP) tailored to PO, then analyzing the sample complexity of widely used PO algorithms, including policy gradient (PG) and natural policy gradient (NPG), under DP constraints. The results indicate that the cost of privacy can often appear as lower-order terms in the sample complexity.
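For readers unfamiliar with how a DP constraint typically enters a gradient-based update, the sketch below shows the generic clip-and-add-Gaussian-noise recipe applied to per-trajectory gradients. This is a common pattern from private gradient methods and is not the paper's algorithm: treating one trajectory as the unit of privacy, and the names `max_norm` and `noise_multiplier`, are assumptions for illustration. The sketch also performs no privacy accounting.

```python
import math
import random


def clip_l2(vec, max_norm):
    """Scale vec down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm <= max_norm:
        return list(vec)
    scale = max_norm / norm
    return [x * scale for x in vec]


def private_mean_gradient(per_traj_grads, max_norm, noise_multiplier, rng):
    """Average clipped per-trajectory gradients with Gaussian noise added.

    Clipping bounds each trajectory's contribution to the sum; the noise
    standard deviation is noise_multiplier * max_norm per coordinate.
    """
    n = len(per_traj_grads)
    dim = len(per_traj_grads[0])
    total = [0.0] * dim
    for grad in per_traj_grads:
        for i, x in enumerate(clip_l2(grad, max_norm)):
            total[i] += x
    std = noise_multiplier * max_norm
    return [(t + rng.gauss(0.0, std)) / n for t in total]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4]]
    print(private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng))
```

The noise added to the sum has a fixed scale, so after dividing by the number of trajectories its effect on the averaged gradient shrinks as more trajectories are collected.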

On the Sample Complexity of Differentially Private Policy Optimization

One More Thing in AI – Your Shortcut to AI Mastery
