arXiv:2510.26026v1 Announce Type: new 
Abstract: Reliable uncertainty quantification is crucial for reinforcement learning (RL) in high-stakes settings. We propose a unified conformal prediction framework for infinite-horizon policy evaluation that constructs distribution-free prediction intervals for returns in both on-policy and off-policy settings. Our method integrates distributional RL with conformal calibration, addressing challenges such as unobserved returns, temporal dependencies, and distributional shifts. We propose a modular pseudo-return construction based on truncated rollouts and a time-aware calibration strategy using experience replay and weighted subsampling. These innovations mitigate model bias and restore approximate exchangeability, enabling uncertainty quantification even under policy shifts. Our theoretical analysis provides coverage guarantees that account for model misspecification and importance weight estimation. Empirical results, including experiments in synthetic and benchmark environments like Mountain Car, show that our method significantly improves coverage and reliability over standard distributional RL baselines.
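To make the two ingredients named in the abstract more concrete, here is a minimal sketch of a truncated-rollout pseudo-return followed by a weighted split-conformal interval. It is a generic illustration, not the paper's algorithm: the function names, the absolute-residual score, the unit weight given to the test point, and the use of a single bootstrap value are all assumptions made for this example.

```python
import numpy as np

def pseudo_return(rewards, bootstrap_value, gamma=0.99):
    """Discounted sum of a truncated rollout plus a discounted bootstrap term.

    bootstrap_value stands in for a model-based estimate of the return after
    the truncation point (hypothetical; any value model or sample could be used).
    """
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(rewards.size)
    return float(discounts @ rewards + gamma ** rewards.size * bootstrap_value)

def conformal_interval(cal_pred, cal_target, test_pred, alpha=0.1, weights=None):
    """Split-conformal interval around test_pred using absolute residuals.

    weights: optional nonnegative calibration weights (for example estimated
    importance ratios). Uniform weights reduce to standard split conformal.
    """
    scores = np.abs(np.asarray(cal_target, dtype=float)
                    - np.asarray(cal_pred, dtype=float))
    w = np.ones(scores.size) if weights is None else np.asarray(weights, dtype=float)
    order = np.argsort(scores)
    # Append one unit of mass at +infinity for the unseen test point (assumption).
    sorted_scores = np.append(scores[order], np.inf)
    probs = np.append(w[order], 1.0) / (w.sum() + 1.0)
    cdf = np.cumsum(probs)
    # Smallest score whose cumulative weight reaches 1 - alpha; the tiny
    # tolerance guards against floating-point round-off in the cumulative sum.
    idx = int(np.searchsorted(cdf, 1.0 - alpha - 1e-12, side="left"))
    q = sorted_scores[min(idx, scores.size)]
    return test_pred - q, test_pred + q
```

In this sketch the calibration targets would be pseudo-returns, since true infinite-horizon returns are never observed. The weights argument is where an off-policy correction could enter.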

A new framework for reinforcement learning (RL) targets reliable uncertainty quantification in high-stakes environments. The method combines conformal prediction with distributional RL to build distribution-free prediction intervals for returns in infinite-horizon policy evaluation, in both on-policy and off-policy settings. It addresses unobserved returns, temporal dependencies, and distributional shift, and the authors report improved coverage over standard distributional RL baselines.

Conformal Prediction Beyond the Horizon: Distribution-Free Inference for Policy Evaluation

arXiv:2601.08247v1 Announce Type: new 
Abstract: Financial markets are influenced by human behavior that deviates from rationality due to cognitive biases. Traditional reinforcement learning (RL) models for financial decision-making assume rational agents, potentially overlooking the impact of psychological factors. This study integrates cognitive biases into RL frameworks for financial trading, hypothesizing that such models can exhibit human-like trading behavior and achieve better risk-adjusted returns than standard RL agents. We introduce biases, such as overconfidence and loss aversion, into reward structures and decision-making processes and evaluate their performance in simulated and real-world trading environments. Despite its inconclusive or negative results, this study provides insights into the challenges of incorporating human-like biases into RL, offering valuable lessons for developing robust financial AI systems.
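As an illustration of what introducing loss aversion into a reward structure can look like, here is a minimal sketch of a prospect-theory-style reward transform. It is an assumption for illustration only: the abstract does not give the paper's reward definition, and the function name and the default coefficient of 2.25 (the classic Tversky and Kahneman estimate) are choices made here, not values taken from the study.

```python
def loss_averse_reward(pnl, loss_aversion=2.25):
    """Leave gains unchanged and scale losses by loss_aversion (> 1).

    pnl: raw profit or loss for one step (hypothetical input).
    """
    return pnl if pnl >= 0 else loss_aversion * pnl
```

An agent trained on this shaped reward feels a loss more strongly than an equal-sized gain, which is the behavioral effect the abstract calls loss aversion.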

A recent arXiv study integrates cognitive biases such as overconfidence and loss aversion into reinforcement learning (RL) frameworks for financial decision-making. The authors hypothesized that bias-aware agents would trade in a more human-like way and achieve better risk-adjusted returns than standard RL agents that assume rationality. They report inconclusive or negative results and present the work as a set of lessons for building robust financial AI systems.

Incorporating Cognitive Biases into Reinforcement Learning for Financial Decision-Making

arXiv:2510.21060v2 Announce Type: replace 
Abstract: Policy optimization (PO) is a cornerstone of modern reinforcement learning (RL), with diverse applications spanning robotics, healthcare, and large language model training. The increasing deployment of PO in sensitive domains, however, raises significant privacy concerns. In this paper, we initiate a theoretical study of differentially private policy optimization, focusing explicitly on its sample complexity. We first formalize an appropriate definition of differential privacy (DP) tailored to PO, addressing the inherent challenges arising from on-policy learning dynamics and the subtlety involved in defining the unit of privacy. We then systematically analyze the sample complexity of widely-used PO algorithms, including policy gradient (PG), natural policy gradient (NPG) and more, under DP constraints and various settings, via a unified framework. Our theoretical results demonstrate that privacy costs can often manifest as lower-order terms in the sample complexity, while also highlighting subtle yet important observations in private PO settings. These offer valuable practical insights for privacy-preserving PO algorithms.
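For readers unfamiliar with how a differential privacy constraint typically enters a gradient-based update, here is a minimal sketch of the standard Gaussian mechanism applied to per-trajectory policy gradients: clip each gradient, sum, add noise, and average. This is a generic technique, not necessarily the algorithm analyzed in the paper. The function name, the choice of the trajectory as the unit of privacy, and the parameters clip_norm and noise_multiplier are assumptions for this example.

```python
import numpy as np

def private_mean_gradient(per_traj_grads, clip_norm, noise_multiplier, rng):
    """Average of clipped per-trajectory gradients with Gaussian noise added.

    per_traj_grads: array of shape (n, d), one gradient estimate per trajectory.
    rng: a numpy.random.Generator supplying the noise.
    """
    grads = np.asarray(per_traj_grads, dtype=float)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Rescale any gradient whose L2 norm exceeds clip_norm down to clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale
    # Noise standard deviation is proportional to the clipping bound.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grads.shape[1])
    return (clipped.sum(axis=0) + noise) / grads.shape[0]
```

Because the noise is divided by the number of trajectories, its contribution shrinks as the batch grows. That is one intuition for why privacy costs can show up as lower-order terms, though the paper's precise statements depend on its own definitions and settings.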

A recent study initiates a theoretical analysis of differentially private policy optimization, focusing on the sample complexity of policy optimization (PO) in reinforcement learning (RL). It addresses privacy concerns in sensitive applications such as robotics and healthcare by formalizing a definition of differential privacy (DP) tailored to PO and analyzing the sample complexity of PO algorithms such as policy gradient (PG) and natural policy gradient (NPG) under DP constraints. The results indicate that privacy costs can often appear as lower-order terms in the sample complexity.

