arXiv:2510.21122v2 Announce Type: replace 
Abstract: Reinforcement learning (RL) has shown promise in enhancing the general Chain-of-Thought (CoT) reasoning capabilities of multimodal large language models (MLLMs). However, when applied to improve general CoT reasoning, existing RL frameworks often struggle to generalize beyond the training distribution. To address this, we propose NoisyGRPO, a systematic multimodal RL framework that introduces controllable noise into visual inputs for enhanced exploration and explicitly models the advantage estimation process via a Bayesian framework. Specifically, NoisyGRPO improves RL training by: (1) Noise-Injected Exploration Policy: Perturbing visual inputs with Gaussian noise to encourage exploration across a wider range of visual scenarios; and (2) Bayesian Advantage Estimation: Formulating advantage estimation as a principled Bayesian inference problem, where the injected noise level serves as a prior and the observed trajectory reward as the likelihood. This Bayesian modeling fuses both sources of information to compute a robust posterior estimate of trajectory advantage, effectively guiding MLLMs to prefer visually grounded trajectories over noisy ones. Experiments on standard CoT quality, general capability, and hallucination benchmarks demonstrate that NoisyGRPO substantially improves generalization and robustness, especially in RL settings with small-scale MLLMs such as Qwen2.5-VL 3B. The project page is available at https://artanic30.github.io/project_pages/NoisyGRPO/.
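The two components named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the prior mean of `1 - noise_level`, the variances, and all function names are assumptions chosen for illustration, and the fusion step is a standard Gaussian conjugate update followed by GRPO-style normalization within a group.

```python
import random
import statistics


def add_gaussian_noise(pixels, sigma, rng):
    """Perturb a flat list of pixel values in [0, 1] with N(0, sigma^2) noise."""
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]


def posterior_score(reward, noise_level, prior_var=1.0, obs_var=1.0):
    """Fuse a noise-based prior with the observed reward.

    Assumption (not from the paper): the prior mean is 1 - noise_level,
    so cleaner inputs are expected to give better trajectories.
    """
    prior_mean = 1.0 - noise_level
    precision = 1.0 / prior_var + 1.0 / obs_var
    # Precision-weighted average of prior mean and observed reward.
    return (prior_mean / prior_var + reward / obs_var) / precision


def group_advantages(rewards, noise_levels):
    """Normalize posterior scores within one sampled group."""
    scores = [posterior_score(r, n) for r, n in zip(rewards, noise_levels)]
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores) or 1.0  # guard against a zero spread
    return [(s - mean) / std for s in scores]


if __name__ == "__main__":
    rng = random.Random(0)
    print(add_gaussian_noise([0.0, 0.5, 1.0], sigma=0.1, rng=rng))
    # Two correct and two incorrect trajectories, each at low and high noise.
    print(group_advantages([1.0, 1.0, 0.0, 0.0], [0.0, 0.8, 0.0, 0.8]))
```

Under these assumptions, a correct answer produced from a clean image receives a higher advantage than an equally correct answer produced from a heavily perturbed one, which matches the abstract's stated goal of preferring visually grounded trajectories.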

NoisyGRPO is a multimodal reinforcement learning framework that targets a known weakness of existing RL methods for multimodal large language models: poor generalization beyond the training distribution. It injects controllable Gaussian noise into visual inputs to broaden exploration, and it treats advantage estimation as Bayesian inference, with the injected noise level as the prior and the observed trajectory reward as the likelihood. The authors report improved generalization and robustness on Chain-of-Thought quality, general capability, and hallucination benchmarks, especially for small models such as Qwen2.5-VL 3B.

NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

arXiv:2601.08247v1 Announce Type: new 
Abstract: Financial markets are influenced by human behavior that deviates from rationality due to cognitive biases. Traditional reinforcement learning (RL) models for financial decision-making assume rational agents, potentially overlooking the impact of psychological factors. This study integrates cognitive biases into RL frameworks for financial trading, hypothesizing that such models can exhibit human-like trading behavior and achieve better risk-adjusted returns than standard RL agents. We introduce biases, such as overconfidence and loss aversion, into reward structures and decision-making processes and evaluate their performance in simulated and real-world trading environments. Despite its inconclusive or negative results, this study provides insights into the challenges of incorporating human-like biases into RL, offering valuable lessons for developing robust financial AI systems.
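As a rough illustration of what introducing biases into reward structures and decision-making could look like, here is a minimal sketch. The function names, the piecewise-linear form, and the default parameters are assumptions and are not taken from the paper. The loss-aversion coefficient of 2.25 is the commonly cited estimate from prospect theory and serves only as an illustrative default.

```python
def loss_averse_reward(pnl, loss_aversion=2.25):
    """Weight losses more heavily than gains of the same size.

    Hypothetical reward shaping: gains pass through unchanged,
    losses are multiplied by the loss-aversion coefficient.
    """
    return pnl if pnl >= 0 else loss_aversion * pnl


def overconfident_position(signal, confidence_bias=1.5, max_position=1.0):
    """Inflate a trading signal in [-1, 1], then clip to the position limit.

    Hypothetical decision-level bias: the agent acts as if its signal
    were stronger than it is.
    """
    scaled = confidence_bias * signal
    return max(-max_position, min(max_position, scaled))


if __name__ == "__main__":
    for pnl in (10.0, -10.0):
        print(pnl, loss_averse_reward(pnl))
    for signal in (0.4, 0.9, -0.9):
        print(signal, overconfident_position(signal))
```

With these defaults, a loss of 10 is scored as -22.5 while a gain of 10 stays at 10, so a standard RL agent trained on the shaped reward would be pushed away from trades with symmetric payoffs.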

A recent arXiv study integrates cognitive biases, such as overconfidence and loss aversion, into the reward structures and decision-making processes of reinforcement learning (RL) agents for financial trading. The authors hypothesized that such agents would exhibit human-like trading behavior and achieve better risk-adjusted returns than standard RL agents that assume rationality. The reported results were inconclusive or negative, so the paper's contribution is a set of lessons on the challenges of building human-like biases into RL.

Incorporating Cognitive Biases into Reinforcement Learning for Financial Decision-Making

arXiv:2510.21060v2 Announce Type: replace 
Abstract: Policy optimization (PO) is a cornerstone of modern reinforcement learning (RL), with diverse applications spanning robotics, healthcare, and large language model training. The increasing deployment of PO in sensitive domains, however, raises significant privacy concerns. In this paper, we initiate a theoretical study of differentially private policy optimization, focusing explicitly on its sample complexity. We first formalize an appropriate definition of differential privacy (DP) tailored to PO, addressing the inherent challenges arising from on-policy learning dynamics and the subtlety involved in defining the unit of privacy. We then systematically analyze the sample complexity of widely-used PO algorithms, including policy gradient (PG), natural policy gradient (NPG) and more, under DP constraints and various settings, via a unified framework. Our theoretical results demonstrate that privacy costs can often manifest as lower-order terms in the sample complexity, while also highlighting subtle yet important observations in private PO settings. These offer valuable practical insights for privacy-preserving PO algorithms.
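One common way to privatize a gradient-based update is the Gaussian mechanism used in DP-SGD: clip each contribution, sum, and add calibrated noise. The sketch below applies that recipe to per-trajectory policy gradients. It is not the paper's algorithm, and it assumes that one trajectory is the unit of privacy, which the abstract notes is a subtle choice. All names are hypothetical, and the privacy accounting that maps `noise_multiplier` to an (epsilon, delta) guarantee is omitted.

```python
import math
import random


def clip_by_l2_norm(vec, clip_norm):
    """Scale vec down so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]


def private_mean_gradient(per_traj_grads, clip_norm, noise_multiplier, rng):
    """Clip each trajectory's gradient, sum, add Gaussian noise, then average.

    Assumption: each trajectory comes from a distinct individual, so
    clipping bounds any one individual's influence on the sum.
    """
    dim = len(per_traj_grads[0])
    total = [0.0] * dim
    for grad in per_traj_grads:
        for i, v in enumerate(clip_by_l2_norm(grad, clip_norm)):
            total[i] += v
    sigma = noise_multiplier * clip_norm  # noise scale tied to sensitivity
    n = len(per_traj_grads)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.0, 0.5]]
    print(private_mean_gradient(grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng))
```

Because the noise is added once to the sum and then divided by the number of trajectories, its effect on the averaged gradient shrinks as the batch grows.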

A recent study opens a theoretical line of work on differentially private policy optimization, focusing on the sample complexity of policy optimization (PO) in reinforcement learning (RL). It formalizes a definition of differential privacy (DP) tailored to PO, taking into account on-policy learning dynamics and the choice of privacy unit, and then analyzes widely used algorithms such as policy gradient (PG) and natural policy gradient (NPG) under DP constraints. The main finding is that privacy costs often appear only as lower-order terms in the sample complexity, which matters for sensitive applications such as robotics and healthcare.

On the Sample Complexity of Differentially Private Policy Optimization
