Concentration of Cumulative Reward in Markov Decision Processes

arXiv — stat.ML · Thursday, December 4, 2025, 5:00:00 AM
  • A recent study explores the concentration properties of cumulative reward in Markov Decision Processes (MDPs) in both asymptotic and non-asymptotic settings. The work introduces a unified approach to characterizing reward concentration across infinite-horizon and finite-horizon frameworks, establishing results such as a law of large numbers and a central limit theorem for the cumulative reward.
  • This development advances the understanding of reward dynamics in MDPs, which are widely used across fields such as artificial intelligence and operations research. The findings provide a foundation for improving decision-making in stochastic environments.
  • The results also bear on the broader study of stochastic processes, particularly controlled Markov chains. Central limit theorems for estimating transition matrices in such chains underscore ongoing progress in the field and the importance of rigorous statistical frameworks for policy evaluation and learning in complex systems.
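The law of large numbers and central limit theorem described above can be illustrated empirically. The sketch below simulates a toy two-state Markov reward process (the transition probabilities, rewards, and helper names are assumptions for the demo, not taken from the paper) and checks that the per-step average reward concentrates around the stationary mean, while suitably scaled deviations keep a stable, CLT-like spread.

```python
import random
import statistics

# Illustrative two-state Markov reward process (parameters assumed
# for this demo, not from the paper).
P = [[0.9, 0.1],   # transition probabilities from state 0
     [0.5, 0.5]]   # transition probabilities from state 1
r = [1.0, 0.0]     # per-state reward

def run_chain(T, seed=0):
    """Return cumulative reward over T steps, starting in state 0."""
    rng = random.Random(seed)
    s, total = 0, 0.0
    for _ in range(T):
        total += r[s]
        s = 0 if rng.random() < P[s][0] else 1
    return total

# Law of large numbers: the per-step average converges to the
# stationary mean reward. Here pi = (5/6, 1/6) solves pi P = pi,
# so the long-run average reward is 5/6.
avg = run_chain(200_000) / 200_000

# CLT-style check: across independent runs, the scaled deviations
# sqrt(T) * (mean - 5/6) should have a roughly constant spread.
T = 2_000
scaled = [(T ** 0.5) * (run_chain(T, seed=k) / T - 5 / 6)
          for k in range(200)]
spread = statistics.stdev(scaled)
```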
— via World Pulse Now AI Editorial System


Continue Reading
An Introduction to Deep Reinforcement and Imitation Learning
Neutral · Artificial Intelligence
The introduction of Deep Reinforcement Learning (DRL) and Deep Imitation Learning (DIL) highlights the significance of learning-based approaches for embodied agents, such as robots and virtual characters, which must navigate complex decision-making tasks. This document emphasizes foundational algorithms like REINFORCE and Proximal Policy Optimization, providing a concise overview of essential concepts in the field.