Interactive and Hybrid Imitation Learning: Provably Beating Behavior Cloning

arXiv — stat.ML · Thursday, December 4, 2025 at 5:00:00 AM
  • Recent research in imitation learning (IL) demonstrates that interactive methods can provably outperform traditional Behavior Cloning (BC) when annotation costs are measured per state. The study introduces algorithms such as Stagger and Warm Stagger, which combine offline demonstrations with interactive per-state annotations to improve learning efficiency (a hedged sketch of such a hybrid loop appears after the summary below).
  • This advancement is significant because it addresses a long-noted limitation of BC: its inability to adapt to varying annotation costs. The findings suggest a promising shift toward hybrid approaches in IL, potentially yielding more robust decision-making policies.
  • The exploration of BC's vulnerabilities, such as susceptibility to dataset poisoning attacks, highlights the ongoing challenge of ensuring the reliability of machine-learning models. As the field evolves, integrating reinforcement learning with imitation learning, as in world-model frameworks such as RoboScape-R, may further enhance the generalizability and safety of robotic training systems.
— via World Pulse Now AI Editorial System
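The summary describes Stagger and Warm Stagger only at a high level: warm-start a policy with behavior cloning on offline demonstrations, then spend a per-state annotation budget on states the learner itself visits. Below is a minimal sketch of that kind of hybrid loop, assuming a linear policy, synthetic data, and a DAgger-style aggregate-and-refit rule; the names and query schedule are illustrative and are not the paper's algorithms.

```python
# Hedged sketch of a hybrid imitation-learning loop: behavior cloning on
# offline demonstrations, then interactive per-state expert annotations.
# The linear policy, toy dynamics, and query schedule are assumptions for
# illustration only; they are not the Stagger / Warm Stagger algorithms.
import numpy as np

rng = np.random.default_rng(0)
DIM = 4                                    # state and action dimension (toy)
W_expert = rng.normal(size=(DIM, DIM))     # stand-in "expert" policy

def expert_action(states):
    return states @ W_expert

def fit_bc(states, actions):
    # Least-squares behavior cloning: regress expert actions on states.
    W, *_ = np.linalg.lstsq(states, actions, rcond=None)
    return W

def rollout(W_policy, horizon=20):
    # Collect the states the learner itself visits under its current policy.
    s, visited = rng.normal(size=DIM), []
    for _ in range(horizon):
        visited.append(s)
        s = 0.9 * s + 0.1 * (s @ W_policy) + 0.01 * rng.normal(size=DIM)
    return np.array(visited)

# Phase 1: warm start from the offline demonstration set.
demo_states = rng.normal(size=(200, DIM))
demo_actions = expert_action(demo_states)
W_policy = fit_bc(demo_states, demo_actions)

# Phase 2: interactive rounds; each queried state spends one unit of the
# per-state annotation budget, and the aggregated dataset is refit each round.
states_agg, actions_agg = [demo_states], [demo_actions]
for _ in range(5):
    visited = rollout(W_policy)
    labels = expert_action(visited)         # interactive per-state annotation
    states_agg.append(visited)
    actions_agg.append(labels)
    W_policy = fit_bc(np.vstack(states_agg), np.vstack(actions_agg))

print("parameter gap to expert:", np.linalg.norm(W_policy - W_expert))
```

The point of the interactive phase is that annotations are gathered on the learner's own state distribution, which is exactly where a policy cloned only from offline demonstrations tends to drift.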


Continue Reading
RoboScape-R: Unified Reward-Observation World Models for Generalizable Robotics Training via RL
Positive · Artificial Intelligence
The introduction of RoboScape-R marks a significant advancement in the field of robotics, proposing a unified reward-observation world model aimed at enhancing generalizable training through reinforcement learning (RL). This framework addresses the limitations of traditional policy learning methods, which often struggle with generalization across diverse scenarios. By leveraging a world model as a universal environment proxy, RoboScape-R seeks to create a more adaptable training environment for robotic systems.
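The phrase "world model as a universal environment proxy" can be made concrete with a toy sketch: a learned model predicts the next observation and a reward, and a policy is optimized entirely inside that proxy. The linear-tanh dynamics, reward head, and random-search optimizer below are assumptions for illustration, not RoboScape-R's architecture.

```python
# Hedged sketch of policy optimization inside a learned reward-observation
# world model. All components here are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(1)
OBS_DIM, ACT_DIM = 6, 2

class WorldModel:
    """Stand-in for a learned model that predicts next observation and reward."""
    def __init__(self):
        self.A = rng.normal(scale=0.3, size=(OBS_DIM, OBS_DIM))
        self.B = rng.normal(scale=0.3, size=(ACT_DIM, OBS_DIM))
        self.w_reward = rng.normal(size=OBS_DIM)

    def step(self, obs, act):
        next_obs = np.tanh(obs @ self.A + act @ self.B)   # predicted observation
        reward = float(next_obs @ self.w_reward)          # predicted reward
        return next_obs, reward

def imagined_return(model, policy_W, horizon=30):
    # Roll a linear policy out inside the model only; no real robot involved.
    obs, total = rng.normal(size=OBS_DIM), 0.0
    for _ in range(horizon):
        obs, r = model.step(obs, obs @ policy_W)
        total += r
    return total

model = WorldModel()
best_W, best_ret = None, -np.inf
for _ in range(200):            # crude random search standing in for an RL optimizer
    W = rng.normal(scale=0.5, size=(OBS_DIM, ACT_DIM))
    ret = imagined_return(model, W)
    if ret > best_ret:
        best_W, best_ret = W, ret
print(f"best imagined return inside the proxy: {best_ret:.2f}")
```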
Pre-trained Language Models Improve the Few-shot Prompt Ability of Decision Transformer
Positive · Artificial Intelligence
The introduction of the Language model-initialized Prompt Decision Transformer (LPDT) framework marks a significant advancement in offline reinforcement learning (RL) by enhancing the few-shot prompt ability of Decision Transformers. This framework utilizes pre-trained language models to improve performance on unseen tasks, addressing challenges related to data collection and the limitations of traditional Prompt-DT methods.
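As a rough illustration of prompt conditioning in a Decision-Transformer-style model, the sketch below embeds a short few-shot trajectory from the new task and prepends it to the target sequence before a transformer backbone; in LPDT that backbone would be initialized from a pre-trained language model. The tiny architecture and embeddings here are assumptions for illustration, not the LPDT implementation.

```python
# Hedged sketch of prompt-conditioned action prediction (Decision-Transformer
# style). The architecture is a toy stand-in, not the LPDT code.
import torch
import torch.nn as nn

STATE_DIM, ACT_DIM, EMBED_DIM = 8, 3, 32

class TinyPromptDT(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed_state = nn.Linear(STATE_DIM, EMBED_DIM)
        self.embed_action = nn.Linear(ACT_DIM, EMBED_DIM)
        layer = nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=4,
                                           batch_first=True)
        # In LPDT this backbone would be initialized from a pre-trained language
        # model (an assumption about the wiring, not the paper's exact code).
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.predict_action = nn.Linear(EMBED_DIM, ACT_DIM)

    def forward(self, prompt_states, prompt_actions, task_states):
        # Few-shot prompt: embed a short expert trajectory from the unseen task
        # and prepend it, so action prediction is conditioned on it.
        prompt = self.embed_state(prompt_states) + self.embed_action(prompt_actions)
        query = self.embed_state(task_states)
        hidden = self.backbone(torch.cat([prompt, query], dim=1))
        return self.predict_action(hidden[:, prompt.shape[1]:])

model = TinyPromptDT()
prompt_s = torch.randn(1, 5, STATE_DIM)   # 5 prompt timesteps from the new task
prompt_a = torch.randn(1, 5, ACT_DIM)
task_s = torch.randn(1, 10, STATE_DIM)    # 10 timesteps to predict actions for
print(model(prompt_s, prompt_a, task_s).shape)   # torch.Size([1, 10, 3])
```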