Revisiting Multi-Agent World Modeling from a Diffusion-Inspired Perspective

arXiv — cs.LG · Monday, October 27, 2025 at 4:00:00 AM
A recent study on Multi-Agent Reinforcement Learning (MARL) highlights the potential of world models to improve sample efficiency in policy learning. Accurately modeling MARL environments is hard: the joint action space grows combinatorially with the number of agents, and the dynamics are uncertain. By adopting a diffusion-inspired approach, the study aims to make these models more tractable, so that agents can learn and adapt more easily. This matters because it could enable more sample-efficient learning strategies across multi-agent systems.
— via World Pulse Now AI Editorial System
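The abstract gives no implementation details, but the general recipe behind diffusion-style world models is easy to sketch. The following is a minimal, hypothetical illustration, not the paper's method: a reverse-diffusion loop that samples the next joint observation conditioned on the current state and joint action, with a random linear map standing in for a trained denoising network. All dimensions, the noise schedule, and the conditioning scheme are assumptions.

```python
# Sketch only: diffusion-style next-observation sampling for a multi-agent
# world model. The linear `denoiser` is a stand-in for a learned network;
# sizes and the noise schedule are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, N_AGENTS, ACT_DIM, T = 8, 3, 2, 50  # hypothetical sizes / diffusion steps

# Linear beta schedule and the derived alpha terms used by DDPM-style samplers.
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# Placeholder "network": maps (noisy next-obs, state, joint action, step) to noise.
W = rng.normal(scale=0.1, size=(OBS_DIM, 2 * OBS_DIM + N_AGENTS * ACT_DIM + 1))

def denoiser(x_t, state, joint_action, t):
    """Stand-in for a trained epsilon-prediction network."""
    feats = np.concatenate([x_t, state, joint_action.ravel(), [t / T]])
    return W @ feats

def sample_next_obs(state, joint_action):
    """Reverse diffusion: start from Gaussian noise, denoise for T steps."""
    x = rng.normal(size=OBS_DIM)
    for t in reversed(range(T)):
        eps_hat = denoiser(x, state, joint_action, t)
        # Standard DDPM posterior mean; inject noise except at the final step.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.normal(size=OBS_DIM)
    return x

state = rng.normal(size=OBS_DIM)
joint_action = rng.normal(size=(N_AGENTS, ACT_DIM))  # one action per agent
print(sample_next_obs(state, joint_action))
```

Conditioning the denoiser on the full joint action is one plausible design; it keeps the model a single network at the cost of an input that grows with the number of agents.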


Continue Reading
Provable Memory Efficient Self-Play Algorithm for Model-free Reinforcement Learning
Positive · Artificial Intelligence
A new model-free self-play algorithm, Memory-Efficient Nash Q-Learning (ME-Nash-QL), has been introduced for two-player zero-sum Markov games, addressing key challenges in multi-agent reinforcement learning (MARL) such as memory inefficiency and high computational complexity. This algorithm is designed to produce an $\varepsilon$-approximate Nash policy with significantly reduced space and sample complexity.
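For context, Nash Q-learning-style methods bootstrap each update with the Nash value of a zero-sum stage game. The sketch below is a generic tabular version of that idea, not ME-Nash-QL itself (whose memory-saving machinery the summary does not describe); the environment sizes, learning rate, and random transitions are illustrative assumptions.

```python
# Sketch only: tabular Nash Q-learning for a two-player zero-sum Markov game.
# Each stage game's value max_x min_y x^T M y is computed with a small LP.
import numpy as np
from scipy.optimize import linprog

N_STATES, N_A, N_B, GAMMA = 4, 3, 3, 0.9
Q = np.zeros((N_STATES, N_A, N_B))  # player 1's Q; player 2 receives -Q

def stage_value(M):
    """Maximin value of the zero-sum matrix game with payoff matrix M."""
    m = M.shape[0]
    c = np.zeros(m + 1); c[-1] = -1.0          # variables [x_1..x_m, v]; minimize -v
    A_ub = np.hstack([-M.T, np.ones((M.shape[1], 1))])  # v <= (M^T x)_j for all j
    b_ub = np.zeros(M.shape[1])
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])  # x on the simplex
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[-1]

def nash_q_update(s, a, b, r, s_next, lr=0.1):
    """One Q-learning step, bootstrapping with the Nash value of the next stage game."""
    target = r + GAMMA * stage_value(Q[s_next])
    Q[s, a, b] += lr * (target - Q[s, a, b])

# Toy usage on random transitions (a real run would interact with a Markov game).
rng = np.random.default_rng(0)
for _ in range(200):
    s, a, b = rng.integers(N_STATES), rng.integers(N_A), rng.integers(N_B)
    nash_q_update(s, a, b, rng.normal(), rng.integers(N_STATES))
print(stage_value(Q[0]))
```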
Sample-Efficient Tabular Self-Play for Offline Robust Reinforcement Learning
Positive · Artificial Intelligence
A new model-based algorithm, RTZ-VI-LCB, has been proposed for robust two-player zero-sum Markov games in offline settings, focusing on sample-efficient tabular self-play for multi-agent reinforcement learning. This algorithm combines optimistic robust value iteration with a data-driven penalty term to enhance robust value estimation under environmental uncertainties.
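Again for context only: the "empirical model plus data-driven penalty" recipe behind LCB-style offline algorithms can be sketched in a few lines. This is not RTZ-VI-LCB; the penalty constant, the sizes, and the pure-strategy maximin shortcut (a real solver would use the mixed-strategy Nash value of each stage game) are all assumptions.

```python
# Sketch only: penalty-adjusted value iteration for a two-player zero-sum
# Markov game built from offline counts. Illustrates the general recipe,
# not RTZ-VI-LCB itself; all constants and sizes are assumptions.
import numpy as np

rng = np.random.default_rng(0)
S, A, B, GAMMA, C_PEN = 4, 2, 2, 0.9, 1.0

# Fake offline dataset statistics: visit counts and an empirical model.
counts = rng.integers(1, 50, size=(S, A, B))       # n(s, a, b)
P_hat = rng.dirichlet(np.ones(S), size=(S, A, B))  # empirical P(s' | s, a, b)
R_hat = rng.normal(size=(S, A, B))                 # empirical mean reward

V = np.zeros(S)
for _ in range(100):
    penalty = C_PEN / np.sqrt(counts)              # shrinks with data coverage
    Q = R_hat + GAMMA * P_hat @ V - penalty        # penalty-adjusted Q estimate
    # Max player picks a, min player picks b (pure-strategy maximin shortcut).
    V_new = np.max(np.min(Q, axis=2), axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V)
```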