From Solo to Symphony: Orchestrating Multi-Agent Collaboration with Single-Agent Demos

arXiv — cs.LG · Wednesday, November 5, 2025 at 5:00:00 AM
A recent study proposes a novel approach to multi-agent reinforcement learning: train agents individually before enabling them to collaborate. The method aims to improve the efficiency of multi-agent systems by leveraging solo experience, which the authors identify as crucial for effective teamwork. A key challenge it addresses is the high cost of collecting multi-agent data; solo data is far cheaper to gather, so training individually first can reduce both the complexity and the expense of assembling collaborative experience. Although the claim that individual training improves overall efficiency remains to be verified, the approach promises streamlined data acquisition and improved team performance, and could be a significant step toward orchestrating multi-agent collaboration more effectively.
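The summary only names the two phases (solo training, then collaboration), not the paper's actual algorithm. As a minimal sketch of the general idea, the hypothetical pipeline below trains two agents independently with tabular Q-learning on a toy 1-D chain task; the resulting solo Q-tables are the kind of cheap single-agent artifact that could later warm-start joint training. All names (`train_solo`, the chain environment) are illustrative, not from the paper.

```python
import random

def train_solo(n_states=5, episodes=200, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning for a single agent on a 1-D chain (goal: rightmost state)."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]  # actions: 0 = left, 1 = right
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            if rng.random() < eps:
                a = rng.randrange(2)                    # explore
            else:
                a = max((1, 0), key=lambda x: Q[s][x])  # exploit (ties go right)
            s2 = min(n_states - 1, s + 1) if a == 1 else max(0, s - 1)
            r = 1.0 if s2 == n_states - 1 else 0.0      # reward only at the goal
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

# Solo phase: each agent gathers cheap single-agent experience independently.
solo_qs = [train_solo(seed=i) for i in range(2)]
# Collaboration phase (not shown here): the solo Q-tables would warm-start joint training.
```

The point of the sketch is the cost asymmetry the summary describes: the solo phase needs no coordination, no shared environment, and no joint data collection, yet produces per-agent value estimates that a subsequent multi-agent phase can build on.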
— via World Pulse Now AI Editorial System


Continue Reading
Provable Memory Efficient Self-Play Algorithm for Model-free Reinforcement Learning
Positive · Artificial Intelligence
A new model-free self-play algorithm, Memory-Efficient Nash Q-Learning (ME-Nash-QL), has been introduced for two-player zero-sum Markov games, addressing key challenges in multi-agent reinforcement learning (MARL) such as memory inefficiency and high computational complexity. This algorithm is designed to produce an ε-approximate Nash policy with significantly reduced space and sample complexity.
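ME-Nash-QL's guarantee is stated in terms of an ε-approximate Nash policy. A short illustration of what that means (this is the standard exploitability criterion, not the paper's algorithm): a strategy pair is ε-Nash when the total gain either player could obtain by best-responding is at most ε. The sketch below, with assumed names, measures this for mixed strategies in a 2x2 zero-sum matrix game.

```python
def exploitability(A, p, q):
    """Best-response gap for payoff matrix A (row player maximizes, column
    player minimizes), given row mixed strategy p and column mixed strategy q."""
    # row player's expected payoff for each pure row against mixed q
    row_payoffs = [sum(A[i][j] * q[j] for j in range(len(q))) for i in range(len(A))]
    # expected payoff handed to each pure column against mixed p
    col_payoffs = [sum(A[i][j] * p[i] for i in range(len(p))) for j in range(len(A[0]))]
    value = sum(p[i] * row_payoffs[i] for i in range(len(p)))
    # row player's incentive to deviate plus column player's incentive to deviate
    return (max(row_payoffs) - value) + (value - min(col_payoffs))

# Matching pennies: the unique Nash equilibrium mixes 50/50 on both sides.
A = [[1.0, -1.0], [-1.0, 1.0]]
nash = exploitability(A, [0.5, 0.5], [0.5, 0.5])    # exact equilibrium: gap 0
biased = exploitability(A, [0.7, 0.3], [0.5, 0.5])  # a biased row strategy is exploitable
```

A pair with exploitability at most ε is exactly an ε-approximate Nash policy; algorithms like ME-Nash-QL aim to reach this threshold with as few samples, and as little memory, as possible.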
Sample-Efficient Tabular Self-Play for Offline Robust Reinforcement Learning
Positive · Artificial Intelligence
A new model-based algorithm, RTZ-VI-LCB, has been proposed for robust two-player zero-sum Markov games in offline settings, focusing on sample-efficient tabular self-play for multi-agent reinforcement learning. This algorithm combines optimistic robust value iteration with a data-driven penalty term to enhance robust value estimation under environmental uncertainties.
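The blurb names the two ingredients, robust value iteration and a data-driven penalty term, without giving their form. A generic single-agent sketch of the lower-confidence-bound (LCB) idea used in offline RL is below: the Q-update subtracts a bonus that shrinks as the offline visit count n(s, a) grows, so poorly covered actions are valued pessimistically. The penalty shape `c / sqrt(n)` and all names are assumptions for illustration, not RTZ-VI-LCB itself.

```python
import math

def pessimistic_q(r_hat, p_hat, counts, gamma=0.9, c=1.0, iters=50):
    """Value iteration with an LCB penalty b(s, a) = c / sqrt(n(s, a)).
    r_hat[s][a]: estimated reward; p_hat[s][a][t]: estimated transition
    probabilities; counts[s][a]: visit counts in the offline dataset."""
    S, A = len(r_hat), len(r_hat[0])
    V = [0.0] * S
    for _ in range(iters):
        Q = [[0.0] * A for _ in range(S)]
        for s in range(S):
            for a in range(A):
                bonus = c / math.sqrt(max(counts[s][a], 1))
                # pessimistic backup, clipped at zero
                Q[s][a] = max(0.0, r_hat[s][a]
                              + gamma * sum(p_hat[s][a][t] * V[t] for t in range(S))
                              - bonus)
        V = [max(Q[s]) for s in range(S)]
    return Q

# Two states, two actions: at state 0, action 1 has the higher estimated reward
# but only 2 observations, so the penalty makes well-covered action 0 win.
r = [[0.5, 0.6], [0.0, 0.0]]
p = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
n = [[100, 2], [100, 100]]
Q = pessimistic_q(r, p, n)
```

The penalty is what makes offline methods of this family robust to data scarcity: an apparently lucrative but barely observed action cannot dominate the learned policy.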