arXiv:2511.18181v1 Announce Type: new 
Abstract: This paper addresses a critical gap in Multi-Objective Multi-Agent Reinforcement Learning (MOMARL) by introducing the first dedicated inner-loop actor-critic framework for continuous state and action spaces: Multi-Objective Multi-Agent Actor-Critic (MOMA-AC). Building on single-objective, single-agent algorithms, we instantiate this framework with Twin Delayed Deep Deterministic Policy Gradient (TD3) and Deep Deterministic Policy Gradient (DDPG), yielding MOMA-TD3 and MOMA-DDPG. The framework combines a multi-headed actor network, a centralised critic, and an objective preference-conditioning architecture, enabling a single neural network to encode the Pareto front of optimal trade-off policies for all agents across conflicting objectives in a continuous MOMARL setting. We also outline a natural test suite for continuous MOMARL by combining a pre-existing multi-agent single-objective physics simulator with its multi-objective single-agent counterpart. Evaluating cooperative locomotion tasks in this suite, we show that our framework achieves statistically significant improvements in expected utility and hypervolume relative to outer-loop and independent training baselines, while demonstrating stable scalability as the number of agents increases. These results establish our framework as a foundational step towards robust, scalable multi-objective policy learning in continuous multi-agent domains.

Multi-Objective Multi-Agent Actor-Critic (MOMA-AC) is a new inner-loop actor-critic framework for Multi-Objective Multi-Agent Reinforcement Learning (MOMARL) in continuous state and action spaces. Instantiated with Twin Delayed Deep Deterministic Policy Gradient (TD3) and Deep Deterministic Policy Gradient (DDPG) as MOMA-TD3 and MOMA-DDPG, it combines a multi-headed actor network, a centralised critic, and preference conditioning so that a single network can encode trade-off policies across conflicting objectives.

MOMA-AC: A preference-driven actor-critic framework for continuous multi-objective multi-agent reinforcement learning
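
The three components named in the abstract (a multi-headed actor, a centralised critic, and preference conditioning) can be pictured with a small forward-pass sketch. This is not the authors' implementation: the shared-trunk layout, the layer sizes, the vector-valued critic, the linear scalarisation, and all names (`MultiHeadActor`, `CentralisedCritic`, `scalarise`) are assumptions made for illustration, and the weights are random rather than trained.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Randomly initialised dense layer (weights, biases); stands in for trained parameters."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(layer, x, activation=None):
    """Apply a dense layer to the vector x, optionally followed by an activation."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [activation(v) for v in out] if activation else out


def relu(v):
    return max(0.0, v)


class MultiHeadActor:
    """Assumed layout: one shared trunk, one output head per agent."""

    def __init__(self, n_agents, obs_dim, act_dim, n_obj, hidden=16, seed=0):
        rng = random.Random(seed)
        # The trunk sees an agent's observation concatenated with the preference vector.
        self.trunk = make_layer(obs_dim + n_obj, hidden, rng)
        self.heads = [make_layer(hidden, act_dim, rng) for _ in range(n_agents)]

    def act(self, observations, pref):
        actions = []
        for obs, head in zip(observations, self.heads):
            h = forward(self.trunk, list(obs) + list(pref), relu)
            actions.append(forward(head, h, math.tanh))  # continuous actions in [-1, 1]
        return actions


class CentralisedCritic:
    """Assumed layout: joint observations, joint actions and preference in, one Q-value per objective out."""

    def __init__(self, n_agents, obs_dim, act_dim, n_obj, hidden=16, seed=1):
        rng = random.Random(seed)
        self.l1 = make_layer(n_agents * (obs_dim + act_dim) + n_obj, hidden, rng)
        self.l2 = make_layer(hidden, n_obj, rng)

    def q_vector(self, observations, actions, pref):
        x = [v for obs in observations for v in obs]
        x += [v for act in actions for v in act]
        x += list(pref)
        return forward(self.l2, forward(self.l1, x, relu))


def scalarise(q_vec, pref):
    """Linear utility w . Q (an assumed choice of scalarisation)."""
    return sum(w * q for w, q in zip(pref, q_vec))


actor = MultiHeadActor(n_agents=2, obs_dim=3, act_dim=2, n_obj=2)
critic = CentralisedCritic(n_agents=2, obs_dim=3, act_dim=2, n_obj=2)
obs = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.4]]

# Sweeping the preference vector queries different trade-offs from the same networks.
for pref in ([1.0, 0.0], [0.5, 0.5], [0.0, 1.0]):
    joint_action = actor.act(obs, pref)
    q = critic.q_vector(obs, joint_action, pref)
    print(pref, joint_action, scalarise(q, pref))
```

Feeding the preference vector to both networks is what would let one set of weights represent many trade-off policies; an outer-loop baseline would instead train a separate policy per preference.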

arXiv:2511.18728v1 Announce Type: new 
Abstract: The transition to autonomous material systems necessitates adaptive control methodologies to maximize structural longevity. This study frames the self-healing process as a Reinforcement Learning (RL) problem within a Markov Decision Process (MDP), enabling agents to autonomously derive optimal policies that efficiently balance structural integrity maintenance against finite resource consumption. A comparative evaluation of discrete-action (Q-learning, DQN) and continuous-action (TD3) agents in a stochastic simulation environment revealed that RL controllers significantly outperform heuristic baselines, achieving near-complete material recovery. Crucially, the TD3 agent utilizing continuous dosage control demonstrated superior convergence speed and stability, underscoring the necessity of fine-grained, proportional actuation in dynamic self-healing applications.

A recent study frames the self-healing of material systems as a Reinforcement Learning (RL) problem within a Markov Decision Process (MDP), in which agents learn policies that balance structural integrity against finite resource consumption. In a stochastic simulation, the RL controllers (Q-learning, DQN, and TD3) outperformed heuristic baselines and achieved near-complete material recovery; the continuous-action TD3 agent showed the best convergence speed and stability.

Reinforcement Learning for Self-Healing Material Systems
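
To make the MDP framing concrete, here is a toy environment in which a controller chooses a continuous healing dose at each step. It is only a sketch: the state variables, damage model, reward, every constant, and the names `SelfHealingEnv`, `proportional_policy`, and `rollout` are hypothetical and are not taken from the paper's simulator. The hand-written proportional policy is there only to exercise the environment; it is not one of the paper's agents or baselines.

```python
import random


class SelfHealingEnv:
    """Toy self-healing MDP. All dynamics and constants are illustrative assumptions."""

    def __init__(self, reserve=5.0, damage_prob=0.3, max_hit=0.2,
                 heal_rate=0.5, cost=0.1, horizon=50, seed=0):
        self.initial_reserve = reserve
        self.damage_prob = damage_prob
        self.max_hit = max_hit
        self.heal_rate = heal_rate
        self.cost = cost
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.integrity = 1.0               # 1.0 = undamaged, 0.0 = failed
        self.reserve = self.initial_reserve  # finite supply of healing agent
        self.t = 0
        return (self.integrity, self.reserve)

    def step(self, dose):
        # Continuous action: a dose in [0, 1], limited by what is left in the reserve.
        dose = max(0.0, min(dose, 1.0, self.reserve))
        # Stochastic damage event.
        if self.rng.random() < self.damage_prob:
            self.integrity -= self.rng.uniform(0.0, self.max_hit)
        self.integrity = max(0.0, self.integrity)
        # Healing proportional to the dose, capped at full integrity.
        self.integrity = min(1.0, self.integrity + self.heal_rate * dose)
        self.reserve -= dose
        self.t += 1
        # Reward trades integrity against resource use.
        reward = self.integrity - self.cost * dose
        done = self.t >= self.horizon or self.integrity <= 0.0
        return (self.integrity, self.reserve), reward, done


def proportional_policy(state, gain=2.0):
    """Dose proportional to the current damage (placeholder controller)."""
    integrity, _reserve = state
    return min(1.0, gain * (1.0 - integrity))


def rollout(env, policy):
    """Run one episode and return the total reward and the visited states."""
    state = env.reset()
    total, trajectory, done = 0.0, [], False
    while not done:
        state, reward, done = env.step(policy(state))
        total += reward
        trajectory.append(state)
    return total, trajectory


total_reward, states = rollout(SelfHealingEnv(), proportional_policy)
print(len(states), total_reward)
```

A discrete-action agent such as Q-learning or DQN would have to pick from a fixed set of doses (for example 0, 0.5, or 1), whereas a continuous-action agent such as TD3 can output any value in the interval.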

arXiv:2511.19165v1 Announce Type: new 
Abstract: We propose a refinement of temporal-difference learning that enforces first-order Bellman consistency: the learned value function is trained to match not only the Bellman targets in value but also their derivatives with respect to states and actions. By differentiating the Bellman backup through differentiable dynamics, we obtain analytically consistent gradient targets. Incorporating these into the critic objective using a Sobolev-type loss encourages the critic to align with both the value and local geometry of the target function. This first-order TD matching principle can be seamlessly integrated into existing algorithms, such as Q-learning or actor-critic methods (e.g., DDPG, SAC), potentially leading to faster critic convergence and more stable policy gradients without altering their overall structure.

A refinement of temporal-difference learning has been proposed that enforces first-order Bellman consistency. The critic is trained with a Sobolev-type loss to match both the Bellman targets and their derivatives with respect to states and actions, which may lead to faster critic convergence and more stable policy gradients in algorithms such as Q-learning and actor-critic methods (e.g., DDPG, SAC).
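
One way to write out the first-order matching idea is sketched below. The notation is an assumption, not the paper's: deterministic differentiable dynamics $s' = f(s,a)$, reward $r(s,a)$, a deterministic target policy $\pi$, a target critic $Q_{\bar\theta}$, and loss weights $\lambda_s, \lambda_a$. The gradient targets follow from applying the chain rule to the usual Bellman target.

```latex
% Zeroth-order (standard) TD target
y(s,a) = r(s,a) + \gamma\, Q_{\bar\theta}\bigl(s', \pi(s')\bigr), \qquad s' = f(s,a)

% Total derivative of the bootstrapped value with respect to the next state
g(s') = \nabla_{s'} Q_{\bar\theta}(s', a')\big|_{a'=\pi(s')}
      + \Bigl(\tfrac{\partial \pi}{\partial s'}\Bigr)^{\!\top}
        \nabla_{a'} Q_{\bar\theta}(s', a')\big|_{a'=\pi(s')}

% First-order targets obtained by differentiating through the dynamics
\nabla_s y = \nabla_s r(s,a) + \gamma \Bigl(\tfrac{\partial f}{\partial s}\Bigr)^{\!\top} g(s'),
\qquad
\nabla_a y = \nabla_a r(s,a) + \gamma \Bigl(\tfrac{\partial f}{\partial a}\Bigr)^{\!\top} g(s')

% Sobolev-type critic loss (targets treated as constants)
\mathcal{L}(\theta) = \mathbb{E}\Bigl[
    \bigl(Q_\theta(s,a) - y\bigr)^2
  + \lambda_s \bigl\lVert \nabla_s Q_\theta(s,a) - \nabla_s y \bigr\rVert^2
  + \lambda_a \bigl\lVert \nabla_a Q_\theta(s,a) - \nabla_a y \bigr\rVert^2
\Bigr]
```

Setting $\lambda_s = \lambda_a = 0$ recovers the ordinary squared TD error, which is consistent with the abstract's claim that the method can be added to existing algorithms without changing their overall structure.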

