Solving Multi-Agent Safe Optimal Control with Distributed Epigraph Form MARL

*Equal contribution     1Massachusetts Institute of Technology     2MIT Lincoln Laboratory

Def-MARL: A more stable safe MARL framework for the zero-constraint-violation setting.

Crazyflies crossing a narrow corridor using Def-MARL

Crazyflies collaboratively inspecting the target using Def-MARL

Simulations using Def-MARL

Abstract

Tasks for multi-robot systems often require the robots to collaborate and complete a team goal while maintaining safety. This problem is usually formalized as a constrained Markov decision process (CMDP), which targets minimizing a global cost and keeping the mean constraint violation below a user-defined threshold. Inspired by real-world robotic applications, we define safety as zero constraint violation. While many safe multi-agent reinforcement learning (MARL) algorithms have been proposed to solve CMDPs, these algorithms suffer from unstable training in this setting. To tackle this, we use the epigraph form for constrained optimization to improve training stability and prove that the centralized epigraph form problem can be solved in a distributed fashion by each agent. This results in a novel centralized training distributed execution MARL algorithm named Def-MARL. Simulation experiments on 8 different tasks across 2 different simulators show that Def-MARL achieves the best overall performance, satisfies safety constraints, and maintains stable training. Real-world hardware experiments on Crazyflie quadcopters demonstrate that, compared with other methods, Def-MARL can safely coordinate agents to complete complex collaborative tasks.

Problem setting

We consider the multi-agent safe optimal control problem (MASOCP) with discrete-time, unknown dynamics, partial observability, and input constraints. Given \(N\) agents, we aim to design distributed policies \(\pi_1, \dots, \pi_N\) such that:

The task is done: \(\min_{\pi_1,\dots,\pi_N} \sum_{k=0}^\infty l({x}^k, \pi({x}^k))\),
following the unknown dynamics: \({x}^{k+1} = f({x}^k, {\pi}({x}^k))\),
and the agents are safe: \(h_i(o_i^k)\leq 0, \quad o_i^k=O_i({x}^k)\), where \(o_i^k\) is the local observation of agent \(i\) (a hypothetical toy instance of \(l\) and \(h_i\) is sketched below).
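To make these ingredients concrete, below is a minimal, hypothetical Python sketch of a 2-D navigation instance (not one of the paper's environments); the goal position, safety distance, and function names are illustrative assumptions.

import numpy as np

# Hypothetical toy instance of the MASOCP ingredients: N agents move in 2-D,
# the team cost l drives them toward a shared goal, and each h_i encodes a
# pairwise collision-avoidance constraint (h_i <= 0 means agent i is safe).
GOAL = np.array([1.0, 1.0])   # assumed shared goal position
SAFE_DIST = 0.2               # assumed minimum inter-agent distance

def team_cost(positions):
    """l(x^k, u^k): sum of squared distances of all agents to the goal."""
    return float(np.sum((positions - GOAL) ** 2))

def constraint(i, positions):
    """h_i(o_i^k): positive iff agent i is within SAFE_DIST of another agent."""
    dists = np.linalg.norm(positions - positions[i], axis=-1)
    dists[i] = np.inf  # ignore self-distance
    return float(SAFE_DIST - dists.min())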

Epigraph Form

As Lagrangian methods usually suffer from unstable training, we use the epigraph form for constrained optimization to improve training stability. First, we define the cost-value function \(V^l\) using the standard optimal control formulation: \[ V^l(x^\tau; \pi) = \sum_{k\geq\tau} l(x^k, \pi(x^k)). \] We also define the constraint-value function \(V^{h}\) as the maximum constraint violation: \[ V^{h}(x^\tau; \pi) = \max_{k\geq\tau}h(x^k) = \max_{k\geq\tau}\max_i h_i(o_i^k) = \max_i\max_{k\geq\tau} h_i(o_i^k) = \max_i V^{h}_i(o_i^\tau; \pi). \] Then, we can rewrite the MASOCP as: \[ \min_{\pi_1,\dots,\pi_N} V^l(x^0; \pi) \quad \text{s.t. } V^{h}(x^0; \pi) \leq 0. \] The epigraph form then reads: \[ \min_z \; z \quad \text{s.t. } \min_{\pi_1,\dots,\pi_N} V(x^0, z; \pi) \leq 0, \] where \(V(x^\tau, z; \pi) = \max\left\{\max_i V_i^h(o_i^\tau;\pi),\, V^l(x^\tau;\pi)-z\right\}\) is the total value function, evaluated here at \(\tau = 0\).
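As an illustrative sketch (not the paper's implementation), the total value \(V\) can be evaluated on a single rollout, treating the finite horizon as a surrogate for the infinite sum and max; the array names below are assumptions.

import numpy as np

def epigraph_total_value(costs, violations, z):
    """Evaluate V(x^0, z; pi) = max{ max_k h(x^k), sum_k l(x^k, pi(x^k)) - z }
    on one rollout. costs[k] = l(x^k, pi(x^k)); violations[k] = h(x^k).
    """
    V_l = float(np.sum(costs))        # cost-value function V^l(x^0; pi)
    V_h = float(np.max(violations))   # constraint-value function V^h(x^0; pi)
    return max(V_h, V_l - z)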

Distributed Epigraph Form

The epigraph form can be solved in a distributed fashion by each agent. First, we define the total value function for each agent as \[ V_i(x^\tau, z; \pi) = \max\left\{V_i^h(o_i^\tau;\pi),\, V^l(x^\tau;\pi)-z\right\}. \] Then, we can rewrite the epigraph form as: \[ \min_{z} \; z \quad \text{s.t. } \min_{\pi_1,\dots,\pi_N} \max_i V_i(x^0, z; \pi) \leq 0. \] This decomposes the original problem into an unconstrained inner problem over the policy \(\pi\) and a constrained outer problem over \(z\). During offline training, we solve the inner problem: for each value of the parameter \(z\), find the optimal policy \(\pi(\cdot,z)\) that minimizes \(V(x^0,z;\pi)\). Note that the optimal policy of the inner problem depends on \(z\). During execution, we solve the outer problem online to obtain the smallest \(z\) that satisfies the constraint. Using this \(z\) in the \(z\)-conditioned policy \(\pi(\cdot,z)\) found in the inner problem gives the optimal policy for the overall epigraph form MASOCP (EF-MASOCP).
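The decomposition can be sketched in the same illustrative setting: each agent's total value combines its own worst-case violation with the shared cost-value, and taking the maximum over agents recovers the centralized total value because \(\max_i \max_k = \max_k \max_i\). The helper names below are assumptions, not the paper's code.

import numpy as np

def agent_total_value(costs, agent_violations, z, i):
    """V_i(x^0, z; pi) = max{ max_k h_i(o_i^k), V^l(x^0; pi) - z },
    with agent_violations[k, i] = h_i(o_i^k).
    """
    V_l = float(np.sum(costs))
    V_h_i = float(np.max(agent_violations[:, i]))
    return max(V_h_i, V_l - z)

def centralized_total_value(costs, agent_violations, z):
    """max_i V_i(x^0, z; pi), which equals V(x^0, z; pi)."""
    n_agents = agent_violations.shape[1]
    return max(agent_total_value(costs, agent_violations, z, i)
               for i in range(n_agents))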

Solving the outer problem during distributed execution

First, we provide the following theorem as the foundation of distributed execution:

Theorem 1: Assume that no two distinct values of \(z\) achieve the same cost. Then, the outer problem of EF-MASOCP is equivalent to the following: \[ \begin{aligned} z &= \max_i \; z_i, \\ z_i &= \min_{z'} \; z' \quad \text{s.t. } V_i(x^0, z'; \pi) \leq 0. \end{aligned} \]

This enables computing \(z\) during execution without the centralized \(V^l\). Specifically, each agent \(i\) solves the local problem for \(z_i\), which is a 1-dimensional optimization problem that can be solved efficiently with root-finding methods, and then communicates \(z_i\) to the other agents to obtain the maximum. Furthermore, we observe experimentally that the agents can achieve low cost while maintaining safety even if \(z_i\) is not communicated. Thus, we do not communicate \(z_i\) in our method.
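A minimal sketch of the per-agent outer problem at execution time, assuming each agent can query its learned, \(z\)-conditioned total value \(V_i(x^0, z; \pi(\cdot, z))\) and that this value is non-increasing in \(z\); the bracket and iteration count are illustrative choices rather than the paper's settings.

def solve_local_z(total_value_i, z_lo=0.0, z_hi=100.0, iters=30):
    """Bisection on agent i's 1-D outer problem: approximately the smallest z
    with V_i(x^0, z; pi(., z)) <= 0, assuming total_value_i is non-increasing
    in z and total_value_i(z_hi) <= 0.
    """
    for _ in range(iters):
        z_mid = 0.5 * (z_lo + z_hi)
        if total_value_i(z_mid) <= 0.0:
            z_hi = z_mid   # constraint satisfied: a smaller z may still work
        else:
            z_lo = z_mid   # constraint violated: need a larger z
    return z_hi

# With communication (Theorem 1), the team would then use z = max_i z_i.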

Def-MARL: Overall framework

Def-MARL algorithm structure.

Simulation Environments

MPE environments: MPETarget, MPESpread, MPEFormation, MPELine, MPECorridor, and MPEConnectSpread.

MuJoCo environments (MultiCheetah and CoupledCheetah), where contact dynamics are included.

Numerical Results

Comparison on \(N=3\) agents. Def-MARL achieves the best performance, being closest to the top-left corner.

Training stability. Def-MARL yields smoother training curves compared to the baselines.

Results on larger-scale MPE. Def-MARL maintains the best performance as the number of agents increases, unlike the other methods.

Related Work

This work is part of our line of work on designing safe and intelligent control policies for multi-agent systems. Other works in this line include:

DGPPO: How to extend CBFs elegantly for safe MARL.
GCBFv0: Generalizable distributed safe controllers for 1000+ agents.
GCBF+: Generalizable distributed safe controllers for 1000+ agents, an improved version of GCBFv0.

For a survey of the field of learning safe control for multi-robot systems, see this paper.

BibTeX

@inproceedings{zhang2025defmarl,
      title={Solving Multi-Agent Safe Optimal Control with Distributed Epigraph Form MARL},
      author={Zhang, Songyuan and So, Oswin and Black, Mitchell and Serlin, Zachary and Fan, Chuchu},
      booktitle={Proceedings of Robotics: Science and Systems},
      year={2025},
}