Discrete GCBF Proximal Policy Optimization for Multi-agent Safe Optimal Control

Massachusetts Institute of Technology

DGPPO: How to extend CBFs elegantly for safe MARL.

(Teaser figures: LidarSpread, LidarLine, VMASReverseTransport, VMASWheel.)

Abstract

Control policies that can achieve high task performance and satisfy safety constraints are desirable for any system, including multi-agent systems (MAS). One promising technique for ensuring the safety of MAS is distributed control barrier functions (CBFs). However, it is difficult to design distributed CBF-based policies for MAS that can tackle unknown discrete-time dynamics, partial observability, changing neighborhoods, and input constraints, especially when a distributed high-performance nominal policy that can achieve the task is unavailable. To tackle these challenges, we propose DGPPO, a new framework that simultaneously learns both a discrete graph CBF, which handles neighborhood changes and input constraints, and a distributed high-performance safe policy for MAS with unknown discrete-time dynamics. We empirically validate our claims on a suite of multi-agent tasks spanning three different simulation engines. The results suggest that, compared with existing methods, our DGPPO framework obtains policies that achieve high task performance (matching baselines that ignore the safety constraints) and high safety rates (matching the most conservative baselines), with a constant set of hyperparameters across all environments.

Problem setting


We consider the multi-agent constrained optimal control problem with discrete-time, unknown dynamics, partial observability, input constraints, and without a known performant nominal policy. Given \(N\) agents, we aim to design distributed policies \(\mu_1, \dots, \mu_N\) such that:

The task is accomplished: \(\min_{\mu_1,\dots,\mu_N} \sum_{k=0}^\infty l(\mathbf{x}^k, \boldsymbol{\mu}(\mathbf{x}^k))\),
while following the unknown dynamics: \(\mathbf{x}^{k+1} = f(\mathbf{x}^k, \boldsymbol{\mu}(\mathbf{x}^k))\),
and the agents stay safe: \(h_i^{(m)}(o_i^k)\leq 0\) with \(o_i^k=O_i(\mathbf{x}^k)\), for all time steps \(k\), agents \(i\), and constraints \(m\).
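As a rough picture of this setup, the sketch below (plain Python; placeholder names such as `unknown_dynamics`, `local_observation`, and `constraints` are illustrative and not from the paper) shows how distributed policies act on local observations only, while the cost accumulates and the per-agent constraints are checked at every step.

```python
import numpy as np

# Minimal sketch of the problem structure, under assumed placeholder dynamics/observations.
N, T, M = 3, 50, 2  # agents, horizon, constraints per agent (placeholder sizes)
rng = np.random.default_rng(0)

def unknown_dynamics(x, u):        # stands in for f(x, u); not known to the learner
    return x + 0.1 * u

def local_observation(x, i):       # stands in for O_i(x): agent i's partial view of the state
    return x[i]

def policy(o_i, i):                # stands in for mu_i(o_i); here just a random placeholder
    return rng.uniform(-1.0, 1.0, size=o_i.shape)

def constraints(o_i):              # stands in for h_i^(m)(o_i); safe iff every entry <= 0
    return np.array([np.linalg.norm(o_i) - 5.0] * M)

x = rng.normal(size=(N, 2))        # joint state, one row per agent
total_cost, safe = 0.0, True
for k in range(T):
    obs = [local_observation(x, i) for i in range(N)]
    u = np.stack([policy(obs[i], i) for i in range(N)])
    total_cost += float(np.sum(u ** 2))                      # placeholder for l(x^k, mu(x^k))
    safe &= all(np.all(constraints(o) <= 0) for o in obs)    # safety must hold for all i, m, k
    x = unknown_dynamics(x, u)
print(total_cost, safe)
```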

Safety: Discrete policy GCBF (DGCBF)

(Figure: GCBF.)

DGCBF: safety guarantee for unknown, discrete-time, partially observable multi-agent systems.

The DGCBF can be learned using policy evaluation with deterministic rollouts: the constraint-value function \(V^{h^{(m)},\boldsymbol\mu}\), trained on rollouts of the deterministic policy \(\boldsymbol\mu\), serves as the DGCBF.
Learn more about GCBF: Check our T-RO paper GCBF+!
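Concretely (in the notation of the violation term \(\hat{C}^{(m)}_{\theta,i}\) introduced below), the learned constraint-value function is used as a DGCBF through the discrete-time decrease condition

\(V^{h^{(m)},\boldsymbol\mu}(o_i^+) - V^{h^{(m)},\boldsymbol\mu}(o_i) + \alpha\!\left(V^{h^{(m)},\boldsymbol\mu}(o_i)\right) \leq 0,\)

where \(o_i^+\) denotes the next observation under \(\boldsymbol\mu\) and \(\alpha\) is a suitable class-\(\mathcal{K}\) function. This is the discrete-time analogue of the standard CBF decrease condition used to certify forward invariance of the safe set.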

DGPPO: Elegantly combine DGCBF with MARL

(Figure: DGPPO algorithm structure.)

1. We perform a \(T\)-step stochastic rollout with the policy \(\boldsymbol{\pi}_\theta\). However, unlike MAPPO, we additionally perform a \(T\)-step deterministic rollout using a deterministic version of \(\boldsymbol{\pi}_\theta\) (by taking the mode), which we denote \(\boldsymbol\mu\), to learn the DGCBF.
2. We update the value functions via regression on the corresponding targets computed using GAE, where the targets for the cost-value function \(V^l\) use the stochastic rollout and the targets for the constraint-value functions \(V^{h^{(m)},\boldsymbol\mu}\) use the deterministic rollout.
3. We update the policy \(\boldsymbol\pi_\theta\) by replacing the \(Q\)-function with its GAE estimate, then combining the CRPO-style decoupled policy loss with the PPO clipped loss, using the learned constraint-value functions \(V^{h^{(m)},\boldsymbol\mu}\) as the DGCBFs.

\(\hat{C}^{(m)}_{\theta,i} := \max\left\{ 0,\, V^{h^{(m)},\boldsymbol\mu}(o_i^+) - V^{h^{(m)},\boldsymbol\mu}(o_i) + \alpha(V^{h^{(m)},\boldsymbol\mu}(o_i)) \right\}\)
\(\tilde{A}_i := A^{\text{GAE}} \unicode{x1D7D9}_{\max_m \hat{C}_{\theta,i}^{(m)} \leq 0} + \nu \max_m \hat{C}_{\theta ,i}^{(m)} \unicode{x1D7D9}_{\max_m \hat{C}_{\theta ,i}^{(m)} > 0}\)
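The sketch below gives a rough, per-agent illustration of these two quantities in plain NumPy; it is not the released implementation, and names such as `constraint_violation`, `modified_advantage`, `alpha`, and `nu` are placeholders.

```python
import numpy as np

def constraint_violation(v_h, v_h_next, alpha=lambda v: 0.1 * v):
    """C_hat^(m): hinge on the DGCBF decrease condition, one entry per constraint m.

    v_h, v_h_next: arrays of shape (M,) holding V^{h^(m),mu}(o_i) and V^{h^(m),mu}(o_i^+).
    alpha: a placeholder class-K function.
    """
    return np.maximum(0.0, v_h_next - v_h + alpha(v_h))

def modified_advantage(adv_gae, c_hat, nu=1.0):
    """A_tilde: use the task GAE advantage if no constraint is violated;
    otherwise switch to the worst-case violation scaled by nu, per the expression above."""
    worst = np.max(c_hat)
    return adv_gae if worst <= 0.0 else nu * worst

# Toy usage with made-up numbers:
c_hat = constraint_violation(v_h=np.array([-0.3, -0.1]), v_h_next=np.array([-0.2, 0.05]))
print(modified_advantage(adv_gae=0.7, c_hat=c_hat))
```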

Simulation Environments

(Figures: LidarNav, LidarSpread, LidarLine, LidarBicycle.)

Lidar environments, where agents use LiDAR to detect obstacles.

(Figures: Transport, VMASWheel, VMASTransport2.)

MuJoCo and VMAS environments, where contact dynamics are included.

Numerical Results

(Figure: main results.)

Comparison with \(N=3\) agents. DGPPO performs best, lying closest to the top-left corner.

(Figure: training stability.)

Training stability. DGPPO yields smoother training curves compared to the baselines.

(Figure: large-scale training.)

Scaling to \(N=5, 7\). Unlike the other methods, DGPPO maintains similar performance as the number of agents increases.

Related Work

This work builds on our previous work GCBF+, but eliminates the requirement of a performant nominal policy and of knowledge of the dynamics. For a survey of learning safe control for multi-robot systems, see this paper.