We consider the multi-agent constrained optimal control problem with unknown discrete-time dynamics, partial observability, and input constraints, and without a known performant nominal policy. Given \(N\) agents, we aim to design distributed policies \(\mu_1, \dots, \mu_N\) such that:
the task is accomplished: \(\min_{\mu_1,\dots,\mu_N} \sum_{k=0}^\infty l(\mathbf{x}^k, \boldsymbol{\mu}(\mathbf{x}^k))\),
subject to the unknown dynamics: \(\mathbf{x}^{k+1} = f(\mathbf{x}^k, \boldsymbol{\mu}(\mathbf{x}^k))\),
and the agents remain safe: \(h_i^{(m)}(o_i^k)\leq 0, \quad o_i^k=O_i(\mathbf{x}^k)\) for all agents \(i\), constraints \(m\), and time steps \(k\).
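To make the formulation concrete, here is a minimal sketch of the setup with toy stand-ins: the dynamics `f`, observation maps `O`, constraint `h`, stage cost `l`, and policy `mu` below are all hypothetical placeholders for illustration, not the paper's environments or method.

```python
import numpy as np

N, T = 3, 50                       # agents, finite horizon (truncating the infinite sum)
rng = np.random.default_rng(0)

def f(x, u):
    # Toy single-integrator dynamics (unknown to the algorithm in the paper).
    return x + u

def O(x, i):
    # Agent i's partial observation: own position plus relative positions.
    return np.concatenate([x[i], (x - x[i]).ravel()])

def h(o):
    # Toy safety constraint h(o_i) <= 0: agent stays inside the unit ball.
    return np.linalg.norm(o[:2]) - 1.0

def l(x, u):
    # Stage cost: distance of agents to their goals (the origin) plus control effort.
    return float(np.sum(x**2) + np.sum(u**2))

def mu(x, i):
    # Hypothetical distributed policy: each agent acts only on its own observation.
    return -0.1 * O(x, i)[:2]      # drift toward the origin

x = rng.uniform(-0.5, 0.5, size=(N, 2))
cost, safe = 0.0, True
for k in range(T):
    u = np.stack([mu(x, i) for i in range(N)])
    cost += l(x, u)
    safe = safe and all(h(O(x, i)) <= 0 for i in range(N))
    x = f(x, u)
print(f"cost={cost:.2f}, safe={safe}")
```

The rollout accumulates the cost objective while checking the per-agent constraints at every step, mirroring the three conditions above.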
DGCBF: safety guarantee for unknown, discrete-time, partially observable multi-agent systems.
It can be learned using policy evaluation with deterministic rollouts:
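As an illustration of policy evaluation with deterministic rollouts, the sketch below defines a constraint value as the worst constraint violation along a rollout under the deterministic policy \(\boldsymbol\mu\); this particular definition and the toy 1D system are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np

def constraint_value(x0, mu, f, h, O, T=50):
    """Deterministic policy evaluation: roll out mu from x0 and return the
    worst (largest) constraint value encountered; <= 0 iff the T-step
    rollout stays safe."""
    x, worst = x0, -np.inf
    for _ in range(T):
        worst = max(worst, h(O(x)))
        x = f(x, mu(x))
    return worst

# Toy 1D system: contracting dynamics, constraint |x| <= 1 (all hypothetical).
f = lambda x, u: x + u
mu = lambda x: -0.5 * x            # deterministic policy (the mode)
h = lambda o: abs(o) - 1.0
O = lambda x: x

print(constraint_value(0.8, mu, f, h, O))   # negative: rollout stays safe
```

Because the rollout is deterministic, no expectation over policy noise is needed, so the evaluated value can certify safety of \(\boldsymbol\mu\) itself rather than of the stochastic exploration policy.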
Learn more about GCBF: Check our T-RO paper GCBF+!
1. We perform a \(T\)-step stochastic rollout with the policy \(\boldsymbol{\pi}_\theta\). However, unlike MAPPO, we additionally perform a \(T\)-step deterministic rollout using a deterministic version of \(\boldsymbol{\pi}_\theta\) (by taking the mode), which we denote \(\boldsymbol\mu\), to learn the DGCBF.
2. We update the value functions via regression on the corresponding targets computed using GAE, where the targets for the cost-value function \(V^l\) use the stochastic rollout and the targets for the constraint-value functions \(V^{h^{(m)},\boldsymbol\mu}\) use the deterministic rollout.
3. We update the policy \(\boldsymbol\pi_\theta\) by replacing the \(Q\)-function with its GAE estimate, then combining the CRPO-style decoupled policy loss with the PPO clipped loss, using the learned constraint-value functions \(V^{h^{(m)},\boldsymbol\mu}\) as the DGCBFs.
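Step 3 can be sketched as follows. The per-sample CRPO-style switch and the specific shapes, threshold, and toy batch below are illustrative assumptions, not the paper's exact loss.

```python
import numpy as np

def ppo_clip_term(ratio, advantage, eps=0.2):
    # Standard PPO clipped surrogate (written in maximize form).
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - eps, 1 + eps) * advantage)

def dgppo_policy_loss(ratio, adv_cost, adv_constr, V_h, tol=0.0):
    """CRPO-style decoupled switch per sample: where the learned
    constraint value (the DGCBF) signals a violation, descend on the
    constraint advantage; elsewhere, descend on the cost advantage."""
    violating = V_h > tol                       # DGCBF condition broken
    # Costs and constraints are minimized, so negate the advantages to
    # reuse the maximize-form clipped surrogate.
    obj = np.where(violating,
                   ppo_clip_term(ratio, -adv_constr),
                   ppo_clip_term(ratio, -adv_cost))
    return -np.mean(obj)                        # loss to minimize

# Toy batch of importance ratios, advantages, and constraint values.
rng = np.random.default_rng(0)
ratio = rng.uniform(0.8, 1.2, 64)
adv_cost = rng.normal(size=64)
adv_constr = rng.normal(size=64)
V_h = rng.normal(-0.5, 0.5, 64)                # mostly safe samples
loss = dgppo_policy_loss(ratio, adv_cost, adv_constr, V_h)
print(loss)
```

The decoupling means no Lagrange multiplier must be tuned: each sample optimizes either the task objective or the safety objective, depending on the sign of its constraint value.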
Lidar environments, where agents use LiDAR to detect obstacles.
MuJoCo and VMAS environments, where contact dynamics are included.
Comparison on \(N=3\) agents. DGPPO has the best performance by being closest to the top left corner.
Training stability. DGPPO yields smoother training curves compared to the baselines.
Scaling to \(N=5, 7\). Unlike other methods, DGPPO performs similarly with more agents.
This work builds on our previous work GCBF+, eliminating GCBF+'s requirements of a performant nominal policy and knowledge of the dynamics. For a survey of the field of learning safe control for multi-robot systems, see this paper.