Solving Stabilize-Avoid Optimal Control via Epigraph Form and Deep Reinforcement Learning

Oswin So, Chuchu Fan
Massachusetts Institute of Technology

Safe and stable controller synthesis for arbitrary dynamics

Stabilize-Avoid with Constrained Optimal Control

Tackling safety using constraints over an infinite horizon allows us to identify the invariant region (shades of blue) from which we can maintain constraint satisfaction.

EFPPO synthesizes controllers that are safe and stable by solving an infinite-horizon constrained optimization problem.
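As a concrete sketch (using generic notation that may differ from the paper's: \(l\) is a stabilization cost, \(h \le 0\) encodes the avoid constraint, \(f\) the dynamics, and \(\gamma\) a discount factor), the stabilize-avoid problem can be written as an infinite-horizon constrained OCP:

\[
\begin{aligned}
\min_{\pi} \quad & \sum_{k=0}^{\infty} \gamma^{k}\, l\bigl(x_k, \pi(x_k)\bigr) \\
\text{s.t.} \quad & h(x_k) \le 0 \quad \forall k \ge 0, \\
& x_{k+1} = f\bigl(x_k, \pi(x_k)\bigr), \quad x_0 = x.
\end{aligned}
\]

The states from which some policy can keep \(h(x_k) \le 0\) for all time form the invariant region shown above.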

Better Stability with Epigraph Form

EFPPO uses the epigraph form to solve the constrained optimization problem, improving optimization stability over classical Lagrangian duality methods. For Lagrangian duality-based (CMDP) methods, the gradient of the objective scales linearly with \(\lambda\), which can cause optimization problems when \(\lambda\) grows large; in contrast, the gradient of the epigraph-form objective does not scale with the auxiliary variable \(z\).
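Concretely, the generic epigraph transformation (shown here in notation that may differ from the paper's) introduces an auxiliary variable \(z\) that upper-bounds the objective \(J(\pi)\) and folds the constraint \(C(\pi) \le 0\) into a single max:

\[
\min_{\pi} J(\pi) \ \ \text{s.t.}\ \ C(\pi) \le 0
\quad\Longleftrightarrow\quad
\min_{z}\ z \ \ \text{s.t.}\ \ \min_{\pi}\, \max\bigl(J(\pi) - z,\ C(\pi)\bigr) \le 0 .
\]

Because \(z\) only shifts the cost term inside the max, the policy gradient of the inner objective is either \(\nabla J\) or \(\nabla C\); its magnitude is never multiplied by \(z\), unlike the Lagrangian gradient \(\nabla J + \lambda \nabla C\).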

Varying \(\lambda\) and \(z\) for the same cost and constraint functions at a given point, the gradient norm of the objective (right) grows for Lagrangian duality but not for the epigraph form.
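The same effect can be checked numerically. Below is a minimal toy illustration, not the paper's implementation: `l` and `h` are arbitrary stand-ins for a cost and a constraint evaluated at a point parameterized by `theta`, and JAX is used only for automatic differentiation.

```python
import jax
import jax.numpy as jnp

# Toy cost and constraint at a point parameterized by theta (illustrative only).
def l(theta):            # cost
    return jnp.sum(theta ** 2)

def h(theta):            # constraint; h <= 0 means the constraint is satisfied
    return jnp.sum(jnp.sin(theta)) - 0.5

def lagrangian(theta, lam):
    # Lagrangian objective: its gradient is grad(l) + lam * grad(h),
    # so the gradient norm grows linearly with lam.
    return l(theta) + lam * h(theta)

def epigraph(theta, z):
    # Epigraph-form inner objective: its gradient is either grad(l) or grad(h),
    # so the gradient norm does not depend on the magnitude of z.
    return jnp.maximum(l(theta) - z, h(theta))

theta = jnp.array([0.3, -0.7])

for lam in (1.0, 10.0, 100.0):
    g = jax.grad(lagrangian)(theta, lam)
    print(f"lambda = {lam:6.1f}   |grad| = {jnp.linalg.norm(g):.3f}")

for z in (1.0, 10.0, 100.0):
    g = jax.grad(epigraph)(theta, z)
    print(f"z      = {z:6.1f}   |grad| = {jnp.linalg.norm(g):.3f}")
```

As \(\lambda\) increases, the printed gradient norm of the Lagrangian objective grows roughly linearly, while the epigraph-form gradient norm stays bounded for any \(z\).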

Simulation Experiments

Hopper

Stabilize
Torso is stable in the green box
Avoid
Torso touches the red box

F16 Fighter Jet

Stabilize
Stabilize to the green region near the floor
Avoid
Avoid hitting the floor, ceiling or walls. Avoid extreme angles of attack.

Abstract

Tasks for autonomous robotic systems commonly require stabilization to a desired region while maintaining safety specifications. However, solving this multi-objective problem is challenging when the dynamics are nonlinear and high-dimensional, as traditional methods do not scale well and are often limited to specific problem structures.

To address this issue, we propose a novel approach to solve the stabilize-avoid problem via the solution of an infinite-horizon constrained optimal control problem (OCP). We transform the constrained OCP into epigraph form and obtain a two-stage optimization problem that optimizes over the policy in the inner problem and over an auxiliary variable in the outer problem. We then propose a new method for this formulation that combines an on-policy deep reinforcement learning algorithm with neural network regression. Compared to more traditional methods, our method yields better stability during training, avoids the instabilities caused by saddle-point finding, and places no specific requirements on the problem structure. We validate our approach on different benchmark tasks, ranging from low-dimensional toy examples to an F16 fighter jet with a 17-dimensional state space. Simulation results show that our approach consistently yields controllers that match or exceed the safety of existing methods while providing a ten-fold increase in stability performance from larger regions of attraction.
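For intuition about the two-stage structure, here is a self-contained toy sketch; it is not the paper's algorithm: the inner problem is solved by grid search as a stand-in for the on-policy reinforcement learning update, the outer problem by bisection as a stand-in for the neural network regression, and the scalar `theta` stands in for the policy.

```python
import jax.numpy as jnp

# Toy constrained problem: min_theta cost(theta)  s.t.  constraint(theta) <= 0.
def cost(theta):
    return (theta - 2.0) ** 2

def constraint(theta):            # feasible iff constraint(theta) <= 0
    return theta - 1.0

thetas = jnp.linspace(-5.0, 5.0, 2001)   # crude stand-in for policy search

def inner_value(z):
    """Inner problem for fixed z: min_theta max(cost(theta) - z, constraint(theta))."""
    return jnp.min(jnp.maximum(cost(thetas) - z, constraint(thetas)))

# Outer problem: the smallest z for which the inner problem value is <= 0.
lo, hi = 0.0, 10.0
for _ in range(40):
    mid = 0.5 * (lo + hi)
    lo, hi = (lo, mid) if inner_value(mid) <= 0.0 else (mid, hi)

print("epigraph solution z* ~", hi)   # ~1.0, the constrained optimum cost(1.0)
```

The outer bisection returns the smallest \(z\) that makes the inner max-objective non-positive, which coincides with the optimal value of the original constrained problem.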