EFPPO synthesizes controllers that are safe and stable by solving an infinite-horizon constrained optimization problem.
EFPPO uses the epigraph form to solve the constrained optimization problem, which improves optimization stability over classical Lagrangian duality methods. For Lagrangian duality-based (CMDP) methods, the gradient scales linearly with the multiplier \(\lambda\), which can cause optimization problems when \(\lambda\) grows large. In the epigraph form, by contrast, the gradient does not scale with the auxiliary variable \(z\).
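The scaling difference can be illustrated on a toy scalar problem. The sketch below is not EFPPO's objective: the cost `cost`, the constraint `constraint`, and the inner objective \(\max(J(\theta) - z,\ h(\theta))\) are assumptions chosen for illustration, with a single parameter `theta` standing in for the policy.

```python
# Toy problem (hypothetical): cost J(theta) = (theta - 2)^2,
# constraint h(theta) = theta - 1 <= 0.

def cost(theta):
    return (theta - 2.0) ** 2

def grad_cost(theta):
    return 2.0 * (theta - 2.0)

def constraint(theta):
    return theta - 1.0

def grad_constraint(theta):
    return 1.0

def lagrangian_grad(theta, lam):
    """Gradient w.r.t. theta of J(theta) + lam * h(theta)."""
    return grad_cost(theta) + lam * grad_constraint(theta)

def epigraph_grad(theta, z):
    """Subgradient w.r.t. theta of max(J(theta) - z, h(theta)).

    z only selects which term is active; it never multiplies the gradient.
    """
    if cost(theta) - z >= constraint(theta):
        return grad_cost(theta)
    return grad_constraint(theta)

if __name__ == "__main__":
    theta = 1.5
    for value in (1.0, 10.0, 1000.0):
        print(value, lagrangian_grad(theta, value), epigraph_grad(theta, value))
```

In this toy setting the Lagrangian gradient grows with `lam`, whereas the epigraph subgradient is always either `grad_cost(theta)` or `grad_constraint(theta)`, so its magnitude stays bounded regardless of `z`.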
- Torso is stable in the green box
- Torso touches the red box
F16 Fighter Jet
- Stabilize to the green region near the floor
- Avoid hitting the floor, ceiling or walls. Avoid extreme angles of attack.
Tasks for autonomous robotic systems commonly require stabilization to a desired region while maintaining safety specifications. However, solving this multi-objective problem is challenging when the dynamics are nonlinear and high-dimensional, as traditional methods do not scale well and are often limited to specific problem structures.
To address this issue, we propose a novel approach to solve the stabilize-avoid problem via the solution of an infinite-horizon constrained optimal control problem (OCP). We transform the constrained OCP into epigraph form and obtain a two-stage optimization problem that optimizes over the policy in the inner problem and over an auxiliary variable in the outer problem. We then propose a new method for this formulation that combines an on-policy deep reinforcement learning algorithm with neural network regression. Our method yields better stability during training, avoids instabilities caused by saddle-point finding, and is not restricted to specific requirements on the problem structure compared to more traditional methods. We validate our approach on different benchmark tasks, ranging from low-dimensional toy examples to an F16 fighter jet with a 17-dimensional state space. Simulation results show that our approach consistently yields controllers that match or exceed the safety of existing methods while providing ten-fold increases in stability performance from larger regions of attraction.
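As a rough sketch of the epigraph transformation described above, consider a generic constrained problem with cost \(J(\pi)\) and constraint \(h(\pi) \le 0\). These symbols are placeholders assumed here for illustration, not the paper's exact notation, and minima are assumed to be attained.

\[
\min_{\pi} \; J(\pi) \quad \text{s.t.} \quad h(\pi) \le 0
\qquad \Longleftrightarrow \qquad
\min_{z} \; z \quad \text{s.t.} \quad \min_{\pi} \, \max\bigl\{ J(\pi) - z,\; h(\pi) \bigr\} \le 0 .
\]

The inner problem optimizes over the policy \(\pi\) for a fixed \(z\), and the outer problem searches over the scalar auxiliary variable \(z\) for the smallest value at which the inner optimum is nonpositive.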