Because Lagrangian methods often suffer from unstable training, we instead use the epigraph form of the constrained optimization problem to improve training stability. First, we define the cost-value function \(V^l\) using the standard optimal control formulation: \[ V^l(x^\tau; \pi) = \sum_{k\geq\tau} l(x^k, \pi(x^k)). \] We also define the constraint-value function \(V^{h}\) as the maximum constraint violation along the trajectory: \[ V^{h}(x^\tau; \pi) = \max_{k\geq\tau}h(x^k) = \max_{k\geq\tau}\max_i h_i(o_i^k) = \max_i\max_{k\geq\tau} h_i(o_i^k) = \max_i V^{h}_i(o_i^\tau; \pi), \] where the two maxima can be exchanged and \(V^{h}_i(o_i^\tau; \pi) = \max_{k\geq\tau} h_i(o_i^k)\) is the constraint-value function of agent \(i\). With these definitions, the MASOCP can be rewritten as: \[ \min_{\pi_1,\dots,\pi_N} V^l(x^0; \pi) \quad \text{s.t. } V^{h}(x^0; \pi) \leq 0. \] Its epigraph form is: \[ \min_z \; z \quad \text{s.t. } \min_{\pi_1,\dots,\pi_N} V(x^0, z; \pi) \leq 0, \] where \(V(x^\tau, z; \pi) = \max\left\{\max_i V_i^h(o_i^\tau;\pi),\,V^l(x^\tau;\pi)-z\right\}\) is evaluated at \(\tau = 0\) in the constraint.
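
To make the definitions concrete, the following minimal sketch evaluates \(V^l\), \(V^h\), and the epigraph total value \(V\) on a single rollout of a fixed policy. It is illustrative only: the horizon is truncated to three steps (the definitions sum and maximize over all \(k \geq \tau\)), and the function names `cost_value`, `constraint_value`, and `total_value`, as well as the numerical values, are hypothetical and not taken from the original method.

```python
import numpy as np

def cost_value(costs):
    """V^l: sum of stage costs l(x^k, pi(x^k)) along the (truncated) rollout."""
    return float(np.sum(costs))

def constraint_value(h):
    """V^h: worst constraint value over time steps and agents.

    h has shape (T, N): h[k, i] = h_i(o_i^k) for step k and agent i.
    """
    return float(np.max(h))

def total_value(costs, h, z):
    """Epigraph total value V = max{ V^h, V^l - z } for a fixed policy."""
    return max(constraint_value(h), cost_value(costs) - z)

# Hypothetical rollout: 3 time steps, 2 agents, all constraints satisfied (h < 0).
costs = np.array([1.0, 0.5, 0.25])
h = np.array([[-0.3, -0.1],
              [-0.2, -0.4],
              [-0.5, -0.05]])

v_l = cost_value(costs)
v_h = constraint_value(h)

# Swapping the order of the maxima: per-agent values V^h_i, then max over agents.
per_agent = h.max(axis=0)
assert np.isclose(per_agent.max(), v_h)

# For this fixed, safe policy, V(z) <= 0 holds exactly when z >= V^l.
feasible_at_cost = total_value(costs, h, z=v_l) <= 0.0
feasible_below_cost = total_value(costs, h, z=v_l - 0.75) <= 0.0
```

For a fixed policy with \(V^h \leq 0\), the smallest \(z\) satisfying \(V \leq 0\) is \(z = V^l\), which is why minimizing \(z\) in the epigraph form recovers the minimal cost among safe policies. If \(V^h > 0\), no value of \(z\) makes the constraint feasible for that policy.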