ReFORM: Reflected Flows for On-support Offline RL via Noise Manipulation

1Massachusetts Institute of Technology     2Boston University     3MIT Lincoln Laboratory

ReFORM: How to maintain support constraints in offline RL without any statistical distance regularization.

You no longer need to tune regularization weights!

Overview

  • ReFORM is an offline RL algorithm that uses flow-based policies to enforce the support constraint by construction, avoiding out-of-distribution errors without restricting policy improvement.
  • We propose applying a reflected flow to generate constrained multimodal noise for the BC flow policy, mitigating OOD errors while preserving the policy's multimodality.
  • Extensive experiments on 40 challenging tasks with datasets of varying quality demonstrate that, with a single fixed set of hyperparameters, ReFORM dominates all flow-policy baselines on the performance profile curve, even when those baselines use the best hand-tuned hyperparameters.

Challenges

  • The out-of-distribution (OOD) problem is a common challenge in offline RL: the learned policy may generate actions outside the support of the behavior policy, leading to erroneous value estimates and poor performance.
  • Prior works address the OOD issue by regularizing a statistical distance between the learned policy and the behavior policy (the dataset). However, this approach restricts policy improvement and may not fully prevent OOD actions; it also often requires careful hyperparameter tuning for each task and dataset.
  • While avoiding OOD errors, it is also important to maintain the expressiveness of the policy.

Method

[Figure: ReFORM algorithm structure]
  • ReFORM starts by learning a BC flow policy (gray arrows), which transforms a simple bounded source distribution \(q_\mathrm{BC}=\mathcal{U}(\mathcal B_l^d)\) (the uniform distribution on the \(d\)-dimensional ball of radius \(l\)) into a target distribution \(p_\mathrm{BC}\) that matches the dataset \(\mathcal D\). Because the source distribution is bounded, the BC policy captures the support of the behavior policy, which allows us to enforce the support constraint in offline RL.
  • At the same time, ReFORM learns a reflected flow-based noise generator (blue arrows) that generates a manipulated source distribution \(\tilde q_\mathrm{BC}\) for the BC policy, such that the manipulated target distribution \(\tilde p_\mathrm{BC}\) maximizes the \(Q\) value while staying inside the support (red) of the BC policy. This avoids OOD errors while maintaining policy expressiveness.
[Figure: actions generated by BC, DSRL, IFQL, FQL(S), FQL(M), FQL(L), and ReFORM]
  • Compared with the baselines, only ReFORM generates multimodal actions with maximal Q values while staying within the support (red) of the BC policy.
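The bounded-source and reflection ideas above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the velocity field is a stand-in for a learned network, the function names are our own, and we use plain Euler integration with a reflection step at the ball boundary so that generated noise stays inside \(\mathrm{supp}(q_\mathrm{BC})\).

```python
import numpy as np

def sample_uniform_ball(n, d, l, rng):
    """Sample n points uniformly from the d-dimensional ball of radius l."""
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)          # uniform direction on the sphere
    r = l * rng.uniform(0.0, 1.0, (n, 1)) ** (1.0 / d)     # radius with density proportional to r^(d-1)
    return r * x

def reflect_into_ball(x, l):
    """Reflect points that stepped past radius l back across the boundary
    (valid for small steps, i.e. norms below 2l)."""
    norm = np.linalg.norm(x, axis=1, keepdims=True)
    safe = np.maximum(norm, 1e-12)                         # guard against division by zero
    reflected = x / safe * (2.0 * l - safe)                # r -> 2l - r along the same direction
    return np.where(norm > l, reflected, x)

def integrate_reflected_flow(z, velocity, l, steps=20):
    """Euler-integrate a flow ODE from t=0 to t=1, reflecting at the ball
    boundary after each step so samples remain inside the support."""
    dt = 1.0 / steps
    for k in range(steps):
        z = z + dt * velocity(z, k * dt)
        z = reflect_into_ball(z, l)
    return z
```

With an outward-pushing toy velocity field, every sample still ends inside the radius-\(l\) ball after integration, which is exactly the property the reflected noise generator relies on.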

Experiments

Tasks

[Figures: antmaze-large, cube-single, cube-double, scene]

  • We evaluate ReFORM and the baselines on 40 tasks from the OGBench offline RL benchmark, spanning four environments that include both locomotion and manipulation tasks.
[Figures: clean and noisy datasets for antmaze]
  • We use two kinds of datasets: clean and noisy. The clean dataset consists of trajectories generated by an expert policy, while the noisy dataset consists of trajectories generated by a highly suboptimal, noisy policy.
  • We define the normalized score for each task as the return normalized by the minimum and maximum returns across all algorithms.

Results

[Figure: performance profile]
  • For a given normalized score \(\tau\) (x-axis), the performance profile shows the probability that a given method achieves a score \(\geq\tau\). On the clean dataset, ReFORM achieves greater scores with higher probabilities than all other baselines. The same is true on the noisy dataset except for a small set of normalized scores around 0.9 where ReFORM and FQL(S) have similar probabilities within the statistical margins. Note that ReFORM uses a constant set of hyperparameters for all tasks, while all baselines use hand-tuned hyperparameters for each task.
[Figure: normalized-score bar plots for antmaze-large, cube-single, cube-double, and scene (clean dataset, top row; noisy dataset, bottom row)]
  • We present bar plots of the interquartile mean (IQM) of the normalized scores for each algorithm in each environment with the clean dataset (top row) and the noisy dataset (bottom row). ReFORM consistently achieves the best or comparable results in all environments with both datasets, using a constant set of hyperparameters. DSRL and FQL(M) generally perform second and third best with the clean dataset, but their performance drops when the noisy dataset is used.
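Both evaluation statistics used above can be computed in a few lines. This is a minimal sketch with our own function names; in particular, the IQM here is the simplified variant that drops the lowest and highest quartile of runs before averaging, which matches the common definition for run counts divisible by four:

```python
import numpy as np

def performance_profile(scores, taus):
    """For each threshold tau, return the fraction of runs whose
    normalized score is >= tau (the y-axis of a performance profile)."""
    scores = np.asarray(scores, dtype=float)
    return np.array([(scores >= tau).mean() for tau in taus])

def interquartile_mean(scores):
    """Mean of the middle 50% of scores (IQM): more robust than the mean
    to outlier runs, more statistically efficient than the median."""
    s = np.sort(np.asarray(scores, dtype=float))
    q = len(s) // 4
    return s[q:len(s) - q].mean()
```

A performance profile is read left to right: a method whose curve lies above another's for every \(\tau\) dominates it, which is the sense in which ReFORM dominates the baselines.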

Abstract

Offline reinforcement learning (RL) aims to learn the optimal policy from a fixed behavior policy dataset without additional environment interactions. One common challenge in this setting is the out-of-distribution (OOD) error, which occurs when the policy leaves the training distribution. Prior methods penalize a statistical distance term to keep the policy close to the behavior policy, but this constrains policy improvement and may not completely prevent OOD actions. Another challenge is that the optimal policy distribution can be multimodal and difficult to represent. Recent works apply diffusion or flow policies to address this problem, but it is unclear how to avoid OOD errors while retaining policy expressiveness. We propose ReFORM, an offline RL method based on flow policies that enforces the less restrictive support constraint by construction. ReFORM learns a behavior cloning (BC) flow policy with a bounded source distribution to capture the support of the action distribution, and then optimizes a reflected flow that generates bounded noise for the BC flow, maximizing performance while remaining on the support. Across 40 challenging tasks from the OGBench benchmark with datasets of varying quality, and using a constant set of hyperparameters for all tasks, ReFORM dominates all baselines with hand-tuned hyperparameters on the performance profile curves.

BibTeX

@inproceedings{zhang2026reform,
      title={Re{FORM}: Reflected Flows for On-support Offline {RL} via Noise Manipulation},
      author={Zhang, Songyuan and So, Oswin and Ahmad, H M Sabbir and Yu, Eric Yang and Cleaveland, Matthew and Black, Mitchell and Fan, Chuchu},
      booktitle={The Fourteenth International Conference on Learning Representations},
      year={2026},
}