ReFORM: Reflected Flows for On-support Offline RL via Noise Manipulation

1Massachusetts Institute of Technology     2Boston University     3MIT Lincoln Laboratory

ReFORM: How to maintain support constraints in offline RL without any statistical distance regularization.

You no longer need to tune regularization weights!

Overview

  • ReFORM is an offline RL algorithm that uses flow-based policies to enforce the support constraint by construction, avoiding out-of-distribution errors without restricting policy improvement.
  • We propose applying a reflected flow to generate constrained multimodal noise for the BC flow policy, mitigating OOD errors while preserving the policy's multimodality.
  • Extensive experiments on 40 challenging tasks with datasets of varying quality demonstrate that, with a single fixed set of hyperparameters, ReFORM dominates all flow-policy baselines on the performance profile curve, even when those baselines use the best hand-tuned hyperparameters.

Challenges

  • The out-of-distribution (OOD) problem is a common challenge in offline RL: the learned policy may generate actions outside the support of the behavior policy, leading to erroneous value estimates and poor performance.
  • Prior works address the OOD issue by regularizing a statistical distance between the learned policy and the behavior policy (the dataset). However, this approach restricts policy improvement and may not fully prevent OOD actions; it also often requires careful hyperparameter tuning for each task and dataset.
  • While avoiding OOD errors, it is also important to maintain the expressiveness of the policy.

Method

[Figure: ReFORM algorithm structure]
  • ReFORM starts by learning a BC flow policy (gray arrows), which transforms a simple bounded source distribution \(q_\mathrm{BC}=\mathcal{U}(\mathcal B_l^d)\) (the uniform distribution on the \(d\)-dimensional ball of radius \(l\)) into a target distribution \(p_\mathrm{BC}\) that matches the dataset \(\mathcal D\). Because the source distribution is bounded, the BC policy captures the support of the behavior policy, which allows us to enforce the support constraint in offline RL.
  • At the same time, ReFORM learns a reflected flow-based noise generator (blue arrows) that generates a manipulated source distribution \(\tilde q_\mathrm{BC}\) for the BC policy, such that the manipulated target distribution \(\tilde p_\mathrm{BC}\) maximizes the \(Q\) value while staying inside the support (red) of the BC policy. This avoids OOD errors while maintaining policy expressiveness.
[Figure: actions generated by BC, DSRL, IFQL, FQL(S), FQL(M), FQL(L), and ReFORM]
  • Compared with the baselines, only ReFORM generates multimodal actions with maximal Q values while staying within the support (red) of the BC policy.
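The bounded-source and reflection ideas above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the velocity field is a stand-in for a learned network, the function names are our own, and we use plain Euler integration with a reflection step at the ball boundary so that generated noise stays inside \(\mathrm{supp}(q_\mathrm{BC})\).

```python
import numpy as np

def sample_uniform_ball(n, d, l, rng):
    """Sample n points uniformly from the d-dimensional ball of radius l."""
    x = rng.standard_normal((n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)          # uniform direction on the sphere
    r = l * rng.uniform(0.0, 1.0, (n, 1)) ** (1.0 / d)     # radius with density proportional to r^(d-1)
    return r * x

def reflect_into_ball(x, l):
    """Reflect points that stepped past radius l back across the boundary
    (valid for small steps, i.e. norms below 2l)."""
    norm = np.linalg.norm(x, axis=1, keepdims=True)
    safe = np.maximum(norm, 1e-12)                         # guard against division by zero
    reflected = x / safe * (2.0 * l - safe)                # r -> 2l - r along the same direction
    return np.where(norm > l, reflected, x)

def integrate_reflected_flow(z, velocity, l, steps=20):
    """Euler-integrate a flow ODE from t=0 to t=1, reflecting at the ball
    boundary after each step so samples remain inside the support."""
    dt = 1.0 / steps
    for k in range(steps):
        z = z + dt * velocity(z, k * dt)
        z = reflect_into_ball(z, l)
    return z
```

With an outward-pushing toy velocity field, every sample still ends inside the radius-\(l\) ball after integration, which is exactly the property the reflected noise generator relies on.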

Experiments

Tasks

[Figures: antmaze-large, cube-single, cube-double, scene]

  • We evaluate ReFORM and the baselines on 40 tasks from the OGBench offline RL benchmark, spanning four environments that include both locomotion and manipulation tasks.
[Figures: clean and noisy datasets for antmaze]
  • We use two kinds of datasets: clean and noisy. The clean dataset consists of trajectories generated by an expert policy, while the noisy dataset consists of trajectories generated by a highly suboptimal, noisy policy.
  • We define the normalized score for each task as the return normalized by the minimum and maximum returns across all algorithms.

Results

[Figure: performance profile]
  • For a given normalized score \(\tau\) (x-axis), the performance profile shows the probability that a given method achieves a score \(\geq\tau\). On the clean dataset, ReFORM achieves greater scores with higher probabilities than all other baselines. The same is true on the noisy dataset except for a small set of normalized scores around 0.9 where ReFORM and FQL(S) have similar probabilities within the statistical margins. Note that ReFORM uses a constant set of hyperparameters for all tasks, while all baselines use hand-tuned hyperparameters for each task.
[Figure: normalized-score bar plots for antmaze-large, cube-single, cube-double, and scene (clean dataset, top row; noisy dataset, bottom row)]
  • We present bar plots of the interquartile mean (IQM) of the normalized scores for each algorithm in each environment with the clean dataset (top row) and the noisy dataset (bottom row). ReFORM consistently achieves the best or comparable results in all environments with both datasets, using a constant set of hyperparameters. DSRL and FQL(M) generally perform second and third best with the clean dataset, but their performance drops when the noisy dataset is used.
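Both evaluation statistics used above can be computed in a few lines. This is a minimal sketch with our own function names; in particular, the IQM here is the simplified variant that drops the lowest and highest quartile of runs before averaging, which matches the common definition for run counts divisible by four:

```python
import numpy as np

def performance_profile(scores, taus):
    """For each threshold tau, return the fraction of runs whose
    normalized score is >= tau (the y-axis of a performance profile)."""
    scores = np.asarray(scores, dtype=float)
    return np.array([(scores >= tau).mean() for tau in taus])

def interquartile_mean(scores):
    """Mean of the middle 50% of scores (IQM): more robust than the mean
    to outlier runs, more statistically efficient than the median."""
    s = np.sort(np.asarray(scores, dtype=float))
    q = len(s) // 4
    return s[q:len(s) - q].mean()
```

A performance profile is read left to right: a method whose curve lies above another's for every \(\tau\) dominates it, which is the sense in which ReFORM dominates the baselines.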

Abstract

Offline reinforcement learning (RL) aims to learn the optimal policy from a fixed behavior policy dataset without additional environment interactions. One common challenge in this setting is the out-of-distribution (OOD) error, which occurs when the policy leaves the training distribution. Prior methods penalize a statistical distance term to keep the policy close to the behavior policy, but this constrains policy improvement and may not completely prevent OOD actions. Another challenge is that the optimal policy distribution can be multimodal and difficult to represent. Recent works apply diffusion or flow policies to address this problem, but it is unclear how to avoid OOD errors while retaining policy expressiveness. We propose ReFORM, an offline RL method based on flow policies that enforces the less restrictive support constraint by construction. ReFORM learns a behavior cloning (BC) flow policy with a bounded source distribution to capture the support of the action distribution, and then optimizes a reflected flow that generates bounded noise for the BC flow, maximizing performance while remaining on the support. Across 40 challenging tasks from the OGBench benchmark with datasets of varying quality, and using a constant set of hyperparameters for all tasks, ReFORM dominates all baselines with hand-tuned hyperparameters on the performance profile curves.

BibTeX

@inproceedings{zhang2026reform,
      title={Re{FORM}: Reflected Flows for On-support Offline {RL} via Noise Manipulation},
      author={Zhang, Songyuan and So, Oswin and Ahmad, H M Sabbir and Yu, Eric Yang and Cleaveland, Matthew and Black, Mitchell and Fan, Chuchu},
      booktitle={The Fourteenth International Conference on Learning Representations},
      year={2026},
}