TreeGRPO

Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models

Zheng Ding*1, Weirui Ye*2

1UC San Diego    2MIT CSAIL

* denotes equal contribution


TreeGRPO achieves the best Pareto trade-off between reward and training efficiency, where training efficiency is measured as normalized single-GPU wall-clock time. Following the score-normalization convention common in RL, the normalized reward scores are computed as \( (r - r_{\text{SD3.5}}) / (r_{\max} - r_{\text{SD3.5}}) \), where \( r_{\max} = \{1.0, 2.0, 10.0, 1.0\} \) for HPS, ImageReward, Aesthetic, and CLIPScore, respectively.
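
To make the normalization concrete, here is a minimal Python sketch of the formula above. It is not part of the released code; the r_max values follow the caption, while the SD3.5 baseline score used in the example is a hypothetical placeholder.

# Normalized reward score from the caption: (r - r_sd35) / (r_max - r_sd35),
# where r_sd35 is the base model's (SD3.5) score and r_max is the
# per-metric maximum listed above.
R_MAX = {"HPS": 1.0, "ImageReward": 2.0, "Aesthetic": 10.0, "CLIPScore": 1.0}

def normalized_score(r: float, r_sd35: float, metric: str) -> float:
    return (r - r_sd35) / (R_MAX[metric] - r_sd35)

# Hypothetical example: an HPS score of 0.32 against a placeholder SD3.5
# baseline of 0.28 normalizes to (0.32 - 0.28) / (1.0 - 0.28) ≈ 0.056.
print(normalized_score(0.32, 0.28, "HPS"))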

Abstract

Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce TreeGRPO, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree. From shared initial noise samples, TreeGRPO strategically branches to generate multiple candidate trajectories while efficiently reusing their common prefixes. This tree-structured approach delivers three key advantages: (1) high sample efficiency, achieving better performance with the same number of training samples; (2) fine-grained credit assignment via reward backpropagation, which computes step-specific advantages and overcomes the uniform credit assignment of trajectory-based methods; and (3) amortized computation, where multi-child branching enables multiple policy updates per forward pass. Extensive experiments on both diffusion and flow-based models demonstrate that TreeGRPO achieves 2.4x faster training while establishing a superior Pareto frontier in the efficiency-reward trade-off space. Our method consistently outperforms GRPO baselines across multiple benchmarks and reward models, providing a scalable and effective pathway for RL-based visual generative model alignment.
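
The amortization claim in (3) follows from a simple count: leaves of the rollout tree share their prefix of denoising steps, so the number of denoising-model calls equals the number of tree edges rather than (number of trajectories) × (number of steps). The sketch below is our own illustration rather than the paper's implementation, and the 20-step schedule with branching factor 2 at four steps is a hypothetical example.

def edges_in_tree(num_steps: int, branch_steps: dict[int, int]) -> int:
    """Count denoising-model calls for a tree-structured rollout.

    branch_steps maps a step index to its branching factor (children per
    node at that step); all other steps keep a single child.
    """
    nodes = 1   # root = shared initial noise sample
    edges = 0
    for t in range(num_steps):
        k = branch_steps.get(t, 1)
        edges += nodes * k   # one model call per newly created child state
        nodes *= k
    return edges

# Hypothetical schedule: 20 denoising steps, branching into 2 at steps
# 0, 5, 10, and 15, which yields 16 leaf trajectories.
tree_calls = edges_in_tree(20, {0: 2, 5: 2, 10: 2, 15: 2})
flat_calls = 16 * 20   # 16 independent trajectories of 20 steps each
print(tree_calls, flat_calls)  # 150 vs. 320 model calls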

Method


Overview of the TreeGRPO method: the diffusion denoising process is modeled as a tree in which stochastic transitions create branches (blue solid lines) and deterministic transitions form frozen paths (orange dashed lines). Rewards are computed at the leaf nodes, normalized into advantages, and backpropagated through the tree to assign credit to each decision (edge). These edge-level advantages enable targeted GRPO policy updates, improving learning efficiency by reusing shared prefixes and extracting multiple advantage signals from a single rollout.
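
The sketch below gives one plausible reading of this credit-assignment step, written in plain Python rather than taken from the authors' code: leaf rewards are group-normalized in GRPO style, and each edge receives the mean advantage of the leaves reachable through it. The toy tree and reward values are hypothetical.

from statistics import mean, pstdev

def leaf_advantages(rewards: list[float]) -> list[float]:
    """Group-normalize leaf rewards: (r - mean) / std, GRPO-style."""
    mu, sigma = mean(rewards), pstdev(rewards) + 1e-8
    return [(r - mu) / sigma for r in rewards]

def edge_advantage(leaf_adv: dict[str, float], subtree_leaves: list[str]) -> float:
    """Advantage assigned to an edge = mean advantage over its subtree's leaves."""
    return mean(leaf_adv[leaf] for leaf in subtree_leaves)

# Toy tree: the root branches into two edges; the first edge's subtree
# ends in leaves a and b, the second in leaves c and d (hypothetical rewards).
rewards = {"a": 0.91, "b": 0.85, "c": 0.78, "d": 0.80}
adv = dict(zip(rewards, leaf_advantages(list(rewards.values()))))
print(edge_advantage(adv, ["a", "b"]))  # positive: this branch beat the group mean
print(edge_advantage(adv, ["c", "d"]))  # negative: this branch fell below it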

Results

Result 1

Trained with the HPS-v2.1 reward model and evaluated on four reward models.

Result 2

Trained with the HPS-v2.1 and CLIPScore reward models and evaluated on four reward models.

BibTeX

@article{ding2025treegrpo,
  title={TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models},
  author={Ding, Zheng and Ye, Weirui},
  journal={arXiv preprint arXiv:2512.08153},
  year={2025}
}