TreeGRPO

Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models

Zheng Ding*1, Weirui Ye*2

1UC San Diego    2MIT CSAIL

* denotes equal contribution


TreeGRPO achieves the best Pareto trade-off between reward and training efficiency, where training efficiency is measured as normalized single-GPU wall-clock time. Following the score-normalization convention common in RL, the normalized reward scores are computed as \( (r - r_{\text{SD3.5}}) / (r_{\max} - r_{\text{SD3.5}}) \), where \( r_{\max} = \{1.0, 2.0, 10.0, 1.0\} \) for HPS, ImageReward, Aesthetic, and CLIPScore, respectively.
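
To make the normalization concrete, here is a minimal Python sketch of the formula above. It is not part of the released code; the r_max values follow the caption, while the SD3.5 baseline score used in the example is a hypothetical placeholder.

# Normalized reward score from the caption: (r - r_sd35) / (r_max - r_sd35),
# where r_sd35 is the base model's (SD3.5) score and r_max is the
# per-metric maximum listed above.
R_MAX = {"HPS": 1.0, "ImageReward": 2.0, "Aesthetic": 10.0, "CLIPScore": 1.0}

def normalized_score(r: float, r_sd35: float, metric: str) -> float:
    return (r - r_sd35) / (R_MAX[metric] - r_sd35)

# Hypothetical example: an HPS score of 0.32 against a placeholder SD3.5
# baseline of 0.28 normalizes to (0.32 - 0.28) / (1.0 - 0.28) ≈ 0.056.
print(normalized_score(0.32, 0.28, "HPS"))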

Abstract

Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption. We introduce TreeGRPO, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree. From shared initial noise samples, TreeGRPO strategically branches to generate multiple candidate trajectories while efficiently reusing their common prefixes. This tree-structured approach delivers three key advantages: (1) high sample efficiency, achieving better performance with the same number of training samples; (2) fine-grained credit assignment via reward backpropagation, which computes step-specific advantages and overcomes the uniform credit assignment of trajectory-based methods; and (3) amortized computation, where multi-child branching enables multiple policy updates per forward pass. Extensive experiments on both diffusion and flow-based models demonstrate that TreeGRPO achieves 2.4x faster training while establishing a superior Pareto frontier in the efficiency-reward trade-off space. Our method consistently outperforms GRPO baselines across multiple benchmarks and reward models, providing a scalable and effective pathway for RL-based visual generative model alignment.
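
The amortization claim in (3) follows from a simple count: leaves of the rollout tree share their prefix of denoising steps, so the number of denoising-model calls equals the number of tree edges rather than (number of trajectories) × (number of steps). The sketch below is our own illustration rather than the paper's implementation, and the 20-step schedule with branching factor 2 at four steps is a hypothetical example.

def edges_in_tree(num_steps: int, branch_steps: dict[int, int]) -> int:
    """Count denoising-model calls for a tree-structured rollout.

    branch_steps maps a step index to its branching factor (children per
    node at that step); all other steps keep a single child.
    """
    nodes = 1   # root = shared initial noise sample
    edges = 0
    for t in range(num_steps):
        k = branch_steps.get(t, 1)
        edges += nodes * k   # one model call per newly created child state
        nodes *= k
    return edges

# Hypothetical schedule: 20 denoising steps, branching into 2 at steps
# 0, 5, 10, and 15, which yields 16 leaf trajectories.
tree_calls = edges_in_tree(20, {0: 2, 5: 2, 10: 2, 15: 2})
flat_calls = 16 * 20   # 16 independent trajectories of 20 steps each
print(tree_calls, flat_calls)  # 150 vs. 320 model calls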

Method


Overview of the TreeGRPO method: the diffusion denoising process is modeled as a tree in which stochastic transitions create branches (blue solid lines) and deterministic transitions form frozen paths (orange dashed lines). Rewards are computed at the leaf nodes, normalized into advantages, and backpropagated through the tree to assign credit to each decision (edge). These edge-level advantages enable targeted GRPO policy updates, improving learning efficiency by reusing shared prefixes and extracting multiple advantage signals from a single rollout.
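
The sketch below gives one plausible reading of this credit-assignment step, written in plain Python rather than taken from the authors' code: leaf rewards are group-normalized in GRPO style, and each edge receives the mean advantage of the leaves reachable through it. The toy tree and reward values are hypothetical.

from statistics import mean, pstdev

def leaf_advantages(rewards: list[float]) -> list[float]:
    """Group-normalize leaf rewards: (r - mean) / std, GRPO-style."""
    mu, sigma = mean(rewards), pstdev(rewards) + 1e-8
    return [(r - mu) / sigma for r in rewards]

def edge_advantage(leaf_adv: dict[str, float], subtree_leaves: list[str]) -> float:
    """Advantage assigned to an edge = mean advantage over its subtree's leaves."""
    return mean(leaf_adv[leaf] for leaf in subtree_leaves)

# Toy tree: the root branches into two edges; the first edge's subtree
# ends in leaves a and b, the second in leaves c and d (hypothetical rewards).
rewards = {"a": 0.91, "b": 0.85, "c": 0.78, "d": 0.80}
adv = dict(zip(rewards, leaf_advantages(list(rewards.values()))))
print(edge_advantage(adv, ["a", "b"]))  # positive: this branch beat the group mean
print(edge_advantage(adv, ["c", "d"]))  # negative: this branch fell below it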

Results

Result 1

Trained with the HPS-v2.1 reward model and evaluated on four reward models.

Result 2

Trained with the HPS-v2.1 and CLIPScore reward models and evaluated on four reward models.

BibTeX

@article{ding2025treegrpo,
  title={TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models},
  author={Ding, Zheng and Ye, Weirui},
  journal={arXiv preprint arXiv:2512.08153},
  year={2025}
}