Abstract
Method
The TreeGRPO method models the diffusion denoising process as a tree structure in which stochastic transitions create branches (blue solid lines) and deterministic transitions form frozen paths (orange dashed lines). Rewards are computed at the leaf nodes and normalized into advantages, which are then backpropagated through the tree to assign credit to each decision (edge). These edge-level advantages enable targeted GRPO policy updates, improving learning efficiency by leveraging shared prefixes and multiple advantage signals from a single rollout. A minimal illustrative sketch of this credit-assignment step follows below.
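Since the page only summarizes the method, the snippet below is a minimal sketch (not the authors' implementation) of how edge-level advantages could be derived from leaf rewards: rewards are group-normalized GRPO-style, and each edge is credited with the mean advantage of the leaves in its subtree. The `Node` class, the subtree-mean aggregation rule, and all names are assumptions for illustration only.

```python
import numpy as np


class Node:
    """A node in the denoising tree; the edge from its parent corresponds
    to one stochastic denoising transition (a branch)."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.reward = None      # set only at leaf nodes
        self.advantage = None   # credit assigned to the edge entering this node


def iter_nodes(root):
    """Pre-order traversal of the tree."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(node.children)


def assign_edge_advantages(root):
    """Normalize leaf rewards into advantages, then backpropagate them so
    every edge (decision) receives a credit signal.

    Sketch assumption: an edge's advantage is the mean normalized advantage
    of all leaves reachable through it.
    """
    # 1. Collect rewards at the leaf nodes.
    leaves = [n for n in iter_nodes(root) if not n.children]
    rewards = np.array([leaf.reward for leaf in leaves], dtype=np.float64)

    # 2. Group-normalize rewards into advantages (GRPO-style).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # 3. Backpropagate: each node's advantage is the mean advantage
    #    of the leaves in its subtree.
    def subtree_advantage(node):
        if not node.children:
            a = adv[leaves.index(node)]
        else:
            a = float(np.mean([subtree_advantage(c) for c in node.children]))
        node.advantage = a
        return a

    subtree_advantage(root)
```

The edge-level advantages produced this way could then weight per-step GRPO policy-gradient updates, so that branches sharing a prefix reuse the same rollout while still receiving distinct credit.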
Results
Trained with the HPS-v2.1 reward model and evaluated on four reward models.
Trained with the HPS-v2.1 and ClipScore reward models and evaluated on four reward models.
BibTeX
@article{ding2025treegrpo,
  title={TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models},
  author={Ding, Zheng and Ye, Weirui},
  journal={arXiv preprint arXiv:2512.08153},
  year={2025}
}