Fine-tuning diffusion models via online reinforcement learning (RL) suffers from reward hacking: proxy scores increase while real image quality deteriorates and generation diversity collapses. To balance the competing demands of sample efficiency, effective exploration, and resistance to reward hacking, we propose Gated and Adaptive Regularization with Diversity-aware Optimization (GARDO), a versatile framework compatible with various RL algorithms. GARDO can be applied to both Flow-GRPO and DiffusionNFT.
- Our key insight is that regularization need not be applied universally; instead, selectively penalizing the subset of samples that exhibit high uncertainty is highly effective.
- To address the exploration challenge, GARDO introduces an adaptive regularization mechanism wherein the reference model is periodically updated to match the capabilities of the online policy, ensuring a relevant regularization target.
- To address mode collapse in RL, GARDO amplifies the rewards of high-quality samples that also exhibit high diversity, encouraging mode coverage without destabilizing optimization (see the sketch after this list).
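
For intuition, here is a minimal PyTorch sketch of how gated regularization and diversity-aware advantage shaping could be combined. It is an illustrative simplification, not the code in this repository; every name and default in it (`uncertainty`, `div_scores`, `kl_coef`, the 0.8 quantile gate, the 0.5 diversity bonus) is a hypothetical choice.

```python
import torch

def gardo_advantage(rewards, kl_per_sample, uncertainty, div_scores,
                    kl_coef=0.1, gate_quantile=0.8, div_bonus=0.5):
    """Sketch of GARDO-style advantage shaping (hypothetical names/values).

    - Gated regularization: the KL penalty applies only to samples whose
      uncertainty falls above the `gate_quantile` threshold.
    - Diversity-aware shaping: rewards of samples that are both
      high-quality and high-diversity are amplified before the
      group-relative advantage is computed.
    """
    # Gate: penalize only the most uncertain samples.
    threshold = torch.quantile(uncertainty, gate_quantile)
    gate = (uncertainty >= threshold).float()

    # Amplify rewards for high-quality, high-diversity samples.
    good = (rewards > rewards.mean()) & (div_scores > div_scores.mean())
    shaped = torch.where(good, rewards * (1.0 + div_bonus), rewards)

    # Group-relative advantage (GRPO-style), minus the gated KL penalty.
    adv = (shaped - shaped.mean()) / (shaped.std() + 1e-8)
    return adv - kl_coef * gate * kl_per_sample
```

In an actual training loop, these shaped advantages would replace the plain group-relative advantages fed to the policy-gradient update.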
## Set up the environment
```bash
git clone https://github.com/tinnerhrhe/GARDO.git
cd GARDO
conda create -n gardo python=3.10 -y
conda activate gardo
pip install -e .
```
👉 Please follow the Flow-GRPO instructions for reward preparation. We support GenEval, OCR, PickScore, ClipScore, HPSv3, Aesthetic, ImageReward, and UnifiedReward for both training and evaluation.
👉 To enable diversity-aware advantage shaping, download the DINOv3 model and set its path in the code:
```bash
huggingface-cli download timm/vit_large_patch16_dinov3.lvd1689m --local-dir <your path>
```
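
As a reference for what the DINOv3 features are for, the sketch below scores each image's diversity as one minus its mean cosine similarity to the rest of the batch. This is a simplified stand-in for the shaping in the training code; it assumes `timm` can load the checkpoint by name (point it at your local download if you train offline), and the function name `diversity_scores` is ours, not the repository's.

```python
import timm
import torch
import torch.nn.functional as F

# Load DINOv3 as a pure feature extractor; num_classes=0 makes timm
# return pooled embeddings instead of classification logits.
model = timm.create_model(
    "vit_large_patch16_dinov3.lvd1689m", pretrained=True, num_classes=0
).eval()

# Standard timm preprocessing resolved from the model's pretrained config.
data_cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_cfg, is_training=False)

@torch.no_grad()
def diversity_scores(pil_images):
    """Per-image diversity: 1 - mean cosine similarity to the other
    images generated for the same prompt (higher = more distinct)."""
    batch = torch.stack([transform(img) for img in pil_images])
    feats = F.normalize(model(batch), dim=-1)
    sim = feats @ feats.T  # pairwise cosine similarities
    n = sim.size(0)
    mean_sim_to_others = (sim.sum(dim=1) - sim.diagonal()) / (n - 1)
    return 1.0 - mean_sim_to_others
```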
After downloading the required models and setting up the environment, run the following script to start GARDO training:

```bash
bash scripts/single_node/grpo_gardo_sd3.sh
```
If you find our work helpful, please cite our paper:
```bibtex
@misc{he2025gardo,
  title={GARDO: Reinforcing Diffusion Models without Reward Hacking},
  author={Haoran He and Yuxiao Ye and Jie Liu and Jiajun Liang and Zhiyong Wang and Ziyang Yuan and Xintao Wang and Hangyu Mao and Pengfei Wan and Ling Pan},
  year={2025},
  eprint={2512.24138},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```

