Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

UnifiedReward Team

🔥 News

Please leave us a star ⭐ if you find this work helpful.

[Figure: Pref-GRPO pipeline overview]

🔧 Environment Setup

  1. Clone this repository and navigate to the folder:
git clone https://github.com/CodeGoat24/UnifiedReward.git
cd UnifiedReward/Pref-GRPO
  2. Install the training package:
conda create -n PrefGRPO python=3.12
conda activate PrefGRPO

bash env_setup.sh fastvideo

git clone https://github.com/mlfoundations/open_clip
cd open_clip
pip install -e .
cd ..
  3. Download models:
huggingface-cli download CodeGoat24/UnifiedReward-2.0-qwen3vl-8b
huggingface-cli download CodeGoat24/UnifiedReward-Think-qwen3vl-8b

wget https://huggingface.co/apple/DFN5B-CLIP-ViT-H-14-378/resolve/main/open_clip_pytorch_model.bin
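If you prefer the UnifiedReward checkpoints above in explicit local directories (for example, to reference them from fastvideo/rewards/reward_paths.py), huggingface-cli also supports --local-dir; the target paths below are placeholders.

# Optional: download to explicit local directories (paths are placeholders)
huggingface-cli download CodeGoat24/UnifiedReward-2.0-qwen3vl-8b --local-dir ./checkpoints/UnifiedReward-2.0-qwen3vl-8b
huggingface-cli download CodeGoat24/UnifiedReward-Think-qwen3vl-8b --local-dir ./checkpoints/UnifiedReward-Think-qwen3vl-8b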

💻 Training

1. Deploy vLLM server

  1. Install vLLM
pip install "vllm>=0.11.0"

pip install qwen-vl-utils==0.0.14
  2. Start the server:
bash vllm_utils/vllm_server_UnifiedReward_Think.sh
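Once the server is up, you can optionally sanity-check it. The sketch below assumes the script launches vLLM's OpenAI-compatible API on localhost:8000; check vllm_utils/vllm_server_UnifiedReward_Think.sh for the actual host and port.

# List the served models to confirm the reward model is loaded
# (host/port are assumptions; adjust to match the server script)
curl http://localhost:8000/v1/models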

2. Preprocess training data

We use the training prompts from UniGenBench, provided in ./data/unigenbench_train_data.txt.

# FLUX.1-dev
bash fastvideo/data_preprocess/preprocess_flux_rl_embeddings.sh

# Qwen-Image
pip install diffusers==0.35.0 peft==0.17.0 transformers==4.56.0

bash fastvideo/data_preprocess/preprocess_qwen_image_rl_embeddings.sh

# Wan2.1
bash fastvideo/data_preprocess/preprocess_wan_2_1_rl_embeddings.sh

3. Train

# FLUX.1-dev
## UnifiedReward-Think for Pref-GRPO
bash scripts/full_train/finetune_prefgrpo_flux.sh

## UnifiedReward for Point Score-based GRPO
bash scripts/full_train/finetune_unifiedreward_flux.sh

# Qwen-Image
## UnifiedReward-Think for Pref-GRPO
bash scripts/full_train/finetune_prefgrpo_qwenimage_grpo.sh

## UnifiedReward for Point Score-based GRPO
bash scripts/full_train/finetune_unifiedreward_qwenimage_grpo.sh

# Wan2.1
## Pref-GRPO
bash scripts/full_train/finetune_prefgrpo_wan_2_1.sh

🧩 Reward Models & Usage

We support multiple reward models via the dispatcher in fastvideo/rewards/dispatcher.py. Reward model checkpoint paths are configured in fastvideo/rewards/reward_paths.py.

Supported reward models:

  • aesthetic
  • clip
  • hpsv2
  • hpsv3
  • pickscore
  • unifiedreward_think
  • unifiedreward_alignment
  • unifiedreward_style
  • unifiedreward_coherence
  • videoalign

Set rewards in your training/eval scripts

Use --reward_spec to choose which rewards to compute and (optionally) their weights.

Examples:

# Use a list of rewards (all weights = 1.0)
--reward_spec "unifiedreward_think,clip,,hpsv3"

# Weighted mix
--reward_spec "unifiedreward_alignment:0.5,unifiedreward_style:1.0,unifiedreward_coherence:0.5"

# JSON formats are also supported
--reward_spec '{"clip":0.5,"aesthetic":1.0,"hpsv2":0.5}'
--reward_spec '["clip","aesthetic","hpsv2"]'

🚀 Inference and Evaluation

We use the test prompts from UniGenBench, provided in ./data/unigenbench_test_data.csv.

# FLUX.1-dev
bash inference/flux_dist_infer.sh

# Qwen-Image
bash inference/qwen_image_dist_infer.sh

# Wan2.1
bash inference/wan_dist_infer.sh

Then, evaluate the outputs following the UniGenBench evaluation protocol.

📊 Reward-based Image Scoring (UniGenBench)

We provide a script to score a folder of generated images on UniGenBench using supported reward models.

GPU_NUM=8 bash tools/eval_quality.sh

Edit tools/eval_quality.sh to set the following options (a hypothetical example follows this list):

  • --image_dir: path to your UniGenBench generated images
  • --prompt_csv: prompt file (default: data/unigenbench_test_data.csv)
  • --reward_spec: the reward models (and weights) to use
  • --api_url: UnifiedReward server endpoint (if using UnifiedReward-based rewards)
  • --output_json: output file for scores
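For reference, these options might be filled in roughly as follows; every value is a placeholder to adjust for your setup, and --api_url should point at the vLLM server deployed above if UnifiedReward-based rewards are used.

# Hypothetical flag values inside tools/eval_quality.sh -- all paths are placeholders
--image_dir ./outputs/unigenbench_images \
--prompt_csv ./data/unigenbench_test_data.csv \
--reward_spec "unifiedreward_alignment:0.5,unifiedreward_coherence:0.5,hpsv3:1.0" \
--api_url http://localhost:8000 \
--output_json ./results/unigenbench_scores.json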

📧 Contact

If you have any comments or questions, please open an issue or contact Yibin Wang.

🤗 Acknowledgments

Our training code is based on DanceGRPO, Flow-GRPO, and FastVideo.

We also use UniGenBench for T2I model semantic consistency evaluation.

Thanks to all the contributors!

⭐ Citation

@article{Pref-GRPO&UniGenBench,
  title={Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning},
  author={Wang, Yibin and Li, Zhimin and Zang, Yuhang and Zhou, Yujie and Bu, Jiazi and Wang, Chunyu and Lu, Qinglin and Jin, Cheng and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2508.20751},
  year={2025}
}
