📝 Paper · 🚀 Project Page
This repository implements **Think-RM**, a training framework that enables long-horizon reasoning in GenRMs by modeling an internal thinking process. In addition, it implements a pairwise RLHF pipeline that directly optimizes policies with pairwise preferences. The repository includes the following components:
- **EvalGenRM**: Generative reward model evaluation.
- **OpenRLHF**: Warm-up SFT training.
- **VERL**: Rule-based RL.
- **PairwiseRLHF**: Pairwise RLHF for GenRMs.
```
Think-RM/
├── EvalGenRM/     # GenRM evaluation
├── OpenRLHF/      # SFT warm-up
├── VERL/          # Rule-based RL
└── PairwiseRLHF/  # Pairwise RLHF for GenRMs
```
## Pretrained Models

- 🧠 Binary CoT-GenRM-8B (Ground Truth)
- 🧠 Multiclass CoT-GenRM-8B (Ground Truth)
- 🧠 Binary CoT-GenRM-8B (Model-Generated)
- 🧠 Multiclass CoT-GenRM-8B (Model-Generated)
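A minimal sketch of loading one of the released checkpoints with Hugging Face `transformers`; the hub ID below is a placeholder, so substitute the ID from the model links above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder hub ID -- replace with the checkpoint you selected above.
model_id = "<org>/Binary-CoT-GenRM-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the checkpoint's native precision
    device_map="auto",    # requires the `accelerate` package
)
```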
## Table of Contents

1. GenRM Evaluation
   - 1.1 Setup
   - 1.2 Usage
2. Warm-up SFT
   - 2.1 Setup
   - 2.2 Usage
3. Rule-based RL
   - 3.1 Setup
   - 3.2 Usage
4. Pairwise RLHF for GenRMs
   - 4.1 Setup
   - 4.2 Usage
## 1. GenRM Evaluation

### 1.1 Setup

See 3.1 Setup.

### 1.2 Usage

To evaluate a GenRM, run:

```bash
export CUDA_VISIBLE_DEVICES="0"
python3 EvalGenRM/eval.py \
--max_tokens 16384 \
--temperature 0.6 \
--world_size 1 \
--model_path meta-llama/Llama-3.1-8B-Instruct \
--dataset nvidia/HelpSteer2 \
--template naive-reasoning-binary \
--save_dir /workspace/results \
--custom_chat_template "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n<think>\n' }}{% endif %}"To enable vertical inference-time scaling (with scaling factor
To enable vertical inference-time scaling (with scaling factor `m`), run:

```bash
export CUDA_VISIBLE_DEVICES="0"
python3 EvalGenRM/eval_bn.py \
--max_tokens 16384 \
--temperature 0.6 \
--world_size 1 \
--model_path meta-llama/Llama-3.1-8B-Instruct \
--dataset nvidia/HelpSteer2 \
--template naive-reasoning-binary \
--save_dir /workspace/results \
--custom_chat_template "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n<think>\n' }}{% endif %}" \
--m 4
```
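Here `--m 4` samples four independent reasoning traces per example. As an illustration of how such samples can be combined (a sketch only; the actual aggregation lives in `EvalGenRM/eval_bn.py` and may differ, e.g., it could average scores instead), binary verdicts can be aggregated by majority vote:

```python
from collections import Counter

def majority_vote(verdicts: list[str]) -> str:
    """Combine m sampled binary verdicts (e.g., "A" vs. "B") by majority vote.

    Illustrative sketch only; ties are broken by first-seen order.
    """
    return Counter(verdicts).most_common(1)[0][0]

print(majority_vote(["A", "B", "A", "A"]))  # -> "A"
```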
## 2. Warm-up SFT

### 2.1 Setup

The implementation is based on OpenRLHF. Install the necessary dependencies:

```bash
pip install -r requirements_sft.txt
```

or follow the setup instructions from OpenRLHF.
### 2.2 Usage

Choose the appropriate training script based on the output type:

- **Binary**: run `run_binary_max_sft.sh`
- **Multiclass**: run `run_multiclass_max_sft.sh`
## 3. Rule-based RL

### 3.1 Setup

The implementation is based on verl. Install the necessary dependencies:

```bash
pip install -r requirements.txt
```

or follow the setup instructions from verl.
### 3.2 Usage

Choose the appropriate training script based on the output type (a sketch of the underlying rule-based reward follows the list):

- **Binary**: run `run_binary_max_grpo.sh`
- **Multiclass**: run `run_multiclass_max_grpo.sh`
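In rule-based RL, the reward comes from a fixed rule applied to the model's final verdict rather than from a learned reward model. A minimal sketch of such a rule, assuming the GenRM closes its reasoning with `</think>` and then states a verdict (the exact parsing in `VERL/` may differ):

```python
import re

def rule_based_reward(generation: str, label: str) -> float:
    """Reward 1.0 if the verdict after the </think> block matches the label, else 0.0.

    Assumption for this sketch: the final answer is the last standalone
    "A" or "B" token after the reasoning block.
    """
    answer_region = generation.split("</think>")[-1]
    verdicts = re.findall(r"\b([AB])\b", answer_region)
    return 1.0 if verdicts and verdicts[-1] == label else 0.0

print(rule_based_reward("<think>B is rude.</think>\nFinal verdict: A", "A"))  # -> 1.0
```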
## 4. Pairwise RLHF for GenRMs

### 4.1 Setup

The implementation is based on verl. See 3.1 Setup.
### 4.2 Usage

This setup assumes two nodes:

- **Node 1**: vLLM-based GenRM inference server
- **Node 2**: RL training
#### Step 1: Launch GenRM vLLM Servers

1. Select a GenRM from the Pretrained Models section above.
2. Launch the servers in either configuration (a connectivity check follows the list):
   - 8 vLLM servers (1 GPU each): run `load_vllm_server.sh`
   - 4 vLLM servers (2 GPUs each): run `load_vllm_server_2gpu.sh`
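Once the servers are up, you can sanity-check one from the training node. vLLM's OpenAI-compatible server exposes `/v1/models`; the address and port below are placeholders (8000 is vLLM's default, adjust to whatever the launch scripts use):

```python
import requests

# Placeholder -- substitute Node 1's IP and the port from the launch script.
SERVER = "http://<node1-ip>:8000"

# The OpenAI-compatible server lists its loaded model(s) under /v1/models.
resp = requests.get(f"{SERVER}/v1/models", timeout=10)
resp.raise_for_status()
print(resp.json())
```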
#### Step 2: Run RLHF Training

Choose the appropriate training script based on the model:

- **Multiclass Think-RM**: run `run_rlhf_multiclass.sh`
- **Binary Think-RM**: run `run_rlhf_binary.sh`
- **Multiclass CoT-GenRM (Ground Truth)**: run `run_rlhf_groundtruth_multiclass.sh`
- **Multiclass CoT-GenRM (Model-Generated)**: run `run_rlhf_generated_multiclass.sh`
Make sure to update the vLLM server IP address in each script.
## Citation

If you find this work helpful, please cite our paper:

```bibtex
@article{hong2025think,
  title={Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models},
  author={Hong, Ilgee and Yu, Changlong and Qiu, Liang and Yan, Weixiang and Xu, Zhenghao and Jiang, Haoming and Zhang, Qingru and Lu, Qin and Liu, Xin and Zhang, Chao and others},
  journal={arXiv preprint arXiv:2505.16265},
  year={2025}
}
```