📝 Paper · 🚀 Project Page
This repository implements **Think-RM**, a training framework that enables long-horizon reasoning in GenRMs by modeling an internal thinking process. In addition, it implements a pairwise RLHF pipeline that directly optimizes policies with pairwise preferences. The repository includes the following components:
- **EvalGenRM**: Generative reward model evaluation.
- **OpenRLHF**: Warm-up SFT training.
- **VERL**: Rule-based RL.
- **PairwiseRLHF**: Pairwise RLHF for GenRMs.
```
Think-RM/
├── EvalGenRM/     # GenRM evaluation
├── OpenRLHF/      # SFT warm-up
├── VERL/          # Rule-based RL
└── PairwiseRLHF/  # Pairwise RLHF for GenRMs
```
## Pretrained Models

- 🧠 Binary CoT-GenRM-8B (Ground Truth)
- 🧠 Multiclass CoT-GenRM-8B (Ground Truth)
- 🧠 Binary CoT-GenRM-8B (Model-Generated)
- 🧠 Multiclass CoT-GenRM-8B (Model-Generated)
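A minimal sketch of loading one of the released checkpoints with Hugging Face `transformers`; the hub ID below is a placeholder, so substitute the ID from the model links above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder hub ID -- replace with the checkpoint you selected above.
model_id = "<org>/Binary-CoT-GenRM-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # keep the checkpoint's native precision
    device_map="auto",    # requires the `accelerate` package
)
```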
## Table of Contents

1. GenRM Evaluation
   - 1.1 Setup
   - 1.2 Usage
2. Warm-up SFT
   - 2.1 Setup
   - 2.2 Usage
3. Rule-based RL
   - 3.1 Setup
   - 3.2 Usage
4. Pairwise RLHF for GenRMs
   - 4.1 Setup
   - 4.2 Usage
## 1. GenRM Evaluation

### 1.1 Setup

See 3.1 Setup.

### 1.2 Usage

To evaluate a GenRM, run:

```bash
export CUDA_VISIBLE_DEVICES="0"
python3 EvalGenRM/eval.py \
--max_tokens 16384 \
--temperature 0.6 \
--world_size 1 \
--model_path meta-llama/Llama-3.1-8B-Instruct \
--dataset nvidia/HelpSteer2 \
--template naive-reasoning-binary \
--save_dir /workspace/results \
--custom_chat_template "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n<think>\n' }}{% endif %}"To enable vertical inference-time scaling (with scaling factor
To enable vertical inference-time scaling (with scaling factor `m`), run:

```bash
export CUDA_VISIBLE_DEVICES="0"
python3 EvalGenRM/eval_bn.py \
--max_tokens 16384 \
--temperature 0.6 \
--world_size 1 \
--model_path meta-llama/Llama-3.1-8B-Instruct \
--dataset nvidia/HelpSteer2 \
--template naive-reasoning-binary \
--save_dir /workspace/results \
--custom_chat_template "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n<think>\n' }}{% endif %}" \
--m 4
```
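Here `--m 4` samples four independent reasoning traces per example. As an illustration of how such samples can be combined (a sketch only; the actual aggregation lives in `EvalGenRM/eval_bn.py` and may differ, e.g., it could average scores instead), binary verdicts can be aggregated by majority vote:

```python
from collections import Counter

def majority_vote(verdicts: list[str]) -> str:
    """Combine m sampled binary verdicts (e.g., "A" vs. "B") by majority vote.

    Illustrative sketch only; ties are broken by first-seen order.
    """
    return Counter(verdicts).most_common(1)[0][0]

print(majority_vote(["A", "B", "A", "A"]))  # -> "A"
```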
## 2. Warm-up SFT

### 2.1 Setup

The implementation is based on OpenRLHF. Install the necessary dependencies:

```bash
pip install -r requirements_sft.txt
```

or follow the setup instructions from OpenRLHF.
### 2.2 Usage

Choose the appropriate training script based on the output type:

- **Binary**: run `run_binary_max_sft.sh`
- **Multiclass**: run `run_multiclass_max_sft.sh`
## 3. Rule-based RL

### 3.1 Setup

The implementation is based on verl. Install the necessary dependencies:

```bash
pip install -r requirements.txt
```

or follow the setup instructions from verl.
### 3.2 Usage

Choose the appropriate training script based on the output type (a sketch of the underlying rule-based reward follows the list):

- **Binary**: run `run_binary_max_grpo.sh`
- **Multiclass**: run `run_multiclass_max_grpo.sh`
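In rule-based RL, the reward comes from a fixed rule applied to the model's final verdict rather than from a learned reward model. A minimal sketch of such a rule, assuming the GenRM closes its reasoning with `</think>` and then states a verdict (the exact parsing in `VERL/` may differ):

```python
import re

def rule_based_reward(generation: str, label: str) -> float:
    """Reward 1.0 if the verdict after the </think> block matches the label, else 0.0.

    Assumption for this sketch: the final answer is the last standalone
    "A" or "B" token after the reasoning block.
    """
    answer_region = generation.split("</think>")[-1]
    verdicts = re.findall(r"\b([AB])\b", answer_region)
    return 1.0 if verdicts and verdicts[-1] == label else 0.0

print(rule_based_reward("<think>B is rude.</think>\nFinal verdict: A", "A"))  # -> 1.0
```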
## 4. Pairwise RLHF for GenRMs

### 4.1 Setup

The implementation is based on verl. See 3.1 Setup.
### 4.2 Usage

This setup assumes two nodes:

- **Node 1**: vLLM-based GenRM inference server
- **Node 2**: RL training
#### Step 1: Launch GenRM vLLM Servers

1. Select a GenRM from the Pretrained Models section above.
2. Launch the servers in either configuration (a connectivity check follows the list):
   - 8 vLLM servers (1 GPU each): run `load_vllm_server.sh`
   - 4 vLLM servers (2 GPUs each): run `load_vllm_server_2gpu.sh`
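Once the servers are up, you can sanity-check one from the training node. vLLM's OpenAI-compatible server exposes `/v1/models`; the address and port below are placeholders (8000 is vLLM's default, adjust to whatever the launch scripts use):

```python
import requests

# Placeholder -- substitute Node 1's IP and the port from the launch script.
SERVER = "http://<node1-ip>:8000"

# The OpenAI-compatible server lists its loaded model(s) under /v1/models.
resp = requests.get(f"{SERVER}/v1/models", timeout=10)
resp.raise_for_status()
print(resp.json())
```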
#### Step 2: Run RLHF Training

Choose the appropriate training script based on the model:

- **Multiclass Think-RM**: run `run_rlhf_multiclass.sh`
- **Binary Think-RM**: run `run_rlhf_binary.sh`
- **Multiclass CoT-GenRM (Ground Truth)**: run `run_rlhf_groundtruth_multiclass.sh`
- **Multiclass CoT-GenRM (Model-Generated)**: run `run_rlhf_generated_multiclass.sh`
Make sure to update the vLLM server IP address in each script.
## Citation

If you find this work helpful, please cite our paper:

```bibtex
@article{hong2025think,
  title={Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models},
  author={Hong, Ilgee and Yu, Changlong and Qiu, Liang and Yan, Weixiang and Xu, Zhenghao and Jiang, Haoming and Zhang, Qingru and Lu, Qin and Liu, Xin and Zhang, Chao and others},
  journal={arXiv preprint arXiv:2505.16265},
  year={2025}
}
```