Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models

📝 Paper · 🚀 Project Page

Figure: Overview of the Think-RM training framework.

Overview

This repository implements Think-RM (NeurIPS 2025), a training framework that enables long-horizon reasoning in generative reward models (GenRMs) by modeling an internal thinking process. In addition, we implement a pairwise RLHF pipeline that directly optimizes policies using pairwise preferences. The repository includes the following components:

  • EvalGenRM: Generative reward model evaluation.

  • OpenRLHF: Warm-up SFT training.

  • VERL: Rule-based RL.

  • PairwiseRLHF: Pairwise RLHF for GenRMs.

    Think-RM/
    ├── EvalGenRM/            # GenRM evaluation
    ├── OpenRLHF/             # SFT warm-up
    ├── VERL/                 # Rule-based RL
    └── PairwiseRLHF/         # Pairwise RLHF for GenRMs

🤗 Pretrained Models

🧠 Binary Think-RM-8B

🧠 Multiclass Think-RM-8B

🧠 Binary Think-RM-3B

🧠 Binary CoT-GenRM-8B (Ground Truth)

🧠 Multiclass CoT-GenRM-8B (Ground Truth)

🧠 Binary CoT-GenRM-8B (Model-Generated)

🧠 Multiclass CoT-GenRM-8B (Model-Generated)


🤗 Long CoT Data

📄 Binary Long CoT

📄 Multiclass Long CoT


📝 Table of Contents

  1. GenRM Evaluation
    1.1 Setup
    1.2 Usage

  2. Warm-up SFT
    2.1 Setup
    2.2 Usage

  3. Rule-based RL
    3.1 Setup
    3.2 Usage

  4. Pairwise RLHF for GenRMs
    4.1 Setup
    4.2 Usage


1. GenRM Evaluation

1.1 Setup

See 3.1 Setup.

1.2 Usage

To evaluate a GenRM, run:

export CUDA_VISIBLE_DEVICES="0"
python3 EvalGenRM/eval.py \
    --max_tokens 16384 \
    --temperature 0.6 \
    --world_size 1 \
    --model_path meta-llama/Llama-3.1-8B-Instruct \
    --dataset nvidia/HelpSteer2 \
    --template naive-reasoning-binary \
    --save_dir /workspace/results \
    --custom_chat_template "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n<think>\n' }}{% endif %}"

To enable vertical inference-time scaling (with scaling factor $m$), run:

export CUDA_VISIBLE_DEVICES="0"
python3 EvalGenRM/eval_bn.py \
    --max_tokens 16384 \
    --temperature 0.6 \
    --world_size 1 \
    --model_path meta-llama/Llama-3.1-8B-Instruct \
    --dataset nvidia/HelpSteer2 \
    --template naive-reasoning-binary \
    --save_dir /workspace/results \
    --custom_chat_template "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n<think>\n' }}{% endif %}" \
    --m 4

2. Warm-up SFT

The implementation is based on OpenRLHF.

2.1 Setup

Install the necessary dependencies

pip install -r requirements_sft.txt

or follow the setup instructions from OpenRLHF.

2.2 Usage

Choose the appropriate training script based on the output type (binary or multiclass):
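As a minimal sketch, a warm-up SFT launch with OpenRLHF's standard train_sft entry point might look like the following (the dataset path, keys, and hyperparameters are illustrative placeholders, not the repository's actual scripts or settings):

deepspeed --module openrlhf.cli.train_sft \
    --pretrain meta-llama/Llama-3.1-8B-Instruct \
    --dataset path/to/long_cot_sft_data \
    --input_key prompt \
    --output_key response \
    --apply_chat_template \
    --max_len 16384 \
    --train_batch_size 64 \
    --micro_train_batch_size 2 \
    --max_epochs 1 \
    --learning_rate 5e-6 \
    --zero_stage 3 \
    --bf16 \
    --save_path ./checkpoints/think-rm-sft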


3. Rule-based RL

The implementation is based on verl.

3.1 Setup

Install the necessary dependencies

pip install -r requirements.txt

or follow the setup instructions from verl.

3.2 Usage

Choose the appropriate training script based on the output type (binary or multiclass):
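As a minimal sketch, a rule-based RL run with verl's main_ppo entry point and Hydra-style overrides might be launched as follows (data paths, model path, and hyperparameters are illustrative placeholders, and the rule-based reward wiring specific to this repository is omitted):

python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=path/to/train.parquet \
    data.val_files=path/to/val.parquet \
    data.max_prompt_length=4096 \
    data.max_response_length=16384 \
    actor_rollout_ref.model.path=path/to/warmup_sft_checkpoint \
    actor_rollout_ref.rollout.name=vllm \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1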


4. Pairwise RLHF for GenRMs

The implementation is based on verl.

4.1 Setup

See 3.1 Setup.

4.2 Usage

This setup assumes two nodes:

  • Node 1: vLLM-based GenRM inference server
  • Node 2: RL training

Step 1: Launch GenRM vLLM Servers
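A minimal sketch of starting an OpenAI-compatible vLLM server for the GenRM on Node 1 (the model path, port, and parallelism are illustrative placeholders; the repository's own launch script may differ):

python3 -m vllm.entrypoints.openai.api_server \
    --model path/to/Binary-Think-RM-8B \
    --port 8000 \
    --tensor-parallel-size 4 \
    --max-model-len 16384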


Step 2: Run RLHF Training

Choose the appropriate training script based on the model:

Make sure to update the vLLM server IP address in each script.
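For example, if the chosen script reads the server address from an environment variable (the variable name below is hypothetical; check how the actual script references the server):

# Hypothetical variable name; adjust to match how the training script locates the GenRM server.
export GENRM_SERVER_URL="http://<node1-ip>:8000/v1"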


📄 Citation

If you find this work helpful, please cite our paper:

@article{hong2025think,
  title={Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models},
  author={Hong, Ilgee and Yu, Changlong and Qiu, Liang and Yan, Weixiang and Xu, Zhenghao and Jiang, Haoming and Zhang, Qingru and Lu, Qin and Liu, Xin and Zhang, Chao and others},
  journal={arXiv preprint arXiv:2505.16265},
  year={2025}
}
