Budget-Aware Test-Time Scaling via Discriminative Verification

This repository contains the implementation for the paper "Budget-Aware Test-Time Scaling via Discriminative Verification".

📃 [Paper] • 📌 [Blog] • 💻 [GitHub] • 🤗 [Hugging Face]

Installation

git clone https://github.com/wang-research-lab/verification.git
cd verification

conda create -n verification python=3.10
conda activate verification

pip install -e . # will install `verification` and various dependencies
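
To sanity-check the install, you can try importing the package (this assumes the editable install exposes a top-level `verification` module; the importable name may differ from the distribution name):

python -c "import verification"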

Usage

1. Generate Candidate Solutions

Use gen_trajectories.py to generate candidate solutions via vLLM. To sample a solution from "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B" for every training problem:

python scripts/gen_trajectories.py \
    --model_name "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B" \
    --save_path "data/deepseek-r1-1.5b-verification-training-problems-responses.jsonl" \
    --num_gpus 8 \
    --dataset_name "verification-training-problems"

gen_trajectories.py can also generate candidate solutions for evaluation datasets (aime2024, aime2025, livebench-math, and gpqa):

python scripts/gen_trajectories.py \
    --model_name "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B" \
    --save_path "data/deepseek-32b-aime2024-responses.jsonl" \
    --num_gpus 8 \
    --tp_size 8 \
    --n_rollouts 128 \
    --dataset_name "aime2024"
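
Based on the flag names, --n_rollouts presumably controls how many candidate solutions are sampled per problem and --tp_size sets vLLM's tensor-parallel degree; under that reading, the example above draws 128 candidates per AIME 2024 problem across 8 GPUs.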

Alternatively, you can use gen_trajectories.py with an OpenAI-compatible API instead of vLLM:

python scripts/gen_trajectories.py \
    --model_name "deepseek-ai/DeepSeek-R1" \
    --save_path "data/deepseek-r1-verification-training-problems-responses.jsonl" \
    --dataset_name "verification-training-problems" \
    --use_api True \
    --endpoint "https://api.together.xyz/v1" \
    --api_key "Your-Together-API-Key" \
    --concurrency_limit 20

2. Train Discriminative Verifier

Train a 1.5B-parameter discriminative verifier using accelerate with FSDP:

accelerate launch --config_file configs/fsdp_8gpu.yaml scripts/train_ranking.py \
    --model_name "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B" \
    --dataset_name "verification-training-data" \
    --ckpt_path "outputs/verification-1.5b" \
    --per_device_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --lr 5e-5
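
With the 8-GPU FSDP config, a per-device batch size of 1, and 4 gradient-accumulation steps, the effective batch size works out to 1 × 8 × 4 = 32 (assuming the config launches one process per GPU).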

3. Run Verification on Evaluation Dataset

Use run_judge_hf.py to score candidate solutions with the trained verifier:

python scripts/run_judge_hf.py \
    --model_name "WangResearchLab/verification-1.5b" \
    --dataset_name "verification-evaluation-data" \
    --dataset_split "validation" \
    --save_path "evals/verification-1.5b/validation-eval.jsonl" \
    --num_gpus 8
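
The exact fields written by run_judge_hf.py depend on the script, so a quick, hypothetical way to inspect the verifier's output schema is to pretty-print the first scored record from the path used above:

head -n 1 evals/verification-1.5b/validation-eval.jsonl | python -m json.tool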

Citation

@article{montgomery2025budget,
  title={Budget-Aware Test-Time Scaling via Discriminative Verification},
  author={Montgomery, Kyle and Tan, Sijun and Chen, Yuqi and Zhuang, Siyuan and Zhang, Tianjun and Popa, Raluca Ada and Wang, Chenguang},
  journal={arXiv preprint arXiv:2510.14913},
  year={2025}
}
