This repository contains the implementation for the paper "Budget-Aware Test-Time Scaling via Discriminative Verification".
📃 [Paper] • 📌 [Blog] • 💻 [GitHub] • 🤗 [Hugging Face]

```bash
git clone https://github.com/wang-research-lab/verification.git
cd verification
conda create -n verification python=3.10
conda activate verification
pip install -e .  # will install `verification` and various dependencies
```

Use `gen_trajectories.py` to generate candidate solutions via vLLM. To sample a solution from `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B` for every training problem:

```bash
python scripts/gen_trajectories.py \
--model_name "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B" \
--save_path "data/deepseek-r1-1.5b-verification-training-problems-responses.jsonl" \
--num_gpus 8 \
--dataset_name "verification-training-problems"
```
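
Once generation finishes, the save path holds a JSONL file with one record per sampled solution. The snippet below is a minimal sketch for sanity-checking that file; the `problem` field name is an assumption and may differ from the script's actual output schema.

```python
import json
from collections import Counter

# Count how many sampled solutions were saved per training problem.
# NOTE: "problem" is an assumed field name; check the schema actually
# written by scripts/gen_trajectories.py.
path = "data/deepseek-r1-1.5b-verification-training-problems-responses.jsonl"
counts = Counter()
with open(path) as f:
    for line in f:
        record = json.loads(line)
        counts[record["problem"]] += 1

print(f"{len(counts)} problems, {sum(counts.values())} sampled solutions")
```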

`gen_trajectories.py` can also generate candidate solutions for the evaluation datasets (`aime2024`, `aime2025`, `livebench-math`, and `gpqa`):

```bash
python scripts/gen_trajectories.py \
--model_name "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B" \
--save_path "data/deepseek-32b-aime2024-responses.jsonl" \
--num_gpus 8 \
--tp_size 8 \
--n_rollouts 128 \
--dataset_name "aime2024"
```
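
With 128 rollouts per problem, you can estimate the base model's pass@k before any verification is applied. The sketch below uses the standard unbiased pass@k estimator; it assumes each saved record carries a `problem` field and a boolean `correct` field, which are assumptions about the output schema rather than documented fields.

```python
import json
from collections import defaultdict
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Group per-rollout correctness by problem (field names are assumed).
rollouts = defaultdict(list)
with open("data/deepseek-32b-aime2024-responses.jsonl") as f:
    for line in f:
        record = json.loads(line)
        rollouts[record["problem"]].append(bool(record["correct"]))

for k in (1, 8, 32, 128):
    score = sum(pass_at_k(len(v), sum(v), k) for v in rollouts.values()) / len(rollouts)
    print(f"pass@{k}: {score:.3f}")
```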

Alternatively, you can use `gen_trajectories.py` with an OpenAI-compatible API instead of vLLM:

```bash
python scripts/gen_trajectories.py \
--model_name "deepseek-ai/DeepSeek-R1" \
--save_path "data/deepseek-r1-verification-training-problems-responses.jsonl" \
--dataset_name "verification-training-problems" \
--use_api True \
--endpoint "https://api.together.xyz/v1" \
--api_key "Your-Together-API-Key" \
--concurrency_limit 20
```
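
Under the hood, API-based sampling amounts to issuing concurrency-limited chat-completion requests against the endpoint. The sketch below illustrates that pattern with the `openai` client and an `asyncio.Semaphore`; it is not the actual logic in `gen_trajectories.py`, just a minimal standalone illustration.

```python
import asyncio
from openai import AsyncOpenAI  # pip install openai

# Concurrency-limited sampling against an OpenAI-compatible endpoint.
client = AsyncOpenAI(base_url="https://api.together.xyz/v1", api_key="Your-Together-API-Key")
semaphore = asyncio.Semaphore(20)  # mirrors --concurrency_limit 20

async def sample(problem: str) -> str:
    async with semaphore:
        response = await client.chat.completions.create(
            model="deepseek-ai/DeepSeek-R1",
            messages=[{"role": "user", "content": problem}],
        )
        return response.choices[0].message.content

async def main(problems: list[str]) -> list[str]:
    return list(await asyncio.gather(*(sample(p) for p in problems)))

# responses = asyncio.run(main(["What is 2 + 2?"]))
```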

Train a 1.5B-parameter discriminative verifier using `accelerate` with FSDP:

```bash
accelerate launch --config_file configs/fsdp_8gpu.yaml scripts/train_ranking.py \
--model_name "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B" \
--dataset_name "verification-training-data" \
--ckpt_path "outputs/verification-1.5b" \
--per_device_batch_size 1 \
--gradient_accumulation_steps 4 \
--lr 5e-5
```
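
`train_ranking.py` optimizes a ranking objective over candidate solutions. As a point of reference, the sketch below shows one common formulation of such an objective, a pairwise Bradley-Terry loss that pushes the verifier's scalar score for a correct solution above the score for an incorrect one; it is an illustration of the general technique, not necessarily the exact loss implemented in the script.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: -log sigmoid(s_pos - s_neg).

    pos_scores: verifier scores for correct candidate solutions, shape (B,)
    neg_scores: verifier scores for paired incorrect candidates, shape (B,)
    """
    return -F.logsigmoid(pos_scores - neg_scores).mean()

# Toy usage: in practice the scores come from the verifier's scalar head.
pos = torch.tensor([1.2, 0.3, 2.0])
neg = torch.tensor([0.5, 0.7, -1.0])
print(pairwise_ranking_loss(pos, neg))  # smaller when correct solutions score higher
```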

Use `run_judge_hf.py` to score candidate solutions with the trained verifier:

```bash
python scripts/run_judge_hf.py \
--model_name "WangResearchLab/verification-1.5b" \
--dataset_name "verification-evaluation-data" \
--dataset_split "validation" \
--save_path "evals/verification-1.5b/validation-eval.jsonl" \
--num_gpus 8
```
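
The resulting scores can then drive best-of-N selection: for each problem, keep the candidate the verifier ranks highest. A minimal sketch follows; the `problem`, `answer`, and `score` field names are assumptions about the file written by `run_judge_hf.py`, not documented fields.

```python
import json

# Select, per problem, the candidate with the highest verifier score.
# NOTE: "problem", "answer", and "score" are assumed field names.
best = {}
with open("evals/verification-1.5b/validation-eval.jsonl") as f:
    for line in f:
        record = json.loads(line)
        key = record["problem"]
        if key not in best or record["score"] > best[key]["score"]:
            best[key] = record

selected = {problem: record["answer"] for problem, record in best.items()}
print(f"Selected {len(selected)} answers via best-of-N verification")
```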

If you find this work useful, please cite:

```bibtex
@article{montgomery2025budget,
title={Budget-Aware Test-Time Scaling via Discriminative Verification},
author={Montgomery, Kyle and Tan, Sijun and Chen, Yuqi and Zhuang, Siyuan and Zhang, Tianjun and Popa, Raluca Ada and Wang, Chenguang},
journal={arXiv preprint arXiv:2510.14913},
year={2025}
}
```