Code for V₁-Infer, the pairwise self-verification algorithm from the paper:
V₁: Unifying Generation and Self-Verification for Parallel Reasoners
V₁-Infer is an inference-time algorithm that uses pairwise comparisons in a Swiss tournament to rank multiple candidate solutions generated by an LLM. Instead of scoring each solution independently (pointwise), the same LLM compares solutions head-to-head, which produces substantially more accurate verification.
- Generate N candidate solutions to a problem in parallel
- Compare candidates pairwise using the LLM as a judge (Swiss tournament with budget control)
- Rank candidates by margin-weighted win rate (μ)
- Select the top-ranked candidate
The Swiss tournament structure ensures efficient budget usage: early rounds establish connectivity (every candidate gets compared at least min_degree times), then subsequent rounds focus comparisons on candidates with similar scores.
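The two-phase schedule described above can be sketched as follows. This is a simplified illustration, not the repository's implementation: `judge(a, b)` stands in for the LLM comparison call and is assumed to return a signed margin in [-1, 1] (positive if `a` beats `b`), and `tau` is the weight floor from the tournament configuration.

```python
import random
from collections import defaultdict

def swiss_rank(candidates, judge, budget, min_degree=2, tau=0.1):
    """Rank candidates by margin-weighted win rate (mu) under a judge-call budget."""
    if len(candidates) < 2:
        return candidates[0], {candidates[0]: 0.5}

    wins = defaultdict(float)       # margin-weighted win mass per candidate
    weight = defaultdict(float)     # total comparison weight per candidate
    opponents = defaultdict(set)    # distinct opponents seen so far
    comparisons = 0

    def play(a, b):
        nonlocal comparisons
        m = judge(a, b)             # signed margin from the judge
        w = max(abs(m), tau)        # floor the weight at tau so near-ties still count
        if m > 0:
            wins[a] += w
        elif m < 0:
            wins[b] += w
        weight[a] += w
        weight[b] += w
        opponents[a].add(b)
        opponents[b].add(a)
        comparisons += 1

    def mu(c):
        return wins[c] / weight[c] if weight[c] else 0.5

    # Phase 1: connectivity -- every candidate meets at least min_degree opponents.
    for a in candidates:
        while len(opponents[a]) < min_degree and comparisons < budget:
            fresh = [c for c in candidates if c != a and c not in opponents[a]]
            if not fresh:
                break
            play(a, random.choice(fresh))

    # Phase 2: Swiss rounds -- pair neighbors in the current ranking.
    while comparisons < budget:
        ranking = sorted(candidates, key=mu, reverse=True)
        for i in range(0, len(ranking) - 1, 2):
            if comparisons >= budget:
                break
            play(ranking[i], ranking[i + 1])

    scores = {c: mu(c) for c in candidates}
    return max(candidates, key=scores.get), scores
```

In the real pipeline the judge call is an LLM request and pairing is batched per round; the sketch keeps only the bookkeeping: comparison weights are floored at `tau`, and μ is the weighted fraction of wins.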
```bash
pip install numpy polars pandas hydra-core omegaconf transformers \
  sglang openai tqdm termcolor tabulate sympy
```

All evaluation datasets are included in `data/`:
| Dataset | File | Domain |
|---|---|---|
| LiveCodeBench-v5 | `data/test_livecodebench.parquet` | Code (auto-downloaded from HuggingFace) |
| LiveCodeBench-v6 | `data/test_livecodebench_v6.parquet` | Code |
| CodeContests | `data/test_code_contests.parquet` | Code |
| AIME 2025 | `data/aime_2025.parquet` | Math |
| HMMT Feb 2025 | `data/hmmt_feb_2025.parquet` | Math |
```bash
# GPT-OSS-20B on LiveCodeBench-v6, 16 candidates, budget=3x
./eval/scripts/run_e2e_pairwise.sh "openai/gpt-oss-20b" "" "test_livecodebench_v6" \
  0 -1 3.0 livecodebench_v6_prompt_default true medium 0.6 false livecodebench 1234 8 0.1 16 false true

# Qwen3-4B-Instruct on LiveCodeBench-v6, 16 candidates, budget=3x
./eval/scripts/run_e2e_pairwise.sh "Qwen/Qwen3-4B-Instruct-2507" "" "test_livecodebench_v6" \
  0 -1 3.0 livecodebench_v6_prompt_instruct false medium 0.6 false livecodebench 1234 8 0.1 16 false true
```

```bash
# GPT-OSS-20B on AIME 2025, 16 candidates, budget=3x
./eval/scripts/run_e2e_pairwise_math.sh "openai/gpt-oss-20b" "" "aime_2025" \
  0 -1 3.0 true medium 1.0 false 1234 8 0.1 16 false

# Qwen3-4B-Instruct on AIME 2025, 16 candidates, budget=3x
./eval/scripts/run_e2e_pairwise_math.sh "Qwen/Qwen3-4B-Instruct-2507" "" "aime_2025" \
  0 -1 3.0 false medium 1.0 false 1234 8 0.1 16 false
```

Pointwise verification scores each solution independently (no pairwise comparisons) and serves as the baseline against which V₁-Infer is compared.
```bash
# GPT-OSS-20B on LiveCodeBench-v6, 16 candidates, pointwise (fresh generation)
./eval/scripts/run_e2e_pointwise.sh "openai/gpt-oss-20b" "" "test_livecodebench_v6" \
  0 -1 livecodebench_v6_prompt_default true medium 0.6 false livecodebench 1234 16 false true

# Qwen3-4B-Instruct on LiveCodeBench-v6, 16 candidates, pointwise (fresh generation)
./eval/scripts/run_e2e_pointwise.sh "Qwen/Qwen3-4B-Instruct-2507" "" "test_livecodebench_v6" \
  0 -1 livecodebench_v6_prompt_instruct false medium 0.6 false livecodebench 1234 16 false true
```

```bash
# GPT-OSS-20B on AIME 2025, 16 candidates, pointwise (fresh generation)
./eval/scripts/run_e2e_pointwise_math.sh "openai/gpt-oss-20b" "" "aime_2025" \
  0 -1 true medium 1.0 false 1234 16 false

# Qwen3-4B-Instruct on AIME 2025, 16 candidates, pointwise (fresh generation)
./eval/scripts/run_e2e_pointwise_math.sh "Qwen/Qwen3-4B-Instruct-2507" "" "aime_2025" \
  0 -1 false medium 1.0 false 1234 16 false
```

To skip generation and run only pointwise verification on candidates from a previous pairwise run, pass the pairwise result parquet as the last argument:
```bash
# Pointwise on Qwen LCB-v6, reusing generations from a pairwise run
./eval/scripts/run_e2e_pointwise.sh "Qwen/Qwen3-4B-Instruct-2507" "" "test_livecodebench_v6" \
  0 -1 livecodebench_v6_prompt_instruct false medium 0.6 false livecodebench 1234 16 false true \
  ./results/Qwen3-4B-Instruct-2507/test_livecodebench_v6_n16_budget3_0_seed1234/result-00000-of-00001.parquet

# Pointwise on GPT-OSS AIME 2025, reusing generations from a pairwise run
./eval/scripts/run_e2e_pointwise_math.sh "openai/gpt-oss-20b" "" "aime_2025" \
  0 -1 true medium 1.0 false 1234 16 false \
  ./results/gpt-oss-20b/aime_2025_n16_budget3_0_seed1234/result-00000-of-00001.parquet
```

This loads the `responses` column from the parquet file, skips candidate generation entirely, and runs only the pointwise verification step.
See `scripts/run_e2e_commands_n16.sh` for more commands.
Results from running the commands above with N=16 candidates (3x H100 GPUs via Modal):
| Model | Benchmark | Method | Budget | Accuracy |
|---|---|---|---|---|
| GPT-OSS-20B | LiveCodeBench-v6 | Pass@1 | -- | 61.4% |
| | | Pointwise | N | 71.8% |
| | | V₁-Infer (pairwise) | 3N | 76.3% |
| GPT-OSS-20B | AIME 2025 | Pass@1 | -- | 71.9% |
| | | Pointwise | N | 83.3% |
| | | V₁-Infer (pairwise) | 3N | 86.7% |
| Qwen3-4B-Instruct | LiveCodeBench-v6 | Pass@1 | -- | 35.4% |
| | | Pointwise | N | 38.9% |
| | | V₁-Infer (pairwise) | 3N | 43.5% |
| Qwen3-4B-Instruct | AIME 2025 | Pass@1 | -- | 45.4% |
| | | Pointwise | N | 53.3% |
| | | V₁-Infer (pairwise) | 3N | 63.3% |
- Pass@1: Average correctness of a single random sample
- Pointwise: Each solution scored independently (1-10) by the same LLM, best score selected
- V₁-Infer (pairwise): Pairwise self-verification with Swiss tournament
- Budget: Number of LLM verification calls (as a multiple of N candidates)
Note: Results above are from a single seed and may vary across runs. We recommend running with 3 seeds and averaging for more stable estimates.
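For concreteness, the budget accounting at N = 16 works out as follows (verification calls only; candidate generation is counted separately):

```python
N = 16                         # candidate solutions per problem
pointwise_calls = N            # one independent scoring call per candidate
pairwise_calls = int(3.0 * N)  # BUDGET_MULTIPLIER=3.0 -> 3N judge comparisons = 48
```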
```bash
./eval/scripts/run_e2e_pairwise.sh <MODEL_PATH> <MODEL_DIR_NAME_OVERRIDE> <DATASET> \
  <START_IDX> <TOTAL_SAMPLES> <BUDGET_MULTIPLIER> <CODE_PROMPT_TEMPLATE> <USE_GPTOSS> \
  <REASONING_EFFORT> <TEMPERATURE> <LOAD_CACHE> <DATA_SOURCE> <SEED> <MAX_WINDOW> \
  <TAU> <N_PASSES> <FLEXIBLE_THINK> <USE_SGLANG>
```
| Argument | Description | Default |
|---|---|---|
| `MODEL_PATH` | HuggingFace model ID or local path | `openai/gpt-oss-20b` |
| `DATASET` | `test_livecodebench`, `test_livecodebench_v6`, `code_contests` | `test_livecodebench_v6` |
| `N_PASSES` | Number of candidate solutions to generate | 16 |
| `BUDGET_MULTIPLIER` | Verification budget as multiple of N (e.g., 3.0 = 3N pairwise calls) | 3.0 |
| `CODE_PROMPT_TEMPLATE` | `livecodebench_v6_prompt_default` (GPT-OSS) or `livecodebench_v6_prompt_instruct` (Qwen) | `livecodebench_v6_prompt_default` |
| `SEED` | Random seed | 1234 |
```bash
./eval/scripts/run_e2e_pairwise_math.sh <MODEL_PATH> <MODEL_DIR_NAME_OVERRIDE> <DATASET> \
  <START_IDX> <TOTAL_SAMPLES> <BUDGET_MULTIPLIER> <USE_GPTOSS> <REASONING_EFFORT> \
  <TEMPERATURE> <LOAD_CACHE> <SEED> <MAX_WINDOW> <TAU> <N_PASSES> <FLEXIBLE_THINK>
```
| Argument | Description | Default |
|---|---|---|
| `MODEL_PATH` | HuggingFace model ID or local path | `openai/gpt-oss-20b` |
| `DATASET` | `aime_2025`, `hmmt_feb_2025` | `aime_2025` |
| `N_PASSES` | Number of candidate solutions | 16 |
| `BUDGET_MULTIPLIER` | Verification budget as multiple of N | 3.0 |
| `SEED` | Random seed | 1234 |
| `USE_GPTOSS` | Enable GPT-OSS reasoning features | `false` |
| Parameter | Description | Default |
|---|---|---|
| `coverage_strategy` | Pair selection strategy: `min_degree`, `swiss_parallel`, `random` | `min_degree` |
| `min_degree` | Minimum number of distinct opponents per candidate | 2 |
| `max_window` | Sliding window size for Swiss pairing | 8 |
| `tau` | Minimum weight floor for margin-weighted aggregation | 0.1 |
| `load_cache` | Load cached LLM responses from `cache/` to skip redundant API calls (set to `true` to reuse results from a previous run) | `false` |
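As a toy illustration of how `tau` enters the margin-weighted aggregation (an assumed reading of the parameter, not repository code): each comparison contributes weight `max(|margin|, tau)`, so near-ties still count toward the denominator, and μ is the weighted fraction of wins.

```python
tau = 0.1
margins = [0.9, 0.05, -0.4]  # one candidate's comparisons: two wins (one a near-tie), one loss
weights = [max(abs(m), tau) for m in margins]         # [0.9, 0.1, 0.4] -- near-tie floored to tau
win_mass = sum(w for m, w in zip(margins, weights) if m > 0)
mu = win_mass / sum(weights)                          # weighted win rate in [0, 1]
```

Without the floor, a 0.05-margin win would be almost invisible in μ; with `tau = 0.1` it still registers.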
```
├── data/                              # Evaluation datasets
│   ├── test_livecodebench.parquet     # LiveCodeBench-v5
│   ├── test_livecodebench_v6.parquet  # LiveCodeBench-v6
│   ├── test_code_contests.parquet     # CodeContests
│   ├── aime_2025.parquet              # AIME 2025
│   └── hmmt_feb_2025.parquet          # HMMT Feb 2025
├── eval/
│   ├── eval_e2e_pairwise.py           # End-to-end pairwise pipeline
│   ├── eval_e2e_pointwise.py          # End-to-end pointwise pipeline
│   ├── verify_pairwise.py             # Swiss tournament verification
│   ├── verify_pointwise.py            # Pointwise verification
│   ├── verify_utils.py                # Verification prompts, parsing, aggregation
│   ├── eval_results_parallel.py       # Parallel solution grading
│   ├── utils.py                       # Shared utilities (API, caching, etc.)
│   └── scripts/
│       ├── run_e2e_pairwise.sh        # Pairwise code evaluation
│       ├── run_e2e_pairwise_math.sh   # Pairwise math evaluation
│       ├── run_e2e_pointwise.sh       # Pointwise code evaluation
│       └── run_e2e_pointwise_math.sh  # Pointwise math evaluation
├── rewards/                           # Reward/grading functions
│   ├── rl_reward.py                   # Main reward dispatcher
│   ├── code_reward.py                 # Code execution grading
│   ├── math_reward.py                 # Math answer grading
│   └── code_utils/                    # Per-benchmark execution
├── config/
│   └── generation.yaml                # Hydra configuration defaults
└── scripts/
    ├── run_e2e_commands_n16.sh        # N=16 evaluation commands
    └── modal_eval_commands_n16.sh     # N=16 commands for Modal
```
We thank Modal for providing compute for the evaluations in this repository.
```bibtex
@misc{singh2026v1unifyinggenerationselfverification,
  title={$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners},
  author={Harman Singh and Xiuyu Li and Kusha Sareen and Monishwaran Maheswaran and Sijun Tan and Xiaoxia Wu and Junxiong Wang and Alpay Ariyak and Qingyang Wu and Samir Khaki and Rishabh Tiwari and Long Lian and Yucheng Lu and Boyi Li and Alane Suhr and Ben Athiwaratkun and Kurt Keutzer},
  year={2026},
  eprint={2603.04304},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2603.04304},
}
```