Pairwise Self-Verification

Code for V₁-Infer, the pairwise self-verification algorithm from the paper:

V₁: Unifying Generation and Self-Verification for Parallel Reasoners

Project Page | Paper

V₁-Infer

Overview

V₁-Infer is an inference-time algorithm that uses pairwise comparisons in a Swiss tournament to rank multiple candidate solutions generated by an LLM. Instead of scoring each solution independently (pointwise), the same LLM compares solutions head-to-head, which produces substantially more accurate verification.

Algorithm

  1. Generate N candidate solutions to a problem in parallel
  2. Compare candidates pairwise using the LLM as a judge (Swiss tournament with budget control)
  3. Rank candidates by margin-weighted win rate (μ)
  4. Select the top-ranked candidate

The Swiss tournament structure ensures efficient budget usage: early rounds establish connectivity (every candidate gets compared at least min_degree times), then subsequent rounds focus comparisons on candidates with similar scores.
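The two phases above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the `judge` callback, the random pairing heuristics, and the margin convention (`judge(a, b)` returns a value in [-1, 1], positive when `a` wins) are all assumptions made for the sketch.

```python
import random
from collections import defaultdict

def swiss_rank(candidates, judge, budget, min_degree=2, max_window=8):
    """Rank candidates by margin-weighted win rate (mu) under a comparison budget.

    `judge(a, b)` is a hypothetical LLM judge returning a margin in [-1, 1],
    positive if `a` beats `b`.
    """
    n = len(candidates)
    wins = defaultdict(float)   # margin-weighted win totals per candidate
    games = defaultdict(int)    # number of comparisons per candidate
    calls = 0

    def play(i, j):
        nonlocal calls
        margin = judge(candidates[i], candidates[j])
        wins[i] += max(margin, 0.0)
        wins[j] += max(-margin, 0.0)
        games[i] += 1
        games[j] += 1
        calls += 1

    # Phase 1: connectivity -- every candidate meets at least min_degree opponents.
    while calls < budget and min(games[i] for i in range(n)) < min_degree:
        i = min(range(n), key=lambda k: games[k])
        play(i, random.choice([k for k in range(n) if k != i]))

    # Phase 2: Swiss rounds -- pair candidates whose current scores are close.
    while calls < budget:
        order = sorted(range(n), key=lambda k: -(wins[k] / max(games[k], 1)))
        i = random.randrange(n)
        pos = order.index(i)
        window = [k for k in order[max(0, pos - max_window // 2):
                                   pos + max_window // 2 + 1] if k != i]
        play(i, random.choice(window))

    mu = {i: wins[i] / max(games[i], 1) for i in range(n)}
    return max(range(n), key=lambda i: mu[i])
```

Because μ divides margin-weighted wins by games played, a candidate compared fewer times is not penalized for having a smaller schedule.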

Installation

pip install numpy polars pandas hydra-core omegaconf transformers \
    sglang openai tqdm termcolor tabulate sympy

Datasets

All evaluation datasets are included in data/:

| Dataset | File | Domain |
|---|---|---|
| LiveCodeBench-v5 | data/test_livecodebench.parquet | Code (auto-downloaded from HuggingFace) |
| LiveCodeBench-v6 | data/test_livecodebench_v6.parquet | Code |
| CodeContests | data/test_code_contests.parquet | Code |
| AIME 2025 | data/aime_2025.parquet | Math |
| HMMT Feb 2025 | data/hmmt_feb_2025.parquet | Math |

Quick Start

Code Benchmarks

# GPT-OSS-20B on LiveCodeBench-v6, 16 candidates, budget=3x
./eval/scripts/run_e2e_pairwise.sh "openai/gpt-oss-20b" "" "test_livecodebench_v6" \
    0 -1 3.0 livecodebench_v6_prompt_default true medium 0.6 false livecodebench 1234 8 0.1 16 false true

# Qwen3-4B-Instruct on LiveCodeBench-v6, 16 candidates, budget=3x
./eval/scripts/run_e2e_pairwise.sh "Qwen/Qwen3-4B-Instruct-2507" "" "test_livecodebench_v6" \
    0 -1 3.0 livecodebench_v6_prompt_instruct false medium 0.6 false livecodebench 1234 8 0.1 16 false true

Math Benchmarks

# GPT-OSS-20B on AIME 2025, 16 candidates, budget=3x
./eval/scripts/run_e2e_pairwise_math.sh "openai/gpt-oss-20b" "" "aime_2025" \
    0 -1 3.0 true medium 1.0 false 1234 8 0.1 16 false

# Qwen3-4B-Instruct on AIME 2025, 16 candidates, budget=3x
./eval/scripts/run_e2e_pairwise_math.sh "Qwen/Qwen3-4B-Instruct-2507" "" "aime_2025" \
    0 -1 3.0 false medium 1.0 false 1234 8 0.1 16 false

Pointwise Verification (Baseline)

Pointwise verification scores each solution independently (no pairwise comparisons). This serves as a baseline for comparison with V₁-Infer.
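Stripped to its essentials, pointwise selection is score-then-argmax. The sketch below assumes a hypothetical `score` judge returning an integer from 1 to 10; the actual prompting and parsing live elsewhere in the repository.

```python
def pointwise_select(candidates, score):
    """Pick the best candidate by independent scoring.

    `score(c)` is a hypothetical judge call returning an integer 1-10.
    Ties break toward the earliest candidate, following Python's `max`.
    """
    return max(candidates, key=score)
```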

Code Benchmarks (Pointwise)

# GPT-OSS-20B on LiveCodeBench-v6, 16 candidates, pointwise (fresh generation)
./eval/scripts/run_e2e_pointwise.sh "openai/gpt-oss-20b" "" "test_livecodebench_v6" \
    0 -1 livecodebench_v6_prompt_default true medium 0.6 false livecodebench 1234 16 false true

# Qwen3-4B-Instruct on LiveCodeBench-v6, 16 candidates, pointwise (fresh generation)
./eval/scripts/run_e2e_pointwise.sh "Qwen/Qwen3-4B-Instruct-2507" "" "test_livecodebench_v6" \
    0 -1 livecodebench_v6_prompt_instruct false medium 0.6 false livecodebench 1234 16 false true

Math Benchmarks (Pointwise)

# GPT-OSS-20B on AIME 2025, 16 candidates, pointwise (fresh generation)
./eval/scripts/run_e2e_pointwise_math.sh "openai/gpt-oss-20b" "" "aime_2025" \
    0 -1 true medium 1.0 false 1234 16 false

# Qwen3-4B-Instruct on AIME 2025, 16 candidates, pointwise (fresh generation)
./eval/scripts/run_e2e_pointwise_math.sh "Qwen/Qwen3-4B-Instruct-2507" "" "aime_2025" \
    0 -1 false medium 1.0 false 1234 16 false

Reusing Generations from a Pairwise Run

To skip generation and only run pointwise verification on candidates from a previous pairwise run, pass the pairwise result parquet as the last argument:

# Pointwise on Qwen LCB-v6, reusing generations from a pairwise run
./eval/scripts/run_e2e_pointwise.sh "Qwen/Qwen3-4B-Instruct-2507" "" "test_livecodebench_v6" \
    0 -1 livecodebench_v6_prompt_instruct false medium 0.6 false livecodebench 1234 16 false true \
    ./results/Qwen3-4B-Instruct-2507/test_livecodebench_v6_n16_budget3_0_seed1234/result-00000-of-00001.parquet

# Pointwise on GPT-OSS AIME 2025, reusing generations from a pairwise run
./eval/scripts/run_e2e_pointwise_math.sh "openai/gpt-oss-20b" "" "aime_2025" \
    0 -1 true medium 1.0 false 1234 16 false \
    ./results/gpt-oss-20b/aime_2025_n16_budget3_0_seed1234/result-00000-of-00001.parquet

This loads the responses column from the parquet file and skips candidate generation entirely, running only the pointwise verification step.

See scripts/run_e2e_commands_n16.sh for more commands.

Expected Results

Results from running the commands above with N=16 candidates (3x H100 GPUs via Modal):

| Model | Benchmark | Method | Budget | Accuracy |
|---|---|---|---|---|
| GPT-OSS-20B | LiveCodeBench-v6 | Pass@1 | -- | 61.4% |
| GPT-OSS-20B | LiveCodeBench-v6 | Pointwise | N | 71.8% |
| GPT-OSS-20B | LiveCodeBench-v6 | V₁-Infer (pairwise) | 3N | 76.3% |
| GPT-OSS-20B | AIME 2025 | Pass@1 | -- | 71.9% |
| GPT-OSS-20B | AIME 2025 | Pointwise | N | 83.3% |
| GPT-OSS-20B | AIME 2025 | V₁-Infer (pairwise) | 3N | 86.7% |
| Qwen3-4B-Instruct | LiveCodeBench-v6 | Pass@1 | -- | 35.4% |
| Qwen3-4B-Instruct | LiveCodeBench-v6 | Pointwise | N | 38.9% |
| Qwen3-4B-Instruct | LiveCodeBench-v6 | V₁-Infer (pairwise) | 3N | 43.5% |
| Qwen3-4B-Instruct | AIME 2025 | Pass@1 | -- | 45.4% |
| Qwen3-4B-Instruct | AIME 2025 | Pointwise | N | 53.3% |
| Qwen3-4B-Instruct | AIME 2025 | V₁-Infer (pairwise) | 3N | 63.3% |
  • Pass@1: average correctness of a single random sample
  • Pointwise: each solution scored independently on a 1-10 scale by the same LLM; the highest-scoring solution is selected
  • V₁-Infer (pairwise): pairwise self-verification with a Swiss tournament
  • Budget: number of LLM verification calls, expressed as a multiple of the N candidates

Note: Results above are from a single seed and may vary across runs. We recommend running with 3 seeds and averaging for more stable estimates.

Code Evaluation Arguments

./eval/scripts/run_e2e_pairwise.sh <MODEL_PATH> <MODEL_DIR_NAME_OVERRIDE> <DATASET> \
    <START_IDX> <TOTAL_SAMPLES> <BUDGET_MULTIPLIER> <CODE_PROMPT_TEMPLATE> <USE_GPTOSS> \
    <REASONING_EFFORT> <TEMPERATURE> <LOAD_CACHE> <DATA_SOURCE> <SEED> <MAX_WINDOW> \
    <TAU> <N_PASSES> <FLEXIBLE_THINK> <USE_SGLANG>
| Argument | Description | Default |
|---|---|---|
| MODEL_PATH | HuggingFace model ID or local path | openai/gpt-oss-20b |
| DATASET | test_livecodebench, test_livecodebench_v6, code_contests | test_livecodebench_v6 |
| N_PASSES | Number of candidate solutions to generate | 16 |
| BUDGET_MULTIPLIER | Verification budget as a multiple of N (e.g., 3.0 = 3N pairwise calls) | 3.0 |
| CODE_PROMPT_TEMPLATE | livecodebench_v6_prompt_default (GPT-OSS) or livecodebench_v6_prompt_instruct (Qwen) | livecodebench_v6_prompt_default |
| SEED | Random seed | 1234 |

Math Evaluation Arguments

./eval/scripts/run_e2e_pairwise_math.sh <MODEL_PATH> <MODEL_DIR_NAME_OVERRIDE> <DATASET> \
    <START_IDX> <TOTAL_SAMPLES> <BUDGET_MULTIPLIER> <USE_GPTOSS> <REASONING_EFFORT> \
    <TEMPERATURE> <LOAD_CACHE> <SEED> <MAX_WINDOW> <TAU> <N_PASSES> <FLEXIBLE_THINK>
| Argument | Description | Default |
|---|---|---|
| MODEL_PATH | HuggingFace model ID or local path | openai/gpt-oss-20b |
| DATASET | aime_2025, hmmt_feb_2025 | aime_2025 |
| N_PASSES | Number of candidate solutions | 16 |
| BUDGET_MULTIPLIER | Verification budget as a multiple of N | 3.0 |
| SEED | Random seed | 1234 |
| USE_GPTOSS | Enable GPT-OSS reasoning features | false |

Key Parameters

| Parameter | Description | Default |
|---|---|---|
| coverage_strategy | Pair selection strategy: min_degree, swiss_parallel, random | min_degree |
| min_degree | Minimum number of distinct opponents per candidate | 2 |
| max_window | Sliding window size for Swiss pairing | 8 |
| tau | Minimum weight floor for margin-weighted aggregation | 0.1 |
| load_cache | Load cached LLM responses from cache/ to skip redundant API calls (set to true to reuse results from a previous run) | false |
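The role of tau can be illustrated with a small sketch. This is a hypothetical aggregation (the exact formula is defined in the paper, not reproduced here): each comparison contributes a weight of at least tau, so near-tie judgments still cast a small vote rather than vanishing.

```python
def margin_weighted_win_rate(results, tau=0.1):
    """Aggregate one candidate's pairwise results into a score mu.

    `results` is a list of (won, margin) pairs, where margin in [0, 1] is the
    judge's confidence. Each comparison is weighted by max(margin, tau), so a
    near-tie comparison still contributes at least a tau-sized vote.
    """
    weights = [max(margin, tau) for _, margin in results]
    wins = sum(w for (won, _), w in zip(results, weights) if won)
    return wins / sum(weights) if weights else 0.0
```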

File Structure

├── data/                                                     # Evaluation datasets
│   ├── test_livecodebench.parquet                            # LiveCodeBench-v5
│   ├── test_livecodebench_v6.parquet                         # LiveCodeBench-v6
│   ├── test_code_contests.parquet                            # CodeContests
│   ├── aime_2025.parquet                                     # AIME 2025
│   └── hmmt_feb_2025.parquet                                 # HMMT Feb 2025
├── eval/
│   ├── eval_e2e_pairwise.py                                  # End-to-end pairwise pipeline
│   ├── eval_e2e_pointwise.py                                 # End-to-end pointwise pipeline
│   ├── verify_pairwise.py                                    # Swiss tournament verification
│   ├── verify_pointwise.py                                   # Pointwise verification
│   ├── verify_utils.py                                       # Verification prompts, parsing, aggregation
│   ├── eval_results_parallel.py                              # Parallel solution grading
│   ├── utils.py                                              # Shared utilities (API, caching, etc.)
│   └── scripts/
│       ├── run_e2e_pairwise.sh                               # Pairwise code evaluation
│       ├── run_e2e_pairwise_math.sh                          # Pairwise math evaluation
│       ├── run_e2e_pointwise.sh                              # Pointwise code evaluation
│       └── run_e2e_pointwise_math.sh                         # Pointwise math evaluation
├── rewards/                                                  # Reward/grading functions
│   ├── rl_reward.py                                          # Main reward dispatcher
│   ├── code_reward.py                                        # Code execution grading
│   ├── math_reward.py                                        # Math answer grading
│   └── code_utils/                                           # Per-benchmark execution
├── config/
│   └── generation.yaml                                       # Hydra configuration defaults
└── scripts/
    ├── run_e2e_commands_n16.sh                               # N=16 evaluation commands
    └── modal_eval_commands_n16.sh                            # N=16 commands for Modal

Acknowledgements

We thank Modal for providing compute for the evaluations in this repository.

Citation

@misc{singh2026v1unifyinggenerationselfverification,
      title={$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners},
      author={Harman Singh and Xiuyu Li and Kusha Sareen and Monishwaran Maheswaran and Sijun Tan and Xiaoxia Wu and Junxiong Wang and Alpay Ariyak and Qingyang Wu and Samir Khaki and Rishabh Tiwari and Long Lian and Yucheng Lu and Boyi Li and Alane Suhr and Ben Athiwaratkun and Kurt Keutzer},
      year={2026},
      eprint={2603.04304},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.04304},
}
