Pairwise Self-Verification

Code for V₁-Infer, the pairwise self-verification algorithm from the paper:

V₁: Unifying Generation and Self-Verification for Parallel Reasoners

Project Page | Paper

V₁-Infer

Overview

V₁-Infer is an inference-time algorithm that uses pairwise comparisons in a Swiss tournament to rank multiple candidate solutions generated by an LLM. Instead of scoring each solution independently (pointwise), the same LLM compares solutions head-to-head, which produces substantially more accurate verification.

Algorithm

  1. Generate N candidate solutions to a problem in parallel
  2. Compare candidates pairwise using the LLM as a judge (Swiss tournament with budget control)
  3. Rank candidates by margin-weighted win rate (μ)
  4. Select the top-ranked candidate

The Swiss tournament structure ensures efficient budget usage: early rounds establish connectivity (every candidate gets compared at least min_degree times), then subsequent rounds focus comparisons on candidates with similar scores.
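The two phases above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the `judge` callback, the random pairing heuristics, and the margin convention (`judge(a, b)` returns a value in [-1, 1], positive when `a` wins) are all assumptions made for the sketch.

```python
import random
from collections import defaultdict

def swiss_rank(candidates, judge, budget, min_degree=2, max_window=8):
    """Rank candidates by margin-weighted win rate (mu) under a comparison budget.

    `judge(a, b)` is a hypothetical LLM judge returning a margin in [-1, 1],
    positive if `a` beats `b`.
    """
    n = len(candidates)
    wins = defaultdict(float)   # margin-weighted win totals per candidate
    games = defaultdict(int)    # number of comparisons per candidate
    calls = 0

    def play(i, j):
        nonlocal calls
        margin = judge(candidates[i], candidates[j])
        wins[i] += max(margin, 0.0)
        wins[j] += max(-margin, 0.0)
        games[i] += 1
        games[j] += 1
        calls += 1

    # Phase 1: connectivity -- every candidate meets at least min_degree opponents.
    while calls < budget and min(games[i] for i in range(n)) < min_degree:
        i = min(range(n), key=lambda k: games[k])
        play(i, random.choice([k for k in range(n) if k != i]))

    # Phase 2: Swiss rounds -- pair candidates whose current scores are close.
    while calls < budget:
        order = sorted(range(n), key=lambda k: -(wins[k] / max(games[k], 1)))
        i = random.randrange(n)
        pos = order.index(i)
        window = [k for k in order[max(0, pos - max_window // 2):
                                   pos + max_window // 2 + 1] if k != i]
        play(i, random.choice(window))

    mu = {i: wins[i] / max(games[i], 1) for i in range(n)}
    return max(range(n), key=lambda i: mu[i])
```

Because μ divides margin-weighted wins by games played, a candidate compared fewer times is not penalized for having a smaller schedule.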

Installation

pip install numpy polars pandas hydra-core omegaconf transformers \
    sglang openai tqdm termcolor tabulate sympy

Datasets

All evaluation datasets are included in data/:

| Dataset | File | Domain |
|---|---|---|
| LiveCodeBench-v5 | data/test_livecodebench.parquet | Code (auto-downloaded from HuggingFace) |
| LiveCodeBench-v6 | data/test_livecodebench_v6.parquet | Code |
| CodeContests | data/test_code_contests.parquet | Code |
| AIME 2025 | data/aime_2025.parquet | Math |
| HMMT Feb 2025 | data/hmmt_feb_2025.parquet | Math |

Quick Start

Code Benchmarks

# GPT-OSS-20B on LiveCodeBench-v6, 16 candidates, budget=3x
./eval/scripts/run_e2e_pairwise.sh "openai/gpt-oss-20b" "" "test_livecodebench_v6" \
    0 -1 3.0 livecodebench_v6_prompt_default true medium 0.6 false livecodebench 1234 8 0.1 16 false true

# Qwen3-4B-Instruct on LiveCodeBench-v6, 16 candidates, budget=3x
./eval/scripts/run_e2e_pairwise.sh "Qwen/Qwen3-4B-Instruct-2507" "" "test_livecodebench_v6" \
    0 -1 3.0 livecodebench_v6_prompt_instruct false medium 0.6 false livecodebench 1234 8 0.1 16 false true

Math Benchmarks

# GPT-OSS-20B on AIME 2025, 16 candidates, budget=3x
./eval/scripts/run_e2e_pairwise_math.sh "openai/gpt-oss-20b" "" "aime_2025" \
    0 -1 3.0 true medium 1.0 false 1234 8 0.1 16 false

# Qwen3-4B-Instruct on AIME 2025, 16 candidates, budget=3x
./eval/scripts/run_e2e_pairwise_math.sh "Qwen/Qwen3-4B-Instruct-2507" "" "aime_2025" \
    0 -1 3.0 false medium 1.0 false 1234 8 0.1 16 false

Pointwise Verification (Baseline)

Pointwise verification scores each solution independently (no pairwise comparisons). This serves as a baseline for comparison with V₁-Infer.
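Stripped to its essentials, pointwise selection is score-then-argmax. The sketch below assumes a hypothetical `score` judge returning an integer from 1 to 10; the actual prompting and parsing live elsewhere in the repository.

```python
def pointwise_select(candidates, score):
    """Pick the best candidate by independent scoring.

    `score(c)` is a hypothetical judge call returning an integer 1-10.
    Ties break toward the earliest candidate, following Python's `max`.
    """
    return max(candidates, key=score)
```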

Code Benchmarks (Pointwise)

# GPT-OSS-20B on LiveCodeBench-v6, 16 candidates, pointwise (fresh generation)
./eval/scripts/run_e2e_pointwise.sh "openai/gpt-oss-20b" "" "test_livecodebench_v6" \
    0 -1 livecodebench_v6_prompt_default true medium 0.6 false livecodebench 1234 16 false true

# Qwen3-4B-Instruct on LiveCodeBench-v6, 16 candidates, pointwise (fresh generation)
./eval/scripts/run_e2e_pointwise.sh "Qwen/Qwen3-4B-Instruct-2507" "" "test_livecodebench_v6" \
    0 -1 livecodebench_v6_prompt_instruct false medium 0.6 false livecodebench 1234 16 false true

Math Benchmarks (Pointwise)

# GPT-OSS-20B on AIME 2025, 16 candidates, pointwise (fresh generation)
./eval/scripts/run_e2e_pointwise_math.sh "openai/gpt-oss-20b" "" "aime_2025" \
    0 -1 true medium 1.0 false 1234 16 false

# Qwen3-4B-Instruct on AIME 2025, 16 candidates, pointwise (fresh generation)
./eval/scripts/run_e2e_pointwise_math.sh "Qwen/Qwen3-4B-Instruct-2507" "" "aime_2025" \
    0 -1 false medium 1.0 false 1234 16 false

Reusing Generations from a Pairwise Run

To skip generation and only run pointwise verification on candidates from a previous pairwise run, pass the pairwise result parquet as the last argument:

# Pointwise on Qwen LCB-v6, reusing generations from a pairwise run
./eval/scripts/run_e2e_pointwise.sh "Qwen/Qwen3-4B-Instruct-2507" "" "test_livecodebench_v6" \
    0 -1 livecodebench_v6_prompt_instruct false medium 0.6 false livecodebench 1234 16 false true \
    ./results/Qwen3-4B-Instruct-2507/test_livecodebench_v6_n16_budget3_0_seed1234/result-00000-of-00001.parquet

# Pointwise on GPT-OSS AIME 2025, reusing generations from a pairwise run
./eval/scripts/run_e2e_pointwise_math.sh "openai/gpt-oss-20b" "" "aime_2025" \
    0 -1 true medium 1.0 false 1234 16 false \
    ./results/gpt-oss-20b/aime_2025_n16_budget3_0_seed1234/result-00000-of-00001.parquet

This loads the responses column from the parquet file and skips candidate generation entirely, running only the pointwise verification step.

See scripts/run_e2e_commands_n16.sh for more commands.

Expected Results

Results from running the commands above with N=16 candidates (3x H100 GPUs via Modal):

| Model | Benchmark | Method | Budget | Accuracy |
|---|---|---|---|---|
| GPT-OSS-20B | LiveCodeBench-v6 | Pass@1 | -- | 61.4% |
| GPT-OSS-20B | LiveCodeBench-v6 | Pointwise | N | 71.8% |
| GPT-OSS-20B | LiveCodeBench-v6 | V₁-Infer (pairwise) | 3N | 76.3% |
| GPT-OSS-20B | AIME 2025 | Pass@1 | -- | 71.9% |
| GPT-OSS-20B | AIME 2025 | Pointwise | N | 83.3% |
| GPT-OSS-20B | AIME 2025 | V₁-Infer (pairwise) | 3N | 86.7% |
| Qwen3-4B-Instruct | LiveCodeBench-v6 | Pass@1 | -- | 35.4% |
| Qwen3-4B-Instruct | LiveCodeBench-v6 | Pointwise | N | 38.9% |
| Qwen3-4B-Instruct | LiveCodeBench-v6 | V₁-Infer (pairwise) | 3N | 43.5% |
| Qwen3-4B-Instruct | AIME 2025 | Pass@1 | -- | 45.4% |
| Qwen3-4B-Instruct | AIME 2025 | Pointwise | N | 53.3% |
| Qwen3-4B-Instruct | AIME 2025 | V₁-Infer (pairwise) | 3N | 63.3% |
  • Pass@1: average correctness of a single random sample
  • Pointwise: each solution scored independently on a 1-10 scale by the same LLM; the highest-scoring solution is selected
  • V₁-Infer (pairwise): pairwise self-verification with a Swiss tournament
  • Budget: number of LLM verification calls, expressed as a multiple of the N candidates

Note: Results above are from a single seed and may vary across runs. We recommend running with 3 seeds and averaging for more stable estimates.

Code Evaluation Arguments

./eval/scripts/run_e2e_pairwise.sh <MODEL_PATH> <MODEL_DIR_NAME_OVERRIDE> <DATASET> \
    <START_IDX> <TOTAL_SAMPLES> <BUDGET_MULTIPLIER> <CODE_PROMPT_TEMPLATE> <USE_GPTOSS> \
    <REASONING_EFFORT> <TEMPERATURE> <LOAD_CACHE> <DATA_SOURCE> <SEED> <MAX_WINDOW> \
    <TAU> <N_PASSES> <FLEXIBLE_THINK> <USE_SGLANG>
| Argument | Description | Default |
|---|---|---|
| MODEL_PATH | HuggingFace model ID or local path | openai/gpt-oss-20b |
| DATASET | test_livecodebench, test_livecodebench_v6, code_contests | test_livecodebench_v6 |
| N_PASSES | Number of candidate solutions to generate | 16 |
| BUDGET_MULTIPLIER | Verification budget as a multiple of N (e.g., 3.0 = 3N pairwise calls) | 3.0 |
| CODE_PROMPT_TEMPLATE | livecodebench_v6_prompt_default (GPT-OSS) or livecodebench_v6_prompt_instruct (Qwen) | livecodebench_v6_prompt_default |
| SEED | Random seed | 1234 |

Math Evaluation Arguments

./eval/scripts/run_e2e_pairwise_math.sh <MODEL_PATH> <MODEL_DIR_NAME_OVERRIDE> <DATASET> \
    <START_IDX> <TOTAL_SAMPLES> <BUDGET_MULTIPLIER> <USE_GPTOSS> <REASONING_EFFORT> \
    <TEMPERATURE> <LOAD_CACHE> <SEED> <MAX_WINDOW> <TAU> <N_PASSES> <FLEXIBLE_THINK>
| Argument | Description | Default |
|---|---|---|
| MODEL_PATH | HuggingFace model ID or local path | openai/gpt-oss-20b |
| DATASET | aime_2025, hmmt_feb_2025 | aime_2025 |
| N_PASSES | Number of candidate solutions | 16 |
| BUDGET_MULTIPLIER | Verification budget as a multiple of N | 3.0 |
| SEED | Random seed | 1234 |
| USE_GPTOSS | Enable GPT-OSS reasoning features | false |

Key Parameters

| Parameter | Description | Default |
|---|---|---|
| coverage_strategy | Pair selection strategy: min_degree, swiss_parallel, random | min_degree |
| min_degree | Minimum number of distinct opponents per candidate | 2 |
| max_window | Sliding window size for Swiss pairing | 8 |
| tau | Minimum weight floor for margin-weighted aggregation | 0.1 |
| load_cache | Load cached LLM responses from cache/ to skip redundant API calls (set to true to reuse results from a previous run) | false |
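The role of tau can be illustrated with a small sketch. This is a hypothetical aggregation (the exact formula is defined in the paper, not reproduced here): each comparison contributes a weight of at least tau, so near-tie judgments still cast a small vote rather than vanishing.

```python
def margin_weighted_win_rate(results, tau=0.1):
    """Aggregate one candidate's pairwise results into a score mu.

    `results` is a list of (won, margin) pairs, where margin in [0, 1] is the
    judge's confidence. Each comparison is weighted by max(margin, tau), so a
    near-tie comparison still contributes at least a tau-sized vote.
    """
    weights = [max(margin, tau) for _, margin in results]
    wins = sum(w for (won, _), w in zip(results, weights) if won)
    return wins / sum(weights) if weights else 0.0
```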

File Structure

├── data/                                                     # Evaluation datasets
│   ├── test_livecodebench.parquet                            # LiveCodeBench-v5
│   ├── test_livecodebench_v6.parquet                         # LiveCodeBench-v6
│   ├── test_code_contests.parquet                            # CodeContests
│   ├── aime_2025.parquet                                     # AIME 2025
│   └── hmmt_feb_2025.parquet                                 # HMMT Feb 2025
├── eval/
│   ├── eval_e2e_pairwise.py                                  # End-to-end pairwise pipeline
│   ├── eval_e2e_pointwise.py                                 # End-to-end pointwise pipeline
│   ├── verify_pairwise.py                                    # Swiss tournament verification
│   ├── verify_pointwise.py                                   # Pointwise verification
│   ├── verify_utils.py                                       # Verification prompts, parsing, aggregation
│   ├── eval_results_parallel.py                              # Parallel solution grading
│   ├── utils.py                                              # Shared utilities (API, caching, etc.)
│   └── scripts/
│       ├── run_e2e_pairwise.sh                               # Pairwise code evaluation
│       ├── run_e2e_pairwise_math.sh                          # Pairwise math evaluation
│       ├── run_e2e_pointwise.sh                              # Pointwise code evaluation
│       └── run_e2e_pointwise_math.sh                         # Pointwise math evaluation
├── rewards/                                                  # Reward/grading functions
│   ├── rl_reward.py                                          # Main reward dispatcher
│   ├── code_reward.py                                        # Code execution grading
│   ├── math_reward.py                                        # Math answer grading
│   └── code_utils/                                           # Per-benchmark execution
├── config/
│   └── generation.yaml                                       # Hydra configuration defaults
└── scripts/
    ├── run_e2e_commands_n16.sh                               # N=16 evaluation commands
    └── modal_eval_commands_n16.sh                            # N=16 commands for Modal

Acknowledgements

We thank Modal for providing compute for the evaluations in this repository.

Citation

@misc{singh2026v1unifyinggenerationselfverification,
      title={$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners},
      author={Harman Singh and Xiuyu Li and Kusha Sareen and Monishwaran Maheswaran and Sijun Tan and Xiaoxia Wu and Junxiong Wang and Alpay Ariyak and Qingyang Wu and Samir Khaki and Rishabh Tiwari and Long Lian and Yucheng Lu and Boyi Li and Alane Suhr and Ben Athiwaratkun and Kurt Keutzer},
      year={2026},
      eprint={2603.04304},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.04304},
}
