News: Our paper has been accepted at ICLR 2026!
This repository contains code for the paper "Best-of-Infinity - Asymptotic Performance of Test-Time Compute".
- The `boinf` directory provides analysis based on summary JSON files (`jsonl/` directory). All plots are derived from the analysis in this directory.
- The `answer_generation` directory includes tools for answer generation as well as for comparing best-of-N selection methods. If you only want to test the best-of-infinity results, you do not need to use this directory.
- The "Large-scale Generation Dataset" (all raw LLM answers) is found at https://figshare.com/account/articles/30208525.
`boinf/`:
- `train_main.py`: Optimal weight search via MILP, dataset-split analysis, and transfer evaluation
- `test_main.py`: Evaluate and visualize weighted majority vote (finite-sample and population versions) with adaptive and fixed-N sampling
- `ensemble_utils.py`: Utilities for loading JSONL, computing accuracies, the MILP formulation (depends on HiGHS/`highspy`), etc.
- `example_utils.py`: Used by `train_main.py` and `test_main.py`
- `jsonl_to_table.py`: Convert `jsonl/*.jsonl` to LaTeX tables and generate summaries
- `accuracy_per_problem.py`: Plot the relationship between sample size n and majority-vote accuracy per problem
- `count_problems.py`: Count the number of records (problems) in `analysis_*.jsonl`
- `jsonl/`: Updated pre-analyzed files (e.g., `analysis_aime2025_*.jsonl`, `analysis_math500_*.jsonl`), summarized from the raw answer files so that we do not need to use the generation files directly. Based on re-generated GPT-OSS-20B answers with CoTs; we have slightly updated the parser.
- `jsonl_ver0/`: Pre-analyzed files (old), used in the paper. Based on GPT-OSS-20B answers without CoT.
`answer_generation/`:
- `BoN_answeranalyze.py`: Analyze logs stored in `saved_answers/` and produce `analysis_{dataset}_{llm}.jsonl`
- `BoN_choice_analyze.py`: Aggregate `saved_choices/*.jsonl` and show/visualize accuracy by scale, etc.
- `BoN_batch.py`: Batch-runner support for `BoN_client.py` (organizes logs/outputs)
- `saved_answers/`: Example saved answers
- `output_batch_datagen/`: Example outputs from batch runs
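The pre-analyzed files are in JSON Lines format (one JSON object per line). Their exact per-record schema is not documented here, so as a generic sketch (field names in the example are hypothetical), a file such as those in `boinf/jsonl/` can be loaded record-by-record like this:

```python
import json

def load_jsonl(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```
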
- Python 3.10+ (we used Python 3.11.11 on RunPod)
- We recommend a Linux Docker machine with the working directory at `/workspace`
- The required packages are listed in `requirements.txt`
Example (pip):
```
pip install -r requirements.txt
```

Move to the `boinf/` directory. For example, if we want to analyze NVIDIA-Nemotron-Nano-9B-v2 on the MATH500 dataset:

```
cd boinf
python test_main.py jsonl/analysis_math500_NVIDIA-Nemotron-Nano-9B-v2.jsonl --n-trials 100 --analyze-bayes
```

- All plots are saved to `plots/`
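`test_main.py` evaluates a (weighted) majority vote over sampled answers. Conceptually, each sampled answer contributes its model's weight, and the answer string with the largest total wins. A minimal sketch of that rule (not the repository's implementation; the function name is hypothetical):

```python
from collections import defaultdict

def weighted_majority_vote(answers_per_model, weights):
    """Pick the answer string with the largest total weight.

    answers_per_model: one list of sampled answer strings per model.
    weights: one non-negative weight per model; every answer a model
    sampled contributes that model's weight to the answer's total.
    """
    totals = defaultdict(float)
    for answers, weight in zip(answers_per_model, weights):
        for answer in answers:
            totals[answer] += weight
    return max(totals, key=totals.get)
```

For a single model with weight 1, this reduces to plain majority voting.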
Move to the boinf/ directory. For example, if we want to analyze the ensemble of 5 LLMs on the GPQA-DIAMOND dataset:
```
cd boinf
python test_main.py jsonl/analysis_gpqa_diamond_EXAONE-Deep-32B.jsonl jsonl/analysis_gpqa_diamond_MetaStone-S1-32B.jsonl jsonl/analysis_gpqa_diamond_Phi-4-reasoning.jsonl jsonl/analysis_gpqa_diamond_Qwen3-30B-A3B-Thinking-2507.jsonl jsonl/analysis_gpqa_diamond_gpt-oss-20b.jsonl --weights 0.0176,0.0346,0.2690,0.4144,0.2644 --n-trials 100 --analyze-bayes --no-analyze-fixed --show-single --b-bf 3000
```

To optimize the weights:
```
python train_main.py jsonl/analysis_gpqa_diamond_EXAONE-Deep-32B.jsonl jsonl/analysis_gpqa_diamond_MetaStone-S1-32B.jsonl jsonl/analysis_gpqa_diamond_Phi-4-reasoning.jsonl jsonl/analysis_gpqa_diamond_Qwen3-30B-A3B-Thinking-2507.jsonl jsonl/analysis_gpqa_diamond_gpt-oss-20b.jsonl
```

For the dataset-split analysis (e.g., on MATH500):

```
cd boinf
time python train_main.py --dataset-split jsonl/analysis_math500_*.jsonl
```

For the transfer evaluation (e.g., from AIME2024 to AIME2025):

```
cd boinf
python train_main.py --dataset-source aime2024 --dataset-target aime2025
```

To convert the pre-analyzed files to LaTeX tables, either all at once:

```
cd boinf
python jsonl_to_table.py --all
```

or for a single file:

```
cd boinf
python jsonl_to_table.py jsonl/analysis_aime2025_gpt-oss-20b.jsonl
```

You can use `answer_generation/BoN_client.py` to generate LLM answers in `answer_generation/saved_answers/`. To do so:
- Launch a vLLM server on port 8100. For example, GPT-OSS can be launched following its tutorial: https://cookbook.openai.com/articles/gpt-oss/run-vllm
```
vllm serve /workspace/gpt-oss-20b --port 8100
```

- Obtain the dataset:

```
cd /workspace
git lfs install
git clone https://huggingface.co/datasets/opencompass/AIME2025
```

- Other datasets:
- AIME2024: https://huggingface.co/datasets/Maxwell-Jia/AIME_2024
- GPQA-D: https://huggingface.co/datasets/fingertap/GPQA-Diamond
- MATH500: https://github.com/openai/prm800k.git
To create 80 answers for each problem of AIME2025, run
```
cd answer_generation
python BoN_batch.py --start 0 --end 15 --max_workers 16 --dataset_type aime2025 --evaluation_method omni --use_save --output_dir output_batch_datagen -n 5 --file_start 0
```

- This creates 16 processes, and each process generates five answers. The first process creates `saved_answers/aime2025_probXX_answerYY.txt`, where `XX` ranges from 0 to 29 (i.e., AIME2025 has 30 problems) and `YY` ranges from 0 to 4.
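Assuming the zero-padded naming scheme above, a quick completeness check over the generated files could look like this (the helper and the two-digit padding are illustrative assumptions, not part of the repository):

```python
import os

def missing_answers(directory, dataset="aime2025", n_problems=30, n_answers=80):
    """List expected files ({dataset}_probXX_answerYY.txt, zero-padded)
    that are absent from `directory`."""
    missing = []
    for prob in range(n_problems):
        for ans in range(n_answers):
            name = f"{dataset}_prob{prob:02d}_answer{ans:02d}.txt"
            if not os.path.exists(os.path.join(directory, name)):
                missing.append(name)
    return missing
```
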
You can use `answer_generation/BoN_answeranalyze.py` to generate `analysis_{dataset}_{llm}.jsonl` from the text logs in `saved_answers/`. Note that `saved_answers/` holds the answers of a single LLM at a time.
```
cd answer_generation
python BoN_answeranalyze.py --dataset aime2025
# Output: analysis_aime2025_<LLM_name>.jsonl (written to the current directory)
```

`<LLM_name>` is extracted from the variable `LLM_MODEL_PORT_8100` defined in the `.env` file; change it accordingly.

After generating LLM answers, one can compare the selection methods as follows:
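To check which model name the analyzer will pick up, the `.env` entry can be read with a minimal `KEY=VALUE` parser like the sketch below (assumes no quoting or `export` prefixes; the helper is illustrative, not part of the repository):

```python
def read_env_var(path, key):
    """Return the value of `key` from a .env file of KEY=VALUE lines,
    skipping blanks and '#' comments; None if the key is absent."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            name, _, value = line.partition("=")
            if name.strip() == key:
                return value.strip()
    return None
```
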
```
cd answer_generation
python BoN_batch.py --start 0 --end 15 --max_workers 16 --dataset_type aime2025 --evaluation_method random -n 5 --file_start 0 --max_samples 1000
python BoN_batch.py --start 0 --end 15 --max_workers 16 --dataset_type aime2025 --evaluation_method omni -n 5 --file_start 0 --max_samples 1000
python BoN_batch.py --start 0 --end 15 --max_workers 16 --dataset_type aime2025 --evaluation_method majority -n 2 --file_start 0 --max_samples 1000
python BoN_batch.py --start 0 --end 15 --max_workers 16 --dataset_type aime2025 --evaluation_method self_certainty -n 5 --file_start 0 --max_samples 1000
```

After setting up an LLM server on port 8100:

```
python BoN_batch.py --start 0 --end 15 --max_workers 16 --dataset_type aime2025 --evaluation_method llm_judge_set -n 5 --file_start 0 --max_samples 1000
python BoN_batch.py --start 0 --end 15 --max_workers 16 --dataset_type aime2025 --evaluation_method llm_judge_tournament -n 5 --file_start 0 --max_samples 1000
```

For checking reward models, clone the reward model into `/workspace` and start the reward server process:

```
./start_reward_server.sh /workspace/Skywork-Reward-V2-Llama-3.1-8B
```

Then modify the variable `REWARD_MODEL_ID` in the `.env` file accordingly, and evaluate the performance of the reward model:

```
python BoN_batch.py --start 0 --end 15 --max_workers 16 --dataset_type aime2025 --evaluation_method reward -n 5 --file_start 0 --max_samples 1000
```

The performance of these methods can be compared by:

```
python BoN_choice_analyze.py --dataset aime2025
```

- Use the existing `boinf/jsonl/*.jsonl` to optimize/evaluate the ensemble (see "Reproduction of plots" above)
- For additional models/datasets, generate your own `analysis_*.jsonl` with the tools under `answer_generation/`
- MIT License.