TL;DR. DISC adaptively partitions reasoning traces during inference so models spend more compute on the hardest steps, improving accuracy at fixed token budgets and cutting pass@10 error on APPS, MATH500, and LiveCodeBench.
- Overview
- Key Features
- Repo Structure
- Installation
- Quickstart
- Dynamic Decomposition (How it works)
- Reproducing Paper Results
- Evaluation & Analysis
- Supported Models
- Configuration
- Citation
DISC is a recursive inference-time procedure that proposes candidate prefixes, scores them with an outcome reward, and dynamically advances or contracts step size—allocating more samples to uncertain prefixes while skipping easy regions. Plug-and-play with greedy, beam, or MCTS search.
- Adaptive decomposition: adjusts step sizes on the fly; no handcrafted heuristics.
- Compute efficiency: higher accuracy at the same token budget; fewer total tokens for fixed samples.
- Search-agnostic: drop-in with greedy/beam/MCTS; one operator controls node expansion.
- Minimal assumptions: needs only a scalar outcome reward (e.g., unit tests, verifiers, self-critique).
- Provable guarantees: monotonic improvement of the best solution prefix under mild support assumptions.
.
├── src/
│ ├── solvers/ # Core dynamic decomposition logic (DISC, baselines)
│ ├── llm_models/ # Model adapters (OpenAI, HuggingFace, etc.)
│ ├── tasks/ # Task harnesses
│ │ ├── apps/ # APPS benchmark
│ │ ├── MATH/ # MATH500 benchmark
│ │ └── livecodebench/ # LiveCodeBench benchmark
│ ├── executors/ # Code execution and testing
│ ├── scripts/ # Experimental scripts for reproducing results
│ ├── data_analysis/ # Post-experiment analysis and plotting
│ │ ├── produce_standard_charts_demo.ipynb # Demo notebook for analysis
│ │ └── utils.py # Analysis utilities
│ ├── conf/ # Hydra configuration files
│ │ ├── inference.yaml # Main inference config
│ │ ├── auxgen.yaml # Test generation config
│ │ ├── solver/ # Solver configs (DISC, BoN, baselines)
│ │ └── task/ # Task-specific configs
│ ├── run_inference.py # Main entry point for inference
│ └── run_auxgen.py # Entry point for test generation
├── data/ # Generated solutions and benchmark data
│ └── generated_solutions/ # Output directory for experiments
├── environment.yml # Conda environment specification
├── requirements.txt # Pip requirements
└── README.md
We include both an environment.yml file for conda and a requirements.txt file for pip. To install the required packages, you can use either of the following commands:
# Option 1: Using conda (recommended)
conda env create -f environment.yml
conda activate discor
# Option 2: Using pip
pip install -r requirements.txt
We recommend using conda to manage the environment, as it is easier to install some of the required packages (e.g., PyTorch) using conda.
Requirements:
- Python 3.13+ (3.10+ should work)
- PyTorch 2.5+ with CUDA support (for local model inference)
- API keys for proprietary models (OpenAI, Anthropic, Google) if using them; a quick environment check is sketched below
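If you plan to use proprietary models, make sure the corresponding API keys are exported before launching a run. A minimal check is sketched below; OPENAI_API_KEY and ANTHROPIC_API_KEY are the usual SDK variable names, while the Google key name is an assumption and may differ in your setup.

```python
import os

# OPENAI_API_KEY and ANTHROPIC_API_KEY are the standard SDK variable names;
# the Google/Gemini variable name below is an assumption and may differ.
EXPECTED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"]

missing = [name for name in EXPECTED_KEYS if not os.environ.get(name)]
if missing:
    print(f"Warning: missing API keys: {', '.join(missing)}")
else:
    print("All expected API keys are set.")
```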
Minimal example (APPS with DISC):
bash src/scripts/apps_inference.sh
This will run the full APPS experiment comparing different decomposition methods (BoN, newline-based, token-based, and DISC).
Single experiment with DISC:
python -m src.run_inference \
run_name=my-experiment \
task=apps_comp_test \
solver=dycomp \
solver.params.decomp_budget=30 \
solver.params.alpha_fraction=0.15 \
solver.params.model=gpt-4o-mini \
solver.params.temperature=0.2 \
solver.params.split_metric=zscore \
top_k_problems=200
How DISC works:
- Identify pivotal prefixes via sampled continuations + outcome reward (a minimal sketch follows this list).
- Adapt granularity: hard prefixes trigger contract (finer steps); easy ones advance (coarser steps).
- Allocate compute where it matters: drive rollouts only when they improve a standardized score (e.g., z-score) over the current best prefix.
- Search integration: same decomposition policy governs node expansion in greedy/beam/MCTS.
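A minimal, self-contained sketch of this loop is shown below. It is not the repository implementation (see src/solvers/ for that): the function names, the z-score acceptance rule, and the advance/contract updates are illustrative assumptions based on the description above, with dummy stand-ins for the sampler and the reward.

```python
import random
import statistics
from typing import Callable

def disc_sketch(
    sample_fn: Callable[[str, int], str],   # (prefix, max_new_tokens) -> sampled continuation
    reward_fn: Callable[[str], float],      # candidate solution -> scalar outcome reward in [0, 1]
    budget: int = 30,                       # total sampling budget (cf. decomp_budget)
    alpha: float = 0.15,                    # base step fraction (cf. alpha_fraction)
    samples_per_step: int = 4,
    max_new_tokens: int = 512,
    z_threshold: float = 0.0,               # acceptance threshold on the standardized score
) -> str:
    """Illustrative dynamic decomposition loop; not the repository implementation."""
    prefix = ""
    best_solution, best_reward = "", float("-inf")
    step_fraction = alpha
    used = 0

    while used < budget:
        # Sample full continuations from the current prefix and score them with the outcome reward.
        candidates = [prefix + sample_fn(prefix, max_new_tokens) for _ in range(samples_per_step)]
        rewards = [reward_fn(c) for c in candidates]
        used += samples_per_step

        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards) or 1.0
        best_idx = max(range(len(rewards)), key=rewards.__getitem__)
        zscore = (rewards[best_idx] - mean) / std   # rough analogue of split_metric=zscore

        if rewards[best_idx] > best_reward and zscore >= z_threshold:
            # Advance: record the best full solution so far, commit a fraction of the best
            # continuation as the new prefix, and coarsen the step size.
            best_solution, best_reward = candidates[best_idx], rewards[best_idx]
            continuation = candidates[best_idx][len(prefix):]
            cut = max(1, int(len(continuation) * step_fraction))
            prefix += continuation[:cut]
            step_fraction = min(1.0, 2 * step_fraction)
        else:
            # Contract: the current prefix looks hard, so refine it with a finer step size.
            step_fraction = max(alpha / 4, step_fraction / 2)

        if best_reward >= 1.0:  # rough analogue of stop_sum_score: e.g., all unit tests pass
            break

    return best_solution or prefix

# Toy usage with dummy stand-ins for the LLM sampler and the outcome reward:
solution = disc_sketch(
    sample_fn=lambda prefix, n: " step" * random.randint(1, 8),
    reward_fn=lambda sol: min(1.0, len(sol) / 40),
)
print(repr(solution))
```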
Scripts for reproducing the paper's results are provided in the src/scripts/ directory and are named after the experiments they reproduce. The scripts assume you are not using SLURM for job scheduling; if you are, use the variants in src/scripts/slurm_version/.
All scripts can be run directly from the repository root:
Compare DISC against baselines (BoN, newline decomposition, token decomposition) on different benchmarks:
APPS (Competition Problems):
bash src/scripts/apps_inference.sh
Compares different decomposition methods on APPS competition problems. Runs single generation, BoN, newline-based decomposition, token-based decomposition, and DISC with different split metrics.
MATH500:
bash src/scripts/math_inference.sh
Main comparisons on the MATH500 benchmark with verifier-based rewards.
LiveCodeBench:
bash src/scripts/livecodebench_inference.sh
Evaluates different methods on LiveCodeBench with sandboxed unit tests.
Priority Metric Ablation:
bash src/scripts/apps_metric.sh
Compares different split metrics for DISC: mean, z-score, random, negative-mean, and negative-z-score.
Model Comparison:
bash src/scripts/apps_model.sh
Compares DISC performance across different LLM models:
- gpt-4o-mini
- gpt-4o
- Llama-3.1-8B
- Mistral-7B-v0.3
- DeepSeek-R1-Distill-Llama-8B
- Qwen-2.5-7B
Search Strategy Comparison:
bash src/scripts/apps_search.sh
Compares DISC with different search strategies: greedy (baseline), MCTS, and beam search with various beam sizes.
Temperature Ablation:
bash src/scripts/apps_temperature.sh
Sweeps temperature values (0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4) comparing BoN vs. DISC.
Alpha Fraction Ablation:
bash src/scripts/apps_alpha_fraction.sh
Ablation study on the alpha_fraction hyperparameter (0.05, 0.10, 0.15, 0.20, 0.25, 0.30).
When ground-truth tests are not available, DISC can use self-generated validation tests:
1. Generate validation tests:
bash src/scripts/apps_testgen.sh
Generates validation tests for each problem and records them in a jsonl file under data/apps-testgen/. Uses the configuration in src/conf/auxgen.yaml.
2. Run experiments with self-generated tests:
bash src/scripts/apps_val.sh
Runs BoN and decomposition baselines using self-generated validation tests instead of ground-truth tests. Note: you need to update the TESTGEN_PATH variable in the script to point to the generated tests from step 1.
Experiments are configured using Hydra. Default configurations can be found in:
- src/conf/inference.yaml: Main inference configuration
- src/conf/auxgen.yaml: Test generation configuration
- src/conf/solver/: Solver-specific configs (dycomp, bon, spchar, tokencomp, etc.)
- src/conf/task/: Task-specific configs (apps, math_500, livecodebench, etc.)
You can modify parameters by:
- Editing the configuration files directly
- Passing command-line overrides (e.g., solver.params.temperature=0.5)
- Editing the shell scripts in src/scripts/
By default, generated solutions are saved to the data/generated_solutions/<task_name>/ directory. You can change this by modifying the solution_set_path parameter in the configuration file.
You can analyze results using the notebook src/data_analysis/produce_standard_charts_demo.ipynb. This notebook will:
- Load generated solutions from jsonl files
- Compute metrics used in the paper (pass@k, token usage, etc.)
- Generate plots comparing different methods
- Save plots to the root directory
Usage:
- Open src/data_analysis/produce_standard_charts_demo.ipynb
- Update the paths to your generated solutions jsonl files (see the loading sketch after this list)
- Run the notebook cells
- Plots will be generated and saved
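If you prefer to inspect results outside the notebook, the loading step is plain jsonl. The path below is illustrative and the record fields are repository-specific, so this only shows the generic pattern.

```python
import json
from pathlib import Path

# Illustrative path; point this at your own run under data/generated_solutions/<task_name>/.
results_path = Path("data/generated_solutions/apps_comp_test/my-experiment.jsonl")

records = [json.loads(line) for line in results_path.read_text().splitlines() if line.strip()]
print(f"Loaded {len(records)} records")
if records:
    print(f"Fields in first record: {sorted(records[0])}")
```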
Metrics:
- Primary: pass@k (k = 1, 10) for coding/math (see the estimator sketch after this list)
- Secondary: token usage & sample count for compute efficiency
- Reward sources: unit tests (code), verifiers (math), self-critique (generic)
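pass@k is normally reported with the standard unbiased estimator computed from n samples per problem, c of which are correct: pass@k = 1 - C(n-c, k)/C(n, k). The repository's analysis utilities may differ in detail; the sketch below is a minimal reference implementation with illustrative counts.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples per problem, c of which are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per problem, per-problem correct counts from some run (illustrative numbers).
correct_counts = [0, 3, 10, 1, 0, 7]
n = 10
for k in (1, 10):
    score = sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
    print(f"pass@{k} = {score:.3f}")
```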
DISC supports both open-source and proprietary models:
Open-source models:
- LLaMA-3.1-8B
- Mistral-7B-v0.3
- Qwen-2.5-7B
- DeepSeek-R1-Distill-Llama-8B
Proprietary models:
- OpenAI (gpt-4o, gpt-4o-mini, etc.)
- Anthropic Claude (via API)
- Google Gemini (via API)
Model adapters are in src/llm_models/. To add a new model:
- Implement the model interface in src/llm_models/ (a hypothetical sketch follows this list)
- Update configuration to reference the new model
- Ensure API keys are set in environment variables if needed
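The adapter interface in src/llm_models/ is not spelled out in this README, so the class and method names below are hypothetical. The sketch only illustrates the general shape of a text-in/text-out adapter that a new model would provide; follow the existing adapters in src/llm_models/ for the actual interface.

```python
from abc import ABC, abstractmethod

class BaseLLM(ABC):
    """Hypothetical adapter interface; the real one in src/llm_models/ may differ."""

    @abstractmethod
    def generate(self, prompt: str, temperature: float = 0.2, max_tokens: int = 512) -> str:
        """Return a single sampled completion for the prompt."""

class EchoLLM(BaseLLM):
    """Trivial stand-in used only to show the shape of a new adapter."""

    def generate(self, prompt: str, temperature: float = 0.2, max_tokens: int = 512) -> str:
        return prompt[-max_tokens:]

if __name__ == "__main__":
    model = EchoLLM()
    print(model.generate("Write a function that reverses a string."))
```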
DISC uses Hydra for configuration management. Key parameters:
Solver parameters (DISC):
solver: dycomp
solver.params:
decomp_budget: 30 # Total sampling budget
alpha_fraction: 0.15 # Threshold for advancing/contracting
split_metric: zscore # Priority metric (mean|zscore|random|negmean|negzscore)
temperature: 0.2 # Sampling temperature
model: gpt-4o-mini # Model identifier
stop_sum_score: 1.0 # Stop when cumulative reward >= this value
Task parameters:
task: apps_comp_test # Task identifier
top_k_problems: 200 # Number of problems to solve (-1 for all)
Search variants:
- solver=dycomp: Greedy DISC (default)
- solver=dycomp_mcts: MCTS with DISC decomposition
- solver=dycomp_beam: Beam search with DISC decomposition
Baseline solvers:
- solver=simple: Single generation
- solver=bon: Best-of-N sampling
- solver=spchar: Newline-based decomposition
- solver=tokencomp: Token-based decomposition
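Because configuration is handled by Hydra, you can also compose and inspect a resolved config from Python without launching inference. The sketch below uses Hydra's standard compose API; it assumes a recent Hydra version, that it is run from the repository root (config_path is resolved relative to the calling file), and that the example overrides are valid for your config. Treat it as illustrative.

```python
from hydra import compose, initialize
from omegaconf import OmegaConf

# config_path is resolved relative to the calling file in Hydra's compose API;
# adjust it if you place this snippet somewhere other than the repository root.
with initialize(version_base=None, config_path="src/conf"):
    cfg = compose(
        config_name="inference",
        overrides=[
            "run_name=inspect-config",
            "task=apps_comp_test",
            "solver=dycomp_mcts",
            "solver.params.temperature=0.5",
            "top_k_problems=10",
        ],
    )
print(OmegaConf.to_yaml(cfg))
```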
If you use DISC in your research, please cite our NeurIPS 2025 paper:
@inproceedings{light2025disc,
title = {{DISC}: Dynamic decomposition improves {LLM} inference scaling},
author = {Light, Jonathan and Cheng, Wei and Riviere, Benjamin and Wu, Yue and Oyamada, Masafumi and Wang, Mengdi and Yue, Yisong and Paternain, Santiago and Chen, Haifeng},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS 2025)},
year = {2025}
}
For questions or issues, please open an issue on GitHub or contact the authors.
Core authors and affiliations are listed in the paper.