This is the official implementation of the ICLR 2026 paper:
TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning
Tunyu Zhang*, Haizhou Shi*, Yibin Wang, Hengyi Wang, Xiaoxiao He, Zhuowei Li, Haoxian Chen, Ligong Han, Kai Xu, Huan Zhang, Dimitris Metaxas, Hao Wang
*Equal contribution
TokUR is a training-free token-level uncertainty estimation framework for LLM reasoning. It introduces low-rank random weight perturbation during LLM decoding to generate predictive distributions for token-level uncertainty estimation, and aggregates these quantities to capture the semantic uncertainty of generated responses.
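For intuition, the perturbation mechanism can be sketched as follows. This is a minimal illustration under assumed shapes, not the repository's actual implementation; `perturbed_linear` and all tensor names are hypothetical:

```python
import torch

def perturbed_linear(x, W, U, V, sigma=0.1):
    """Linear layer with a low-rank random weight perturbation.

    W: (out_dim, in_dim) frozen pretrained weight.
    U: (out_dim, r) and V: (in_dim, r): fixed rank-r basis vectors.
    Each forward pass draws fresh Gaussian noise in the rank-r subspace,
    so repeated decoding passes induce a predictive distribution over tokens.
    """
    eps = sigma * torch.randn(U.shape[1])    # rank-r Gaussian sample
    delta_w = U @ torch.diag(eps) @ V.T      # low-rank weight perturbation
    return x @ (W + delta_w).T

x = torch.randn(4, 64)                       # toy activations
W, U, V = torch.randn(32, 64), torch.randn(32, 8), torch.randn(64, 8)
y = perturbed_linear(x, W, U, V)             # output varies across calls
```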
Key features:
- Token-level uncertainty decomposition into aleatoric uncertainty (data randomness) and epistemic uncertainty (model uncertainty), with theoretical guarantees (a sketch follows this list).
- Training-free: Works with any pre-trained LLM without retraining or fine-tuning.
- Practical applications: Incorrect reasoning path detection, high-quality solution selection, and uncertainty-guided generation for test-time scaling.
- vLLM integration: Efficient batched inference via seamless vLLM support.
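The decomposition in the first bullet follows the standard predictive-entropy split into expected entropy plus mutual information; a minimal sketch of that split (illustrative only, not necessarily the paper's exact estimator):

```python
import torch

def decompose_uncertainty(probs: torch.Tensor):
    """probs: (K, V) next-token distributions from K perturbed forward passes.

    total     = entropy of the averaged distribution (predictive uncertainty)
    aleatoric = average per-pass entropy (inherent data randomness)
    epistemic = total - aleatoric (mutual information; model uncertainty)
    """
    mean_p = probs.mean(dim=0)
    total = -(mean_p * mean_p.clamp_min(1e-12).log()).sum()
    aleatoric = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
    return total, aleatoric, total - aleatoric

probs = torch.softmax(torch.randn(8, 100), dim=-1)  # 8 passes, toy vocab of 100
total, aleatoric, epistemic = decompose_uncertainty(probs)
```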
## Installation

```bash
git clone https://github.com/Wang-ML-Lab/TokUR.git
cd TokUR
```

TokUR requires a forked version of vLLM (v0.7.3) that supports Bayesian weight perturbation during decoding:

```bash
git clone https://github.com/haizhou-shi/vllm.git
cd vllm
export VLLM_COMMIT=61c6a5a79664882a8ab1c9af3ff78677911516dc # use full commit hash from the main branch
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/${VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl
pip install --editable .
```

Then, from the TokUR repository root, install the remaining dependencies and the bundled `bayesian_transformer` package:

```bash
pip install -r requirements.txt
cd bayesian_transformer
pip install -e .
cd ..
```

**Important:** the `enforce_eager=True` flag must be set when initializing the vLLM model; it disables CUDA graph compilation, which would otherwise bypass the Bayesian sampling process.
## Quick Start

```python
import bayesian_transformer
from vllm import LLM, SamplingParams

# Load a TFB (Training-Free Bayesian) model
# Use a HuggingFace model ID or a local path to a converted model
model = LLM(
    "/path/to/TFB-Llama-3.2-1B-Instruct",  # or "n1h111sm/TFB-Qwen2.5-3B-Instruct"
    enforce_eager=True,
)

# Generate with uncertainty estimation
prompt = "Solve: What is 15% of 240?"
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=1024,
    logprobs=1,
)
output = model.generate(prompt, sampling_params)
print(output[0].outputs[0].text)
```
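Because `logprobs=1` is requested, per-token log-probabilities can be read from the output. As a simple illustration, here is how to turn them into a mean negative log-likelihood score (this naive aggregation is just a baseline-style proxy, not TokUR's estimator):

```python
# Continues the Quick Start example above.
completion = output[0].outputs[0]

# completion.logprobs is a list (one entry per generated token) of dicts
# mapping token_id -> Logprob; read the sampled token's logprob at each step.
token_logprobs = [
    lp[token_id].logprob
    for token_id, lp in zip(completion.token_ids, completion.logprobs)
]
mean_nll = -sum(token_logprobs) / len(token_logprobs)
print(f"mean negative log-likelihood: {mean_nll:.4f}")
```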
## Available Models

| Model | HuggingFace ID | Architecture |
|---|---|---|
| TFB-Qwen2.5-3B-Instruct | `n1h111sm/TFB-Qwen2.5-3B-Instruct` | Qwen 2.5 |
Note: Due to licensing restrictions, we cannot publicly release the TFB weights for Llama models. You can convert Llama (or any other supported model) to a TFB model locally using the conversion script below.
## Model Conversion

We provide `convert_to_tfb.py` to convert any supported base model into a TFB model by computing SVD basis vectors for the attention layers (`q_proj`, `v_proj`).
```bash
# Convert a Llama model
python convert_to_tfb.py \
    --model-path /path/to/Meta-Llama-3.2-1B-Instruct \
    --output-path /path/to/TFB-Llama-3.2-1B-Instruct \
    --architecture llama \
    --rank 8

# Convert a Qwen2 model
python convert_to_tfb.py \
    --model-path /path/to/Qwen2.5-3B-Instruct \
    --output-path /path/to/TFB-Qwen2.5-3B-Instruct \
    --architecture qwen2 \
    --rank 8
```

Arguments:

- `--model-path`: Path to the base HuggingFace model directory (must contain `.safetensors` files)
- `--output-path`: Where to save the converted TFB model
- `--architecture`: Model architecture (`llama` or `qwen2`)
- `--rank`: Rank of the low-rank basis vectors (default: 8)
- `--bayes-noise`: Noise direction, `right` (default) or `left`
The script handles both single-file and sharded (multi-file) models automatically. The converted model can be loaded directly with vLLM or HuggingFace as shown in the Quick Start section.
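For intuition, the core of the conversion is an SVD of each attention weight matrix; a minimal sketch (the function name and shapes are hypothetical, and the real script additionally handles checkpoint loading, sharding, and saving):

```python
import torch

def top_r_svd_basis(W: torch.Tensor, r: int = 8):
    """Return rank-r left/right basis vectors spanning W's top singular subspace."""
    U, S, Vh = torch.linalg.svd(W.float(), full_matrices=False)
    return U[:, :r], Vh[:r, :].T

W_q = torch.randn(2048, 2048)      # stands in for a q_proj weight matrix
U_r, V_r = top_r_svd_basis(W_q, r=8)
```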
## Datasets

We provide the datasets used in our experiments via HuggingFace. To download them:

```bash
cd datasets
python download_data.py
cd ..
```

This downloads MATH500, GSM8K (test set), DeepScaleR (subset), and Leg Counting (subset) in JSONL format.
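To sanity-check a download, you can peek at the first record (the exact schema varies per dataset):

```python
import json

# Inspect the first record of a downloaded dataset.
with open("datasets/math500.jsonl") as f:
    record = json.loads(f.readline())
print(sorted(record.keys()))
```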
## Detecting Incorrect Reasoning Paths

**Step 1: Generate responses with uncertainty estimation**

Run greedy decoding with TokUR uncertainty on multiple GPUs:

```bash
# Set MODEL_BASE_DIR to your local model directory
export MODEL_BASE_DIR=/path/to/your/models
bash bash_scripts/unc_greedy_single_para_batch.sh
```

Or run manually:
```bash
CUDA_VISIBLE_DEVICES=$GPU python run/greedy_unc_single_batch_refine.py \
    --dataset-path "datasets/math500.jsonl" \
    --dataset-start 0 \
    --dataset-end 500 \
    --model-path /path/to/TFB-Llama3.2-1B-Instruct \
    --output-dir ./results/llama1b_results_vllm_pg/math500/seed96/greedy_unc \
    --seed 96 \
    --batch-size 16
```

**Step 2: Evaluate uncertainty quality**

```bash
bash bash_scripts/eval_detect.sh math500 llama1b 96 89 64
```

Or run the evaluation script directly:
```bash
python eval/eval_detect_multi_seed.py \
    --dataset "math500" \
    --model "llama1b" \
    --results_subdir "greedy_unc" \
    --seeds 96 89 64
```

Results (AUROC, AUPRC, Top-50% ACC) are saved to `results/eval/`.
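For reference, these detection metrics are standard scores computed over (uncertainty, correctness) pairs; a minimal sketch with scikit-learn and toy data (the repository's script additionally aggregates over seeds):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# One uncertainty score per response; label 1 means the answer was incorrect.
uncertainty = np.array([0.9, 0.2, 0.7, 0.1])
is_wrong = np.array([1, 0, 1, 0])

print("AUROC:", roc_auc_score(is_wrong, uncertainty))
print("AUPRC:", average_precision_score(is_wrong, uncertainty))

# Top-50% ACC: accuracy on the half of responses with the lowest uncertainty.
keep = np.argsort(uncertainty)[: len(uncertainty) // 2]
print("Top-50% ACC:", 1.0 - is_wrong[keep].mean())
```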
## Uncertainty-Guided Test-Time Scaling

**Step 1: Generate multiple candidate responses**

```bash
export MODEL_BASE_DIR=/path/to/your/models
bash bash_scripts/unc_greedy_para.sh
```

**Step 2: Evaluate scaling performance**
```bash
bash bash_scripts/run_scaling_multi_gpu.sh \
    --model llama1b \
    --dataset math500 \
    --seed 96
```
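Conceptually, uncertainty-guided selection keeps the candidate with the lowest aggregated uncertainty; a toy sketch (the actual logic lives in `eval/eval_scaling_test_multi_gpu.py`):

```python
# Toy best-of-N selection: each candidate pairs a response with its
# aggregated TokUR uncertainty; keep the most certain one.
candidates = [
    ("The answer is 36.", 0.82),
    ("15% of 240 = 0.15 * 240 = 36.", 0.21),
    ("The answer is 24.", 0.55),
]
best_response, _ = min(candidates, key=lambda pair: pair[1])
print(best_response)
```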
## Repository Structure

```text
TokUR/
├── bayesian_transformer/                  # Installable package for TFB models
│   ├── bayesian_transformer/
│   │   ├── __init__.py                    # Auto-registration of models
│   │   ├── config.py                      # BayesianLM configuration
│   │   ├── layers.py                      # Low-rank Bayesian linear layers (core)
│   │   ├── model.py                       # Bayesian wrapper for HuggingFace models
│   │   └── vllm_models/                   # vLLM-optimized implementations
│   │       ├── tfb_llama.py               # TFB Llama (transformers)
│   │       ├── tfb_llama_vllm.py          # TFB Llama (vLLM)
│   │       └── tfb_qwen2_vllm.py          # TFB Qwen2 (vLLM)
│   └── setup.py
├── bash_scripts/                          # Shell scripts for experiments
│   ├── unc_greedy_single_para_batch.sh    # Single greedy generation
│   ├── unc_greedy_para.sh                 # Multi-particle generation
│   ├── eval_detect.sh                     # Detection evaluation
│   └── run_scaling_multi_gpu.sh           # Test-time scaling evaluation
├── datasets/                              # Dataset download utilities
│   └── download_data.py
├── eval/                                  # Evaluation scripts
│   ├── eval_detect_multi_seed.py          # Multi-seed detection evaluation
│   └── eval_scaling_test_multi_gpu.py     # Multi-GPU scaling evaluation (Table 2)
├── run/                                   # Inference scripts
│   ├── greedy_unc_single_batch_refine.py  # Batch greedy + uncertainty
│   ├── greedy_responses_unc.py            # Per-sample uncertainty
│   └── utils/                             # Shared utilities
│       ├── config.py                      # Experiment configuration
│       ├── grader.py                      # Math answer grading
│       ├── math.py                        # Math aggregation utilities
│       └── qwen_math_parser.py            # Answer extraction & parsing
├── convert_to_tfb.py                      # Convert base models to TFB format
├── requirements.txt
├── LICENSE
└── README.md
```
## Citation

If you find this work useful, please cite our paper:

```bibtex
@inproceedings{TokUR,
  title={TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning},
  author={Zhang, Tunyu and Shi, Haizhou and Wang, Yibin and Wang, Hengyi and He, Xiaoxiao and Li, Zhuowei and Chen, Haoxian and Han, Ligong and Xu, Kai and Zhang, Huan and Metaxas, Dimitris and Wang, Hao},
  booktitle={International Conference on Learning Representations},
  year={2026}
}
```

## Acknowledgments

This work builds on several open-source projects:
- vLLM for efficient LLM inference
- HuggingFace Transformers for model infrastructure
- BLoB and TFB for Bayesian LLM foundations
## License

This project is licensed under the MIT License. See `LICENSE` for details.