
ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference

Paper | Blog | Models

ParoQuant is an efficient 4-bit weight-only quantization method that achieves state-of-the-art quantization accuracy while incurring minimal inference overhead. It currently supports the LLaMA and Qwen3 model families.

ParoQuant Method Diagram
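As the name suggests, ParoQuant rotates pairs of weight channels before quantizing them. The toy NumPy sketch below is illustrative only, not the actual ParoQuant algorithm (the function names, the 45° angle, and the single shared scale are all assumptions): it shows why a pairwise (Givens) rotation can help, by spreading an outlier's magnitude across two channels so the shared 4-bit quantization scale wastes fewer levels.

```python
import numpy as np

def quantize_int4(w):
    # Symmetric 4-bit quantization with one shared scale (levels -7..7);
    # returns the dequantized approximation.
    scale = np.abs(w).max() / 7.0
    return np.clip(np.round(w / scale), -7, 7) * scale

def rotate_pair(w, i, j, theta):
    # Apply a 2x2 Givens rotation to channels i and j of w.
    out = w.astype(float).copy()
    c, s = np.cos(theta), np.sin(theta)
    out[i] = c * w[i] - s * w[j]
    out[j] = s * w[i] + c * w[j]
    return out

w = np.array([10.0, 0.6, 0.5, -0.5])  # one outlier channel

# Baseline: the outlier forces a large scale, flushing the small weights to zero.
err_plain = np.linalg.norm(quantize_int4(w) - w)

# Rotate the outlier together with a neighbor, quantize, then rotate back
# (the rotation is orthogonal, hence exactly invertible).
theta = np.pi / 4
deq = rotate_pair(quantize_int4(rotate_pair(w, 0, 1, theta)), 0, 1, -theta)
err_rot = np.linalg.norm(deq - w)

print(f"error without rotation: {err_plain:.3f}, with rotation: {err_rot:.3f}")
```

Because the rotation is orthogonal, it can be folded into the weights and inverted cheaply at inference time; the quantization error in the rotated basis is strictly smaller here than in the original basis.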

Quick Start

Try out ParoQuant models with a single command:

docker run --rm -it --gpus all --ipc=host ghcr.io/z-lab/paroquant:chat --model z-lab/Qwen3-8B-PARO

For platforms with compute capability ≥ 12.1 (e.g. NVIDIA DGX Spark), please use ghcr.io/z-lab/paroquant:chat-cu130 instead.

Setup

We recommend using the docker image ghcr.io/z-lab/paroquant:latest rather than manually setting up the environment:

docker run -it --gpus all --ipc=host ghcr.io/z-lab/paroquant:latest

If you prefer to run on the host, follow the setup instructions below.

Clone this repository:

git clone https://github.com/z-lab/paroquant
cd paroquant

Install dependencies:

# use conda (recommended)
conda env create -f environment.yml
conda activate paroquant
pip install ./kernels --no-build-isolation

# or use pip
pip install -r requirements.txt
pip install ./kernels --no-build-isolation

You may need to modify requirements.txt to match your CUDA version.

Usage

Optimization

First, run the optimization script to obtain the optimized checkpoints. The checkpoints will be stored in output/<model_name>.

experiments/optimize/4bit.sh Qwen/Qwen3-8B

Then, create a Hugging Face model with either pseudo quantization (model weights stay in FP16 but simulate the quantization error) or real quantization (model weights are stored in INT4):

# pseudo quantization
python3 scripts/pseudo_quant.py \
    --model Qwen/Qwen3-8B \
    --result-dir output/Qwen3-8B \
    --output-path models/Qwen3-8B-PARO-pseudo

# real quantization
python3 scripts/real_quant.py \
    --model Qwen/Qwen3-8B \
    --result-dir output/Qwen3-8B \
    --output-path models/Qwen3-8B-PARO
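To make the pseudo/real distinction concrete, here is a minimal, hypothetical sketch of fake quantization (this is not the repository's actual script logic; the group size and symmetric INT4 scheme are assumptions): each weight group is rounded to 4-bit levels and immediately dequantized, so the output is still an FP16 tensor but carries the same error a real INT4 model would.

```python
import numpy as np

def pseudo_quantize(w, group_size=128):
    # Fake-quantize: round each group of weights to symmetric 4-bit levels
    # (-7..7 with one scale per group), then dequantize right away, so the
    # result stays floating point but includes the INT4 rounding error.
    flat = w.astype(np.float32).ravel()
    for start in range(0, flat.size, group_size):
        g = flat[start:start + group_size]
        scale = max(np.abs(g).max() / 7.0, 1e-8)  # guard against all-zero groups
        flat[start:start + group_size] = np.clip(np.round(g / scale), -7, 7) * scale
    return flat.reshape(w.shape).astype(np.float16)

w = np.random.default_rng(0).normal(size=(4, 128)).astype(np.float16)
wq = pseudo_quantize(w)
print(wq.dtype, float(np.abs(wq.astype(np.float32) - w.astype(np.float32)).max()))
```

Pseudo-quantized checkpoints are convenient for accuracy evaluation with unmodified FP16 kernels, while real-quantized checkpoints pack the weights into INT4 for the custom inference kernels.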

Inference

The docker image for interactive inference is ghcr.io/z-lab/paroquant:chat. Install vLLM if you are running on the host:

pip install vllm==0.15.1

To run a real-quantized model with vLLM and open an interactive chat:

# with docker
docker run --rm -it --gpus all --ipc=host ghcr.io/z-lab/paroquant:chat --model z-lab/Qwen3-8B-PARO

# without docker
python3 scripts/interactive_gen.py --model z-lab/Qwen3-8B-PARO

Add --backend transformers to run with the Transformers backend instead. Note that the Transformers backend suffers from performance degradation on long generations.

Models

We provide pre-quantized 4-bit ParoQuant models listed below. These are real-quantized models and can be loaded with the method described above.

Model Hugging Face Path
Meta-Llama-3-8B z-lab/Meta-Llama-3-8B-PARO
Meta-Llama-3-70B z-lab/Meta-Llama-3-70B-PARO
Llama-3.1-8B-Instruct z-lab/Llama-3.1-8B-Instruct-PARO
Llama-2-7b-hf z-lab/Llama-2-7b-hf-PARO
Qwen3-0.6B z-lab/Qwen3-0.6B-PARO
Qwen3-1.7B z-lab/Qwen3-1.7B-PARO
Qwen3-4B z-lab/Qwen3-4B-PARO
Qwen3-8B z-lab/Qwen3-8B-PARO
Qwen3-14B z-lab/Qwen3-14B-PARO
Qwen3-0.6B-Base z-lab/Qwen3-0.6B-Base-PARO
Qwen3-1.7B-Base z-lab/Qwen3-1.7B-Base-PARO
Qwen3-4B-Base z-lab/Qwen3-4B-Base-PARO
Qwen3-8B-Base z-lab/Qwen3-8B-Base-PARO
Qwen3-14B-Base z-lab/Qwen3-14B-Base-PARO
Qwen3-4B-Thinking-2507 z-lab/Qwen3-4B-Thinking-2507-PARO
DeepSeek-R1-Distill-Llama-8B z-lab/DeepSeek-R1-Distill-Llama-8B-PARO

In addition, we provide the original checkpoints and pseudo-quantized models in z-lab/paroquant-checkpoints to facilitate reproduction and further research.

Reproduction

In the experiments directory, we provide the original scripts that produce the models, experiment results, and figures in the paper. Please refer to the README in that directory for more details.

Docker

We provide four docker images for easy environment setup:

  • ghcr.io/z-lab/paroquant:latest for optimization and non-reasoning task evaluation
  • ghcr.io/z-lab/paroquant:chat for running the chat app
  • ghcr.io/z-lab/paroquant:chat-cu130 for running the chat app with CUDA 13.0
  • ghcr.io/z-lab/paroquant:eval-reasoning for reasoning task evaluation

Use the following command to create a container and activate an interactive shell:

docker run -it --gpus all --ipc=host ghcr.io/z-lab/paroquant:<tag>

Contribution

Contributions are welcome! Please install pre-commit to ensure a consistent code style:

pip install pre-commit
pre-commit install

Reference

If you find ParoQuant useful or relevant to your research, please cite our paper:

@inproceedings{liang2026paroquant,
  title     = {{ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference}},
  author    = {Liang, Yesheng and Chen, Haisheng and Zhang, Zihan and Han, Song and Liu, Zhijian},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}
