ParoQuant is an efficient 4-bit weight-only quantization method that achieves state-of-the-art quantization accuracy while incurring minimal inference overhead. It currently supports the LLaMA and Qwen3 model families.
Try out ParoQuant models with a single command:
```bash
docker run --rm -it --gpus all --ipc=host ghcr.io/z-lab/paroquant:chat --model z-lab/Qwen3-8B-PARO
```
For platforms with compute capability ≥ 12.1 (e.g., NVIDIA DGX Spark), please use `ghcr.io/z-lab/paroquant:chat-cu130` instead.
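If you are unsure which image applies, the choice amounts to a version comparison on the GPU's compute capability (which you can query with, e.g., `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`). This hypothetical helper, not part of the repo, sketches the selection logic:

```python
def choose_chat_image(compute_cap: str) -> str:
    """Pick the chat image tag for a GPU compute capability string like "12.1"."""
    major, minor = (int(part) for part in compute_cap.split("."))
    # Compute capability >= 12.1 (e.g., DGX Spark) needs the CUDA 13.0 build.
    if (major, minor) >= (12, 1):
        return "ghcr.io/z-lab/paroquant:chat-cu130"
    return "ghcr.io/z-lab/paroquant:chat"
```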
We recommend using the docker image `ghcr.io/z-lab/paroquant:latest` instead of manually setting up the environment:
```bash
docker run -it --gpus all --ipc=host ghcr.io/z-lab/paroquant:latest
```
Please follow the setup instructions below if you'd prefer running on the host.
Clone this repository:
```bash
git clone https://github.com/z-lab/paroquant
cd paroquant
```

Install dependencies:
```bash
# use conda (recommended)
conda env create -f environment.yml
conda activate paroquant
pip install ./kernels --no-build-isolation
```

```bash
# or use pip
pip install -r requirements.txt
pip install ./kernels --no-build-isolation
```

You may need to modify `requirements.txt` to match your CUDA version.
First, run the optimization script to obtain the optimized checkpoints. The checkpoints will be stored in `output/<model_name>`.
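Judging from the example invocation, the checkpoint directory appears to be the model ID's basename under `output/`; this hypothetical helper (an assumption, not part of the repo) captures that convention for scripting:

```python
from pathlib import Path


def result_dir(model_id: str) -> Path:
    """Map a Hugging Face model ID to its assumed ParoQuant output directory."""
    # "Qwen/Qwen3-8B" -> output/Qwen3-8B (basename of the model ID)
    return Path("output") / model_id.split("/")[-1]
```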
```bash
experiments/optimize/4bit.sh Qwen/Qwen3-8B
```

Then, create a Hugging Face model with pseudo quantization (i.e., model weights are in FP16, simulating the quantization) or real quantization (i.e., model weights are in INT4):
```bash
# pseudo quantization
python3 scripts/pseudo_quant.py \
    --model Qwen/Qwen3-8B \
    --result-dir output/Qwen3-8B \
    --output-path models/Qwen3-8B-PARO-pseudo
```
```bash
# real quantization
python3 scripts/real_quant.py \
    --model Qwen/Qwen3-8B \
    --result-dir output/Qwen3-8B \
    --output-path models/Qwen3-8B-PARO
```

The docker image for interactive inference is `ghcr.io/z-lab/paroquant:chat`. Install vLLM if you are running on the host:
```bash
pip install vllm==0.15.1
```

To run a real-quantized model with vLLM and open an interactive chat:
```bash
# with docker
docker run --rm -it --gpus all --ipc=host ghcr.io/z-lab/paroquant:chat --model z-lab/Qwen3-8B-PARO

# without docker
python3 scripts/interactive_gen.py --model z-lab/Qwen3-8B-PARO
```

Add `--backend transformers` to run with the Transformers backend instead. Please note that Transformers suffers from performance degradation on long generations.
We provide pre-quantized 4-bit ParoQuant models listed below. These are real-quantized models and can be loaded with the method described above.
| Model | Hugging Face Path |
|---|---|
| Meta-Llama-3-8B | z-lab/Meta-Llama-3-8B-PARO |
| Meta-Llama-3-70B | z-lab/Meta-Llama-3-70B-PARO |
| Llama-3.1-8B-Instruct | z-lab/Llama-3.1-8B-Instruct-PARO |
| Llama-2-7b-hf | z-lab/Llama-2-7b-hf-PARO |
| Qwen3-0.6B | z-lab/Qwen3-0.6B-PARO |
| Qwen3-1.7B | z-lab/Qwen3-1.7B-PARO |
| Qwen3-4B | z-lab/Qwen3-4B-PARO |
| Qwen3-8B | z-lab/Qwen3-8B-PARO |
| Qwen3-14B | z-lab/Qwen3-14B-PARO |
| Qwen3-0.6B-Base | z-lab/Qwen3-0.6B-Base-PARO |
| Qwen3-1.7B-Base | z-lab/Qwen3-1.7B-Base-PARO |
| Qwen3-4B-Base | z-lab/Qwen3-4B-Base-PARO |
| Qwen3-8B-Base | z-lab/Qwen3-8B-Base-PARO |
| Qwen3-14B-Base | z-lab/Qwen3-14B-Base-PARO |
| Qwen3-4B-Thinking-2507 | z-lab/Qwen3-4B-Thinking-2507-PARO |
| DeepSeek-R1-Distill-Llama-8B | z-lab/DeepSeek-R1-Distill-Llama-8B-PARO |
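Every path in the table follows the same pattern, `z-lab/<model>-PARO`. For scripting downloads across the model zoo, a trivial helper (hypothetical, not part of the repo) captures the convention:

```python
def paro_path(model_name: str) -> str:
    """Map a base model name to its pre-quantized Hugging Face path."""
    return f"z-lab/{model_name}-PARO"
```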
In addition, we provide the original checkpoints and pseudo-quantized models in `z-lab/paroquant-checkpoints` to facilitate reproduction and further research.

In the `experiments` directory, we provide the original scripts that produce the models, experiment results, and figures in the paper. Please refer to the README there for more details.
We provide four docker images for easy environment setup:

- `ghcr.io/z-lab/paroquant:latest` for optimization and non-reasoning task evaluation
- `ghcr.io/z-lab/paroquant:chat` for running the chat app
- `ghcr.io/z-lab/paroquant:chat-cu130` for running the chat app with CUDA 13.0
- `ghcr.io/z-lab/paroquant:eval-reasoning` for reasoning task evaluation
Use the following command to create a container and activate an interactive shell:
```bash
docker run -it --gpus all --ipc=host ghcr.io/z-lab/paroquant:<tag>
```
Contributions are welcome! Please install pre-commit to ensure consistent code styles:
```bash
pip install pre-commit
pre-commit install
```

If you find ParoQuant useful or relevant to your research, please cite our paper:
```bibtex
@inproceedings{liang2026paroquant,
  title     = {{ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference}},
  author    = {Liang, Yesheng and Chen, Haisheng and Zhang, Zihan and Han, Song and Liu, Zhijian},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}
```