Add GPTQ support for block quantization #2533
brian-dellabetta merged 5 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.
Code Review
This pull request introduces support for the block quantization strategy within the GPTQ quantization process and includes a new unit test to verify this functionality. A performance optimization was suggested to move the constant block_width calculation outside of the inner loops to avoid redundant assignments.
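For illustration, a minimal sketch of the column-to-block-column mapping that the hoisting suggestion concerns; the function name and variables are hypothetical, not the PR's actual code:

```python
# Hedged sketch of the suggested micro-optimization: block_width is constant for a
# given layer, so compute it once before the loop rather than inside it.
def column_to_block_column(num_columns: int, block_width: int) -> list[int]:
    """Map each GPTQ column index to the block column whose qparams it uses."""
    block_width = max(1, block_width)  # computed once, outside the per-column loop
    return [col // block_width for col in range(num_columns)]

print(column_to_block_column(num_columns=8, block_width=4))  # [0, 0, 0, 0, 1, 1, 1, 1]
```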
Force-pushed 7fcdfd6 to 43aaa3e
This PR is code complete and ready for review. Local validation run: $env:PYTHONPATH="src"
pytest -p no:inline_snapshot tests\llmcompressor\modifiers\gptq\test_gptq_quantize.py tests\llmcompressor\modifiers\quantization\test_base.py -q
brian-dellabetta
left a comment
Thank you @zeel2104 for the nice PR! Did you have a chance to run this end-to-end? I'm curious to see how eval results compare for GPTQ FP8_BLOCK vs. round-to-nearest FP8_BLOCK. See for example this comment on the PR to add FP8_BLOCK to autoround.
lmk if you have questions on how to set that up. If you don't have access to any hardware, I can try on my side in the next couple days. Thanks!
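For readers following along, an end-to-end run of the kind discussed here would look roughly like the sketch below; the model ID, dataset, and calibration settings are placeholders, and `scheme="FP8_BLOCK"` assumes that preset is available in the installed compressed-tensors:

```python
# Hedged sketch of a GPTQ FP8_BLOCK oneshot run with llm-compressor; the model,
# dataset, and calibration settings are placeholders, not settings from this PR.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",
    scheme="FP8_BLOCK",   # assumes the FP8_BLOCK preset scheme is available
    ignore=["lm_head"],
)

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Meta-Llama-3-8B-Instruct-gptq-fp8block",
)
```

The round-to-nearest baseline would presumably use the same scheme with a plain QuantizationModifier in place of GPTQModifier.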
Force-pushed 6cf005e to 2971396
This pull request has merge conflicts that must be resolved before it can be merged.
Updated per review feedback to keep the constant block_width calculation outside the inner loops. @brian-dellabetta, if there's a recommended internal machine, cluster, or standard benchmark command I should use, I'm happy to run it if I can get access.
Force-pushed 2971396 to 6aebc24
Hi @zeel2104, thanks for updating. Unfortunately we don't expose pathways to test, but I will validate and confirm tomorrow and we can merge this in.
brian-dellabetta
left a comment
I was able to run GPTQ FP8_BLOCK on this branch. Results look as expected -- GPTQ outperforms round-to-nearest, and AWQ performs worse because it is not well-suited for block-style quant schemes (more detail on that here):
round-to-nearest
vllm ({'pretrained': 'Meta-Llama-3-8B-Instruct-rtn-fp8block', 'tensor_parallel_size': 1, 'dtype': 'auto', 'gpu_memory_utilization': 0.8}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: auto
| Tasks |Version|Filter|n-shot| Metric | | Value | |Stderr|
|--------|------:|------|-----:|---------------|---|------:|---|------|
|wikitext| 2|none | 0|bits_per_byte |↓ | 0.6240|± | N/A|
| | |none | 0|byte_perplexity|↓ | 1.5411|± | N/A|
| | |none | 0|word_perplexity|↓ |10.1034|± | N/A|
GPTQ
vllm ({'pretrained': 'Meta-Llama-3-8B-Instruct-gptq-fp8block', 'tensor_parallel_size': 1, 'dtype': 'auto', 'gpu_memory_utilization': 0.8}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: auto
| Tasks |Version|Filter|n-shot| Metric | | Value | |Stderr|
|--------|------:|------|-----:|---------------|---|------:|---|------|
|wikitext| 2|none | 0|bits_per_byte |↓ | 0.6237|± | N/A|
| | |none | 0|byte_perplexity|↓ | 1.5408|± | N/A|
| | |none | 0|word_perplexity|↓ |10.0924|± | N/A|
AWQ
vllm ({'pretrained': 'Meta-Llama-3-8B-Instruct-awq-fp8block', 'tensor_parallel_size': 1, 'dtype': 'auto', 'gpu_memory_utilization': 0.8}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: auto
| Tasks |Version|Filter|n-shot| Metric | | Value | |Stderr|
|--------|------:|------|-----:|---------------|---|------:|---|------|
|wikitext| 2|none | 0|bits_per_byte |↓ | 0.6241|± | N/A|
| | |none | 0|byte_perplexity|↓ | 1.5412|± | N/A|
| | |none | 0|word_perplexity|↓ |10.1067|± | N/A|
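For reference, tables in this format are typically produced with lm-evaluation-harness's vLLM backend; a hedged sketch, with the local model path as a placeholder:

```python
# Hedged sketch of producing a wikitext table like the ones above with
# lm-evaluation-harness's vLLM backend; the model path is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=Meta-Llama-3-8B-Instruct-gptq-fp8block,"
        "tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8"
    ),
    tasks=["wikitext"],
    batch_size="auto",
)
print(results["results"]["wikitext"])  # word_perplexity, byte_perplexity, bits_per_byte
```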
Thanks @zeel2104 for the contribution!
closes #2520
## Summary

Adds GPTQ support for `QuantizationStrategy.BLOCK`. Previously, GPTQ only handled tensor, channel, group, and tensor-group strategies in the weight quantization loop. This change adds block-wise handling by selecting the correct block-column quantization parameters for each GPTQ column update, while keeping the existing Hessian-based error propagation flow unchanged (a minimal sketch of this selection follows after this description).

## Changes Made

- added `QuantizationStrategy.BLOCK` support in `src/llmcompressor/modifiers/gptq/gptq_quantize.py`
- quantized each GPTQ column as a 2D block slice using the matching block qparams
- added a focused unit test in `tests/llmcompressor/modifiers/gptq/test_gptq_quantize.py`
- verified existing GPTQ quantization config parsing tests still pass

## Test Plan

Tested locally in the repo development environment. Commands run:

```powershell
$env:PYTHONPATH="src"
pytest tests\llmcompressor\modifiers\gptq\test_gptq_quantize.py tests\llmcompressor\modifiers\quantization\test_base.py -q
```

---------

Signed-off-by: Zeel <[email protected]>
Co-authored-by: Brian Dellabetta <[email protected]>
Signed-off-by: Ziming <[email protected]>
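To make the block-wise handling in the summary concrete, here is a minimal sketch of how per-block quantization parameters can be selected for a given GPTQ column; the names and the 128x128 block shape are illustrative, not the exact code in `gptq_quantize.py`:

```python
# Hedged sketch of block-strategy qparam selection inside a GPTQ-style column loop.
# Tensor names and the 128x128 block shape are illustrative only.
import torch

def select_block_qparams(
    scale: torch.Tensor, zero_point: torch.Tensor, col: int, block_shape=(128, 128)
):
    """Return the per-row-block scale/zero-point column matching weight column `col`.

    `scale` and `zero_point` have shape (num_row_blocks, num_col_blocks); every
    weight column inside a block column shares that block column's qparams.
    """
    block_col = col // block_shape[1]
    return scale[:, block_col], zero_point[:, block_col]

# Toy example: a (256, 256) weight with 128x128 blocks has (2, 2) qparams.
scale = torch.tensor([[0.1, 0.2], [0.3, 0.4]])
zero_point = torch.zeros_like(scale)
s, z = select_block_qparams(scale, zero_point, col=200)  # column 200 -> block column 1
print(s)  # tensor([0.2000, 0.4000]) -- one scale per row block
```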