
Add GPTQ support for block quantization #2533

Merged

brian-dellabetta merged 5 commits into vllm-project:main from zeel2104:feat/gptq-block-support on Mar 31, 2026

Conversation

@zeel2104
Contributor

Summary

Adds GPTQ support for QuantizationStrategy.BLOCK.

Previously, GPTQ only handled tensor, channel, group, and tensor-group strategies in the weight quantization loop. This change adds block-wise handling by selecting the correct block-column quantization parameters for each GPTQ column update, while keeping the existing Hessian-based error propagation flow unchanged.
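
To make the block-wise handling concrete, below is a minimal sketch of per-column block-qparam selection, assuming a [num_row_blocks, num_col_blocks] layout for scales and zero points. The helper name, the int-style fake-quant range, and the tensor shapes are illustrative assumptions, not the exact llm-compressor internals.

```python
# A minimal sketch of the block-wise handling described above, NOT the actual
# gptq_quantize.py code: `fake_quant_column_block`, the qparam layout, and the
# int-style quant range are all assumptions for illustration.
import torch

def fake_quant_column_block(
    weight: torch.Tensor,        # [rows, cols] weight being GPTQ-updated
    col: int,                    # column index of the current GPTQ update
    scale: torch.Tensor,         # [rows // bh, cols // bw] per-block scales
    zero: torch.Tensor,          # [rows // bh, cols // bw] per-block zero points
    block_structure=(128, 128),  # (block_height, block_width)
    qmin: int = -8,
    qmax: int = 7,
) -> torch.Tensor:
    bh, bw = block_structure
    block_col = col // bw  # which block column contains this weight column

    # Expand each row-block's params to one value per row so they broadcast
    # against the [rows, 1] column slice.
    s = scale[:, block_col].repeat_interleave(bh).unsqueeze(1)
    z = zero[:, block_col].repeat_interleave(bh).unsqueeze(1)

    w = weight[:, col : col + 1]  # keep the slice 2D for clean broadcasting
    q = torch.clamp(torch.round(w / s + z), qmin, qmax)
    return ((q - z) * s).squeeze(1)  # dequantized column for error propagation

# Example: a 256x256 weight with 128x128 blocks -> a 2x2 grid of qparams
w = torch.randn(256, 256)
w_q = fake_quant_column_block(w, col=5, scale=torch.full((2, 2), 0.1), zero=torch.zeros(2, 2))
```

The key point, matching the description above, is that only the qparam column index changes per GPTQ column update; the Hessian-based error propagation around this call is untouched.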

Changes Made

  • added QuantizationStrategy.BLOCK support in src/llmcompressor/modifiers/gptq/gptq_quantize.py
  • quantized each GPTQ column as a 2D block slice using the matching block qparams
  • added a focused unit test in tests/llmcompressor/modifiers/gptq/test_gptq_quantize.py
  • verified existing GPTQ quantization config parsing tests still pass

Test Plan

Tested locally in the repo development environment.

Commands run:

```powershell
$env:PYTHONPATH="src"
pytest tests\llmcompressor\modifiers\gptq\test_gptq_quantize.py tests\llmcompressor\modifiers\quantization\test_base.py -q
```

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

Contributor

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces support for the block quantization strategy within the GPTQ quantization process and includes a new unit test to verify this functionality. A performance optimization was suggested to move the constant block_width calculation outside of the inner loops to avoid redundant assignments.
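
For illustration, a hypothetical before/after of that suggested hoist; the names and loop body here are placeholders, not the actual gptq_quantize.py code:

```python
# Hypothetical sketch of the suggested micro-optimization; `block_structure`
# and the loop are placeholders, not the actual gptq_quantize.py code.
block_structure = (128, 128)
num_columns = 512

# Before: the constant block width is re-derived on every iteration
for col in range(num_columns):
    _, block_width = block_structure
    block_col = col // block_width

# After: compute the constant once; only the block index varies per column
_, block_width = block_structure
for col in range(num_columns):
    block_col = col // block_width
```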

Comment thread src/llmcompressor/modifiers/gptq/gptq_quantize.py
@zeel2104 zeel2104 force-pushed the feat/gptq-block-support branch from 7fcdfd6 to 43aaa3e on March 28, 2026 at 21:16
@zeel2104
Contributor Author

This PR is code complete and ready for review.

Local validation run:

```powershell
$env:PYTHONPATH="src"
pytest -p no:inline_snapshot tests\llmcompressor\modifiers\gptq\test_gptq_quantize.py tests\llmcompressor\modifiers\quantization\test_base.py -q
```

Collaborator

@dsikka dsikka left a comment

LGTM! Thank you!

Comment thread src/llmcompressor/modifiers/gptq/gptq_quantize.py Outdated
@dsikka dsikka added the ready (When a PR is ready for review) label on Mar 30, 2026
Collaborator

@brian-dellabetta brian-dellabetta left a comment

Thank you @zeel2104 for the nice PR! Did you have a chance to run this end-to-end? I'm curious to see how eval results compare for GPTQ FP8_BLOCK vs. round-to-nearest FP8_BLOCK. See for example this comment on the PR to add FP8_BLOCK to autoround.

lmk if you have questions on how to set that up. If you don't have access to any hardware, I can try on my side in the next couple days. Thanks!

Comment thread src/llmcompressor/modifiers/gptq/gptq_quantize.py Outdated
@brian-dellabetta brian-dellabetta added the gptq (For any PR / issue related to GPTQ support) label on Mar 30, 2026
@zeel2104 zeel2104 force-pushed the feat/gptq-block-support branch 2 times, most recently from 6cf005e to 2971396 on March 30, 2026 at 19:46
@mergify
Contributor

mergify Bot commented Mar 30, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @zeel2104.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Mar 30, 2026
@zeel2104
Contributor Author

Updated per review feedback to keep block_width scoped inside the QuantizationStrategy.BLOCK branch for readability. Re-ran the targeted local tests successfully.

@brian-dellabetta
I don’t currently have access to suitable GPU hardware in this environment for a meaningful FP8_BLOCK end-to-end benchmark.

If there’s a recommended internal machine, cluster, or standard benchmark command I should use, I’m happy to run it if I can get access.

@zeel2104 zeel2104 force-pushed the feat/gptq-block-support branch from 2971396 to 6aebc24 on March 30, 2026 at 19:51
@mergify mergify Bot removed the needs-rebase label Mar 30, 2026
@brian-dellabetta
Collaborator

> @brian-dellabetta I don’t currently have access to suitable GPU hardware in this environment for a meaningful FP8_BLOCK end-to-end benchmark.
>
> If there’s a recommended internal machine, cluster, or standard benchmark command I should use, I’m happy to run it if I can get access.

Hi @zeel2104, thanks for updating. Unfortunately we don't expose pathways to test, but I will validate and confirm tomorrow and we can merge this in.

Collaborator

@brian-dellabetta brian-dellabetta left a comment

I was able to run GPTQ FP8_BLOCK on this branch. Results look as expected -- GPTQ outperforms round-to-nearest, and AWQ performs worse because it is not well-suited for block-style quant schemes (more detail on that here):

round-to-nearest

vllm ({'pretrained': 'Meta-Llama-3-8B-Instruct-rtn-fp8block', 'tensor_parallel_size': 1, 'dtype': 'auto', 'gpu_memory_utilization': 0.8}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: auto
| Tasks  |Version|Filter|n-shot|    Metric     |   | Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|------:|---|------|
|wikitext|      2|none  |     0|bits_per_byte  |↓  | 0.6240|±  |   N/A|
|        |       |none  |     0|byte_perplexity|↓  | 1.5411|±  |   N/A|
|        |       |none  |     0|word_perplexity|↓  |10.1034|±  |   N/A|

GPTQ

vllm ({'pretrained': 'Meta-Llama-3-8B-Instruct-gptq-fp8block', 'tensor_parallel_size': 1, 'dtype': 'auto', 'gpu_memory_utilization': 0.8}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: auto
| Tasks  |Version|Filter|n-shot|    Metric     |   | Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|------:|---|------|
|wikitext|      2|none  |     0|bits_per_byte  |↓  | 0.6237|±  |   N/A|
|        |       |none  |     0|byte_perplexity|↓  | 1.5408|±  |   N/A|
|        |       |none  |     0|word_perplexity|↓  |10.0924|±  |   N/A|

AWQ

vllm ({'pretrained': 'Meta-Llama-3-8B-Instruct-awq-fp8block', 'tensor_parallel_size': 1, 'dtype': 'auto', 'gpu_memory_utilization': 0.8}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: auto
| Tasks  |Version|Filter|n-shot|    Metric     |   | Value |   |Stderr|
|--------|------:|------|-----:|---------------|---|------:|---|------|
|wikitext|      2|none  |     0|bits_per_byte  |↓  | 0.6241|±  |   N/A|
|        |       |none  |     0|byte_perplexity|↓  | 1.5412|±  |   N/A|
|        |       |none  |     0|word_perplexity|↓  |10.1067|±  |   N/A|
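
For anyone reproducing numbers like these, a minimal sketch of an equivalent run via the lm-eval harness's Python API; the model path and arguments are lifted from the table headers above, and the exact API surface may vary by lm-eval version:

```python
# Hedged sketch of an eval run matching the tables above; assumes a recent
# lm-eval (0.4.x) with the vllm backend installed, and a locally saved
# quantized model at the path shown in the table header.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=Meta-Llama-3-8B-Instruct-gptq-fp8block,"
        "tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8"
    ),
    tasks=["wikitext"],
    batch_size="auto",
)
# results["results"]["wikitext"] holds bits_per_byte, byte_perplexity,
# and word_perplexity, the three metrics reported above.
print(results["results"]["wikitext"])
```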

Thanks @zeel2104 for the contribution!

@brian-dellabetta brian-dellabetta merged commit a5c58a0 into vllm-project:main Mar 31, 2026
13 checks passed
@HDCharles
Collaborator

closes #2520

2imi9 pushed a commit to 2imi9/llm-compressor that referenced this pull request Apr 3, 2026
## Summary

Adds GPTQ support for `QuantizationStrategy.BLOCK`.

Previously, GPTQ only handled tensor, channel, group, and tensor-group
strategies in the weight quantization loop. This change adds block-wise
handling by selecting the correct block-column quantization parameters
for each GPTQ column update, while keeping the existing Hessian-based
error propagation flow unchanged.

## Changes Made

- added `QuantizationStrategy.BLOCK` support in
`src/llmcompressor/modifiers/gptq/gptq_quantize.py`
- quantized each GPTQ column as a 2D block slice using the matching
block qparams
- added a focused unit test in
`tests/llmcompressor/modifiers/gptq/test_gptq_quantize.py`
- verified existing GPTQ quantization config parsing tests still pass

## Test Plan

Tested locally in the repo development environment.

Commands run:

```powershell
$env:PYTHONPATH="src"
pytest tests\llmcompressor\modifiers\gptq\test_gptq_quantize.py tests\llmcompressor\modifiers\quantization\test_base.py -q
```

---------

Signed-off-by: Zeel <[email protected]>
Co-authored-by: Brian Dellabetta <[email protected]>
Signed-off-by: Ziming <[email protected]>
