Add GPTQ support for block quantization #2533
brian-dellabetta merged 5 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.
Code Review
This pull request introduces support for the block quantization strategy within the GPTQ quantization process and includes a new unit test to verify this functionality. A performance optimization was suggested to move the constant block_width calculation outside of the inner loops to avoid redundant assignments.
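For illustration, a minimal sketch of the column-to-block-column mapping that the hoisting suggestion concerns; the function name and variables are hypothetical, not the PR's actual code:

```python
# Hedged sketch of the suggested micro-optimization: block_width is constant for a
# given layer, so compute it once before the loop rather than inside it.
def column_to_block_column(num_columns: int, block_width: int) -> list[int]:
    """Map each GPTQ column index to the block column whose qparams it uses."""
    block_width = max(1, block_width)  # computed once, outside the per-column loop
    return [col // block_width for col in range(num_columns)]

print(column_to_block_column(num_columns=8, block_width=4))  # [0, 0, 0, 0, 1, 1, 1, 1]
```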
Force-pushed 7fcdfd6 to 43aaa3e
This PR is code complete and ready for review. Local validation run: $env:PYTHONPATH="src"
pytest -p no:inline_snapshot tests\llmcompressor\modifiers\gptq\test_gptq_quantize.py tests\llmcompressor\modifiers\quantization\test_base.py -q
brian-dellabetta
left a comment
Thank you @zeel2104 for the nice PR! Did you have a chance to run this end-to-end? I'm curious to see how eval results compare for GPTQ FP8_BLOCK vs. round-to-nearest FP8_BLOCK. See for example this comment on the PR to add FP8_BLOCK to autoround.
lmk if you have questions on how to set that up. If you don't have access to any hardware, I can try on my side in the next couple days. Thanks!
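For readers following along, an end-to-end run of the kind discussed here would look roughly like the sketch below; the model ID, dataset, and calibration settings are placeholders, and `scheme="FP8_BLOCK"` assumes that preset is available in the installed compressed-tensors:

```python
# Hedged sketch of a GPTQ FP8_BLOCK oneshot run with llm-compressor; the model,
# dataset, and calibration settings are placeholders, not settings from this PR.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",
    scheme="FP8_BLOCK",   # assumes the FP8_BLOCK preset scheme is available
    ignore=["lm_head"],
)

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Meta-Llama-3-8B-Instruct-gptq-fp8block",
)
```

The round-to-nearest baseline would presumably use the same scheme with a plain QuantizationModifier in place of GPTQModifier.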
Force-pushed 6cf005e to 2971396
This pull request has merge conflicts that must be resolved before it can be merged.
Updated per review feedback to keep the constant block_width calculation outside the inner loops. @brian-dellabetta, if there's a recommended internal machine, cluster, or standard benchmark command I should use, I'm happy to run it if I can get access.
Force-pushed 2971396 to 6aebc24
Hi @zeel2104, thanks for updating. Unfortunately we don't expose pathways to test, but I will validate and confirm tomorrow and we can merge this in.
brian-dellabetta
left a comment
I was able to run GPTQ FP8_BLOCK on this branch. Results look as expected -- GPTQ outperforms round-to-nearest, and AWQ performs worse because it is not well-suited for block-style quant schemes (more detail on that here):
round-to-nearest
vllm ({'pretrained': 'Meta-Llama-3-8B-Instruct-rtn-fp8block', 'tensor_parallel_size': 1, 'dtype': 'auto', 'gpu_memory_utilization': 0.8}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: auto
| Tasks |Version|Filter|n-shot| Metric | | Value | |Stderr|
|--------|------:|------|-----:|---------------|---|------:|---|------|
|wikitext| 2|none | 0|bits_per_byte |↓ | 0.6240|± | N/A|
| | |none | 0|byte_perplexity|↓ | 1.5411|± | N/A|
| | |none | 0|word_perplexity|↓ |10.1034|± | N/A|
GPTQ
vllm ({'pretrained': 'Meta-Llama-3-8B-Instruct-gptq-fp8block', 'tensor_parallel_size': 1, 'dtype': 'auto', 'gpu_memory_utilization': 0.8}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: auto
| Tasks |Version|Filter|n-shot| Metric | | Value | |Stderr|
|--------|------:|------|-----:|---------------|---|------:|---|------|
|wikitext| 2|none | 0|bits_per_byte |↓ | 0.6237|± | N/A|
| | |none | 0|byte_perplexity|↓ | 1.5408|± | N/A|
| | |none | 0|word_perplexity|↓ |10.0924|± | N/A|
AWQ
vllm ({'pretrained': 'Meta-Llama-3-8B-Instruct-awq-fp8block', 'tensor_parallel_size': 1, 'dtype': 'auto', 'gpu_memory_utilization': 0.8}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: auto
| Tasks |Version|Filter|n-shot| Metric | | Value | |Stderr|
|--------|------:|------|-----:|---------------|---|------:|---|------|
|wikitext| 2|none | 0|bits_per_byte |↓ | 0.6241|± | N/A|
| | |none | 0|byte_perplexity|↓ | 1.5412|± | N/A|
| | |none | 0|word_perplexity|↓ |10.1067|± | N/A|
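For reference, tables in this format are typically produced with lm-evaluation-harness's vLLM backend; a hedged sketch, with the local model path as a placeholder:

```python
# Hedged sketch of producing a wikitext table like the ones above with
# lm-evaluation-harness's vLLM backend; the model path is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=Meta-Llama-3-8B-Instruct-gptq-fp8block,"
        "tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.8"
    ),
    tasks=["wikitext"],
    batch_size="auto",
)
print(results["results"]["wikitext"])  # word_perplexity, byte_perplexity, bits_per_byte
```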
Thanks @zeel2104 for the contribution!
closes #2520
## Summary

Adds GPTQ support for `QuantizationStrategy.BLOCK`. Previously, GPTQ only handled tensor, channel, group, and tensor-group strategies in the weight quantization loop. This change adds block-wise handling by selecting the correct block-column quantization parameters for each GPTQ column update, while keeping the existing Hessian-based error propagation flow unchanged (a minimal sketch of this selection follows after this description).

## Changes Made

- added `QuantizationStrategy.BLOCK` support in `src/llmcompressor/modifiers/gptq/gptq_quantize.py`
- quantized each GPTQ column as a 2D block slice using the matching block qparams
- added a focused unit test in `tests/llmcompressor/modifiers/gptq/test_gptq_quantize.py`
- verified existing GPTQ quantization config parsing tests still pass

## Test Plan

Tested locally in the repo development environment. Commands run:

```powershell
$env:PYTHONPATH="src"
pytest tests\llmcompressor\modifiers\gptq\test_gptq_quantize.py tests\llmcompressor\modifiers\quantization\test_base.py -q
```

---------

Signed-off-by: Zeel <[email protected]>
Co-authored-by: Brian Dellabetta <[email protected]>
Signed-off-by: Ziming <[email protected]>
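To make the block-wise handling in the summary concrete, here is a minimal sketch of how per-block quantization parameters can be selected for a given GPTQ column; the names and the 128x128 block shape are illustrative, not the exact code in `gptq_quantize.py`:

```python
# Hedged sketch of block-strategy qparam selection inside a GPTQ-style column loop.
# Tensor names and the 128x128 block shape are illustrative only.
import torch

def select_block_qparams(
    scale: torch.Tensor, zero_point: torch.Tensor, col: int, block_shape=(128, 128)
):
    """Return the per-row-block scale/zero-point column matching weight column `col`.

    `scale` and `zero_point` have shape (num_row_blocks, num_col_blocks); every
    weight column inside a block column shares that block column's qparams.
    """
    block_col = col // block_shape[1]
    return scale[:, block_col], zero_point[:, block_col]

# Toy example: a (256, 256) weight with 128x128 blocks has (2, 2) qparams.
scale = torch.tensor([[0.1, 0.2], [0.3, 0.4]])
zero_point = torch.zeros_like(scale)
s, z = select_block_qparams(scale, zero_point, col=200)  # column 200 -> block column 1
print(s)  # tensor([0.2000, 0.4000]) -- one scale per row block
```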