GPTQModel v1.7.4
What's Changed
⚡ Faster packing of model weights when saving after quantization.
⚡ Triton kernel now validated for Intel/XPU when the Intel Triton package is installed.
⚡ New compile() API, backed by torch.compile, that improves tokens per second (tps) by ~4-8%. flash_attention may need to be disabled for some kernels. A usage sketch follows these highlights.
🐛 Fix HF Transformers bug that downcast the fast tokenizer class on save.
🐛 Fix inaccurate bits-per-weight (bpw) calculations; a worked example follows below.
🐛 Fix ROCm compile with setup.py.
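A minimal sketch of the new compile() call; the checkpoint id below is a placeholder, and the load/generate calls assume GPTQModel's usual entry points:

```python
from gptqmodel import GPTQModel

# Load any GPTQ-quantized checkpoint; the id here is a placeholder.
model = GPTQModel.load("ModelCloud/your-gptq-4bit-model")

# New in v1.7.4: lets torch optimize the forward pass (~4-8% tps per the
# note above). If a kernel misbehaves, try disabling flash_attention.
model.compile()

tokens = model.generate("Uncovering deep insights begins with")[0]
print(model.tokenizer.decode(tokens))
```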
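For the bpw fixes, a back-of-the-envelope example of what bits-per-weight measures; the per-group storage layout here is an illustrative assumption, not GPTQModel's exact accounting:

```python
# Illustrative GPTQ-style layout (assumption, not GPTQModel internals):
# per group of `group_size` weights we store
#   group_size * bits  bits of packed weights,
#   16                 bits for one fp16 scale, and
#   bits               bits for one packed zero-point.
def estimated_bpw(bits: int, group_size: int) -> float:
    per_group_bits = group_size * bits + 16 + bits
    return per_group_bits / group_size

print(estimated_bpw(bits=4, group_size=128))  # -> 4.15625, i.e. ~4.16 bpw
```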
- Fix exllama slow pack() by @CSY-ModelCloud in #1128
- use optimized torch.round() code by @CSY-ModelCloud in #1131
- fix shape mismatch for packing by @CSY-ModelCloud in #1132
- Speed up triton dequant by @Qubitium in #1136
- add torch compile with backend aot_ts by @CSY-ModelCloud in #1139
- disable sampling by @Qubitium in #1141
- mod triton-xpu by @CL-ModelCloud in #1135
- Suppress dynamo error by @CSY-ModelCloud in #1143
- fix bpw by @CL-ModelCloud in #1150
- [FIX] Fix incorrectly saving the slow tokenizer by @LRL-ModelCloud in #1151
- Add mod chat by @CL-ModelCloud in #1154
- optimize pack by @Qubitium in #1153
- add quant time test by @CL-ModelCloud in #1155
- Export to hf model by @LRL-ModelCloud in #1157
- Fix bpw calculation by @Qubitium in #1163
- Inference speed test by @CL-ModelCloud in #1159
Full Changelog: v1.7.3...v1.7.4