Releases · ModelCloud/GPTQModel
GPT-QModel v5.6.6
Notable Changes:
- Use static cuda ctx for triton kernel launch by @Qubitium in #2269
- Remove random-word dependency by @LRL2-ModelCloud in #2266
- Update PyPcre dependency from 0.2.7 to 0.2.8 by @Qubitium in #2267
What's Changed
- Bump the github-actions group with 2 updates by @dependabot[bot] in #2265
- Update version.py by @Qubitium in #2268
- Ready 5.6.6 by @Qubitium in #2270
Full Changelog: v5.6.2...v5.6.6
GPT-QModel v5.6.4
What's Changed
- Bump the github-actions group with 2 updates by @dependabot[bot] in #2265
- remove random-word dependency by @LRL2-ModelCloud in #2266
- Update pypcre version from 0.2.7 to 0.2.8 by @Qubitium in #2267
- Update version.py by @Qubitium in #2268
Full Changelog: v5.6.2...v5.6.4
GPT-QModel v5.6.2
Notable Changes
- FIX JIT PyTorch extension pack_cpu_ext stall by @ZX-ModelCloud in #2248
- Refactor Kernel External Dependency Validation by @LRL2-ModelCloud in #2249
- FIX some models not honoring model.config.use_cache by force-passing use_cache=False by @LRL2-ModelCloud in #2246
- FIX Incorrect Triton dequant_kernel for 3-bit GPTQ (INT3) leading to Triton compile error / wrong dequantization in #2251
- Support llm-awq by @ZX-ModelCloud in #2252
What's Changed
- Update version.py by @Qubitium in #2247
- Update README.md by @davedgd in #2250
- [CI] add torch 2.9.1 by @CSY-ModelCloud in #2254
- … by @KingdalfGoodman in #2258
- Update license declaration in pyproject.toml by @CSY-ModelCloud in #2259
- Modify setup by @Qubitium in #2260
- Add release notes for version 5.6.2 by @Qubitium in #2261
- fix test_quant_formats.py by @LRL2-ModelCloud in #2262
- [CI] mount dataset dir to /monster/data/model/dataset by @CSY-ModelCloud in #2263
- fix parsing args by @CSY-ModelCloud in #2264
New Contributors
- @KingdalfGoodman made their first contribution in #2258
Full Changelog: v5.6.0...v5.6.2
GPT-QModel v5.6.0
Notable Changes:
- HF Kernel for CPU: AMX, AVX2, AVX512 optimized by @jiqing-feng in #2232 (loading sketch after this list)
- Fix: Resolve performance regression during initial forward pass with offload_to_disk by @avtc in #2239
- Auto module tree by @LRL2-ModelCloud in #2204
- Afmoe support by @LRL2-ModelCloud in #2243
- Add dots1 by @Qubitium in #2231
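The CPU kernel work above is selectable at load time. A minimal loading sketch, assuming BACKEND.TORCH_FUSED is the selector for the fused CPU kernel these notes reference elsewhere, and using a hypothetical model id:

```python
# Minimal sketch: load a quantized checkpoint on CPU so the AMX/AVX-optimized
# kernel path is exercised. BACKEND.TORCH_FUSED and the model id are assumptions.
from gptqmodel import GPTQModel, BACKEND

model = GPTQModel.load(
    "ModelCloud/example-4bit-gptq",  # hypothetical quantized model id
    device="cpu",
    backend=BACKEND.TORCH_FUSED,     # assumed enum member for the fused CPU kernel
)
```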
What's Changed
- Update description and code about GPTAQ in README.md by @wayneguow in #2202
- Update test cases for qwen2.5-vl and qwen3-vl by @wayneguow in #2203
- Optimize minimax m2 modelling forward pass by @avtc in #2176
- remove gemm ipex by @LRL2-ModelCloud in #2206
- Bump actions/checkout from 5 to 6 in the github-actions group by @dependabot[bot] in #2207
- Update device-smi dependency version to 0.5.2 by @Qubitium in #2208
- Fix loading an AWQ-quantized model with GPTQModel when it is not actu… by @LRL2-ModelCloud in #2209
- fix exllama v2 post init by @LRL2-ModelCloud in #2211
- [FIX] Add fallback for "module_dir" and "entry key" lookup by @ZX-ModelCloud in #2210
- Update unit_tests.yml by @Qubitium in #2213
- fix mps backend does not implement float64 by @Qubitium in #2216
- [FIX] _apply_quant() not being called with awq by @ZX-ModelCloud in #2218
- Fix AWQ Extension by @LRL2-ModelCloud in #2217
- Auto AWQ kernel selection for Transformers compat by @Qubitium in #2214
- Fix add bias for torch_fuse by @jiqing-feng in #2223
- [CI] Add torch_fused test with Bias by @ZX-ModelCloud in #2222
- [FIX] device_map with cpu only causing CpuOffload hooks to be injected by @ZX-ModelCloud in #2225
- fix awq apply_scale and apply_clip multi-thread issue by @LRL2-ModelCloud in #2224
- Fix CI test not passing by @Qubitium in #2226
- Monkeypatch lm-eval latest broken imports by @Qubitium in #2227
- make file callable via pytest by @CSY-ModelCloud in #2228
- CI Fix awq weight mean by @LRL2-ModelCloud in #2229
- fix PyCharm auto-importing the wrong path by @CSY-ModelCloud in #2230
- [FIX] TorchFusedAwqQuantLinear selection by @ZX-ModelCloud in #2233
- [CI] update CI path by @CSY-ModelCloud in #2236
- [Model] Mistral3 support by @LRL2-ModelCloud in #2238
- Update setup.py by @Qubitium in #2240
- Increase MAX_JOBS from 4 to 8 in release.yml by @Qubitium in #2241
- [FIX] non-persistent buffer was saved incorrectly by @ZX-ModelCloud in #2242
New Contributors
- @wayneguow made their first contribution in #2202
Full Changelog: v5.4.2...v5.6.0
GPT-QModel v5.4.2
Notable Changes:
- Fix double fwd regression by @Qubitium in #2198
- Add cli: gptqmodel env by @ZX-ModelCloud in #2192
- [CI] compile wheel with python -m build by @CSY-ModelCloud in #2193
What's Changed
- Start v5.5.0 devel branch (odd version) by @Qubitium in #2191
- Update version from 5.5.0 to 5.4.2 patch release by @Qubitium in #2199
- [CI] copy wheel to local dir instead of using http server by @CSY-ModelCloud in #2200
Full Changelog: v5.4.0...v5.4.2
GPT-QModel v5.4.0
Notable Changes:
- AWQ Torch Fused Kernel by @Qubitium in #2190
- Make torch fused op compilable by @jiqing-feng in #2182
- [FIX] AWQ MoE by @ZX-ModelCloud in #2171
- add :? capture-only syntax by @Qubitium in #2173
What's Changed
- Update latest news section in README.md by @Qubitium in #2166
- run forward pass even for empty subset to produce correct layer outputs by @avtc in #2161
- Reduce AWQ memory usage by @Qubitium in #2167
- Awq update by @Qubitium in #2168
- Retry partial.to to fix accelerate invalid argument for first moe layer (reapply) by @avtc in #2169
- Awq update by @Qubitium in #2172
- adjust retry partial.to by @avtc in #2175
- cleanup awq_get_modules_for_scaling() by @ZX-ModelCloud in #2179
- [FIX] qwen3 moe sparse moe block by @ZX-ModelCloud in #2184
- Add module convert by @LRL2-ModelCloud in #2183
- Cleanup by @Qubitium in #2185
- Update pypcre version to 0.2.5 by @LRL2-ModelCloud in #2186
- Update pypcre version to 0.2.5 by @Qubitium in #2189
- [FIX] version("triton") crash on torch+xpu by @ZX-ModelCloud in #2188
Full Changelog: v5.2.0...v5.4.0
GPT-QModel v5.2.0
Notable Changes:
- Minimax M2, Granite Nano, Qwen3-VL, Brumby model support
- AWQ quantization is now out of beta and fully integrated into the quantization life cycle
- New VramStrategy.Balanced property to spread MoE modules across different gpus (see the sketch after this list)
- New pure torch AWQ kernel
- New calibration_concat_separator property
- Fixed HF bug that did not save mtp layers for GLM 4.5/4.6 (air) models
- Fixed multi-gpu cuda asserts due to stream/sync
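A minimal sketch of the two new properties above used together; the placement of vram_strategy in QuantizeConfig, the VramStrategy import path, and the calibration_concat_separator kwarg are inferred from these notes and not verified against the v5.2.0 API:

```python
from gptqmodel import GPTQModel, QuantizeConfig
from gptqmodel.quantization.config import VramStrategy  # assumed import path

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    # Spread MoE expert modules across all visible gpus instead of filling
    # one device first (assumed config-level placement of this option).
    vram_strategy=VramStrategy.Balanced,
)

model = GPTQModel.load("zai-org/GLM-4.6", quant_config)
model.quantize(
    ["calibration sample one", "calibration sample two"],  # toy calibration data
    calibration_concat_separator="\n\n",  # assumed kwarg per these notes
)
model.save("glm-4.6-gptq-4bit")
```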
What's Changed
- try not adding mem guards for marlin kernel launch protection by @Qubitium in #2108
- MoE vram by @Qubitium in #2110
- Fix GLM 4.5/4.6 and Air not saving mtp layer after save (HF bug) by @LRL2-ModelCloud in #2109
- torchao 0.14.1 update by @Qubitium in #2111
- Test refactor by @Qubitium in #2113
- Bump the github-actions group with 2 updates by @dependabot[bot] in #2120
- [FIX] xpu unit test by @ZX-ModelCloud in #2122
- modular by @Qubitium in #2123
- update scores by @Qubitium in #2124
- Fp8 dequant by @Qubitium in #2125
- Model dequant by @Qubitium in #2126
- Fp4 e2m1 by @Qubitium in #2127
- [FIX] ovis2, compatible with transformers v4.57.1 by @ZX-ModelCloud in #2129
- fix cols padding by @LRL2-ModelCloud in #2130
- [FIX] ovis_1_6 quantization by @ZX-ModelCloud in #2131
- Minimax m2 by @Qubitium in #2128
- Fix awq marlin kernel for bf16 by @Qubitium in #2135
- [FIX] incorrect AWQ NODES by @ZX-ModelCloud in #2133
- add support_offload_to_disk check by @LRL2-ModelCloud in #2134
- Add Awq torch kernel by @Qubitium in #2137
- Marin by @Qubitium in #2139
- Marin scores by @Qubitium in #2141
- Fix triton version detection in nogil patcher by @amd-vlarakic in #2144
- Fix qwen2 omni by @LRL2-ModelCloud in #2140
- [MODEL] Add GraniteMoEHybrid by @ZX-ModelCloud in #2142
- Fold AWQ into proper Looper/Layer/Subset Lifecycle by @Qubitium in #2138
- Refine GPT-QModel description in README by @Qubitium in #2145
- fix device_map by @LRL2-ModelCloud in #2146
- [MODEL] Add Qwen3-VL by @techshoww in #2136
- Add calibration_concat_separator by @Qubitium in #2148
- add test_qwen3_vl.py by @LRL2-ModelCloud in #2147
- Fix triton monkeypatch by @Qubitium in #2149
- [MODEL] Add Brumby by @Qubitium in #2150
- Dedup/Cleanup by @Qubitium in #2151
- Prep for 5.2 release by @Qubitium in #2152
- Dedup3 by @Qubitium in #2153
- add missing file by @Qubitium in #2154
- GPTAQ rename by @Qubitium in #2155
- fix ci test by @Qubitium in #2158
- fix setup license by @Qubitium in #2160
- Fix snapshot_download receiving unsupported kwargs by @Qubitium in #2162
- Retry partial.to to fix accelerate invalid argument error for first moe layer for >4 GPU setups by @avtc in #2163
- Comments + Sync by @Qubitium in #2164
- Stats/Logs by @Qubitium in #2165
New Contributors
- @amd-vlarakic made their first contribution in #2144
- @techshoww made their first contribution in #2136
Full Changelog: v5.0.0...v5.2.0
GPT-QModel v5.0.0
Notable Changes:
- New data-parallel quant support for MoE models on multi-gpu using nogil Python (Python >= 3.13t with PYTHON_GIL=0 env)
- New offload_to_disk support enabled by default to massively reduce cpu ram usage
- New Intel optimized and AMD compatible cpu hw accelerated TorchFused kernel
- Packing stage is now 4x faster and inlined with quantization
- Vram pressure for large models reduced during quantization
- act_group_aware is now 16k+ times faster and the default when desc_act=False, for higher quality recovery without the inference penalty of desc_act=True (see the sketch after this list)
- New beta quality AWQ support with full GEMM, GEMM_Fast, and Marlin kernel support
- New LFM, Ling, Qwen3 Omni model support
- Bitblas kernel updated to support the Bitblas 0.1.0.post1 release
- Quantization is now faster with reduced vram usage. Enhanced logging support with LogBar
- And much much more...
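A minimal sketch of the v5.0.0 flow described above, assuming offload_to_disk surfaces as a load-time toggle and that desc_act=False activates the act_group_aware default; kwarg names follow these notes and may differ in the shipped API:

```python
# For data-parallel MoE quantization, run under Python >= 3.13t with
# PYTHON_GIL=0 set in the environment, per the notes above.
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,  # per these notes, act_group_aware becomes the default here
)

model = GPTQModel.load(
    "meta-llama/Llama-3.1-8B",
    quant_config,
    offload_to_disk=True,  # enabled by default per these notes; assumed kwarg name
)
model.quantize(["calibration sample one", "calibration sample two"])  # toy data
model.save("llama-3.1-8b-gptq-4bit")
```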
What's Changed
- rename torch_dtype to dtype to sync with hf transformers by @Qubitium in #1804
- drop support for python < 3.11 by @CSY-ModelCloud in #1805
- hard deprecated ipex in favor of torch_fused by @Qubitium in #1807
- update pyproject.toml by @CSY-ModelCloud in #1808
- [CI] release with 3.13t by @CSY-ModelCloud in #1811
- [QUANTIZATION] Add AWQ support by @ZX-ModelCloud in #1703
- find mapping by @LRL-ModelCloud in #1812
- Update README.md by @Qubitium in #1813
- Update version.py by @Qubitium in #1814
- Turtle in a half shell by @Qubitium in #1809
- note about memory saving by @Qubitium in #1817
- move fail_safe by @LRL-ModelCloud in #1818
- rename turtle method by @Qubitium in #1820
- add threads by @Qubitium in #1821
- remove AWQ mod defs by @ZX-ModelCloud in #1822
- [CI] use new docker by @CSY-ModelCloud in #1823
- Fix awq quantize by @LRL-ModelCloud in #1824
- [CI] use new docker for release source by @CSY-ModelCloud in #1825
- fix awq pack by @LRL-ModelCloud in #1826
- fix loading autoawq models and hf/vllm/sglang loading of newly awq qu… by @Qubitium in #1827
- wrong arg check by @Qubitium in #1828
- fix thread task var scoping by @Qubitium in #1829
- fix call param by @Qubitium in #1830
- fix threads > 1 not considered (unsafe) by @Qubitium in #1832
- cleanup by @Qubitium in #1833
- fix gptqmodel offload paths conflict by @Qubitium in #1834
- Ci test by @Qubitium in #1835
- eora: always diff in fp32 + cleanup by @Qubitium in #1836
- add register_buffer/parameter to NamedModule class by @Qubitium in #1837
- typo by @Qubitium in #1839
- add thread safety to all classes by @Qubitium in #1840
- fix fail_safe by @LRL-ModelCloud in #1844
- update marlin kernel by @ZX-ModelCloud in #1838
- fix fp32 reduce on/off by @Qubitium in #1845
- bypass marlin kernel bias issue by @Qubitium in #1846
- disable marlin atomics by default as it failed ci accuracy test by @Qubitium in #1847
- [FIX] awq marlin by @ZX-ModelCloud in #1816
- cleanup var names by @Qubitium in #1849
- pack per module by @LRL-ModelCloud in #1842
- [CI] use new docker by @CSY-ModelCloud in #1850
- tweak eora test by @Qubitium in #1851
- wait for thread tasks only when every module has completed. by @Qubitium in #1852
- [FIX] Compatible with vllm v0.10.2 by @ZX-ModelCloud in #1855
- move req.txt into toml by @CSY-ModelCloud in #1858
- do not create buffers only to overwrite them by @Qubitium in #1857
- pop states after use by @Qubitium in #1859
- [FIX] multiple "register_buffers" parameters by @ZX-ModelCloud in #1860
- Low memory pack by @Qubitium in #1861
- fix packing ci test by @Qubitium in #1862
- simplify by @Qubitium in #1853
- Fix 3bit packing regression in previous commit by @Qubitium in #1863
- remove deprecated parallel_packing property by @Qubitium in #1864
- Fix qqq quant/offloading by @Qubitium in #1866
- temp disable awq gemm kernel due to failing ci by @Qubitium in #1867
- update vllm compat by @Qubitium in #1869
- fix regression by @Qubitium in #1870
- fix setup.py crashing because torch may not support float8_e8m0fnu by @CSY-ModelCloud in #1871
- [FIX] AwqGEMMQuantLinear skip gptq_v1 convert to v2 by @ZX-ModelCloud in #1872
- Fix awq gemm auto kernel selection order by @Qubitium in #1873
- Update README.md by @Qubitium in #1874
- reduce forwarding to minimal by @Qubitium in #1876
- Update README.md by @Qubitium in #1877
- fix exllama tests by @Qubitium in #1879
- debug print all params/buffers by @Qubitium in #1880
- skip internal loading of non-pkg compatible quantization models, i.e.… by @Qubitium in #1881
- Loader by @Qubitium in #1882
- Cleanup awq by @Qubitium in #1883
- remove broken test by @Qubitium in #1884
- [CI] remove old cuda/torch support for release by @CSY-ModelCloud in #1885
- fix loader by @LRL-ModelCloud in #1886
- fix nvcc warnings about pending cuda > 13.x compat by @Qubitium in #1887
- fix packing speed test by @Qubitium in #1889
- fix licenses warning by @CSY-ModelCloud in #1888
- set licenses to apache by @CSY-ModelCloud in #1890
- [FIX] AwqGEMMQuantLinear should be a PackableQuantLinear by @ZX-ModelCloud in #1891
- skip modules that have no parameters and no buffers since they can't be offloaded by @LRL-ModelCloud in #1892
- skip modules that have no parameters and no buffers since they can't offload by @LRL-ModelCloud in #1894
- Fix device check by @Qubitium in #1896
- [CI] disable test install by @CSY-ModelCloud in #1895
- remove hash feature by @Qubitium in #1897
- fix cuda ext cannot be loaded by @Qubitium in #1898
- lock numpy to 2.2.6 by @CSY-ModelCloud in #1899
- [FIX] test_lm_eval.py by @ZX-ModelCloud in #1900
- Patch fix model save by @Qubitium in #1901
- Ugly patch save 2 by @Qubitium in #1902
- fix potential leak by @Qubitium in #1904
- [FIX] test_integration by @ZX-ModelCloud in #1903
- fix build uploading an empty wheel by @CSY-ModelCloud in #1905
- fix lm_head quant by @LRL-ModelCloud in #1906
- batch tweaks by @Qubitium in #1907
- [FIX] test_kernel_output_torch_fused by @ZX-ModelCloud in ...
GPT-QModel v4.2.5
What's Changed
- Cleanup hyb_act by @Qubitium in #1791
- Remove torch import in setup.py by @Qubitium in #1729
- Refactor: rename hyb_act to act_group_aware by @Qubitium in #1794
- Cleanup by @Qubitium in #1795, #1796
- [CI] Add torch 2.8.0 by @CSY-ModelCloud in #1797
- [CI] torch-2.6.0+cu128-python-3.9 does not exist by @CSY-ModelCloud in #1798
- Fix wf_unsqueeze_zero and wf_unsqueeze_neg_one by @LRL-ModelCloud in #1799
- GAR field save to meta on quant save by @Qubitium in #1800
- Add pyproject.toml by @CSY-ModelCloud in #1801
- [CI] Don't detect arch list when it has already been set & fix build-system requirements by @CSY-ModelCloud in #1802
Full Changelog: v4.2.0...v4.2.5
GPT-QModel v4.2.0
Notable Changes
- Add Qwen3-Next by @Qubitium and @LRL-ModelCloud in #1787
- Add Apertus support by @LRL-ModelCloud in #1767
- Add Kimi k2 support by @LRL-ModelCloud in #1768
- Add Klear support by @LRL-ModelCloud in #1769
- Add FastLLM support by @LRL-ModelCloud in #1771
- Add Nemotron H support by @LRL-ModelCloud in #1773
- Add fail_safe option by @LRL-ModelCloud in #1775 (see the sketch after this list)
- Use threading lock to protect unsafe tensor moves in multi-gpu by @Qubitium in #1778
- Avoid building experimental extensions to reduce wheel size by @Qubitium in #1763
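A minimal sketch of the fail_safe option noted above; whether it lives on quantize() or on the config is an assumption, as is the model id:

```python
from gptqmodel import GPTQModel, QuantizeConfig

quant_config = QuantizeConfig(bits=4, group_size=128)
model = GPTQModel.load("Qwen/Qwen3-Next-80B-A3B-Instruct", quant_config)

# fail_safe aims to keep long multi-gpu quant runs going past recoverable
# per-module failures (assumed kwarg placement on quantize()).
model.quantize(["calibration sample"], fail_safe=True)
model.save("qwen3-next-gptq-4bit")
```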
What's Changed
- Fix LlavaQwen2GPTQ by @LRL-ModelCloud in #1772
- Fix Q.to on multi-gpu gptq when proceeding fast and has many experts and gpus by @avtc in #1774
- Bump actions/setup-python from 5 to 6 in the github-actions group by @dependabot[bot] in #1758
- [CI] fix release jobs were skipped by @CSY-ModelCloud in #1759
- ignore compile warns about var declared but not used by @Qubitium in #1760
- allow prebuilt wheel path to be customized via env by @Qubitium in #1761
- add build toggles for all cpp kernels by @Qubitium in #1764
- fix multi gpu inference by @LRL-ModelCloud in #1762
- [CI] reduce wheel download size by @CSY-ModelCloud in #1765
- start 4.2.0-dev cycle by @Qubitium in #1766
- fix klear by @LRL-ModelCloud in #1770
- FIX transformers >= 4.56.1 force-changed torch.default_dtype by @Qubitium in #1779
- fix multi gpu fail_safe by @LRL-ModelCloud in #1780
- fix device instance by @LRL-ModelCloud in #1783
- prepare for 4.2 release by @Qubitium in #1785
Full Changelog: v4.1.0...v4.2.0