Add QQQ #1402
Conversation
Signed-off-by: Qubitium <[email protected]>
Signed-off-by: Qubitium <[email protected]>
Hi @HandH1998, I am officially adding QQQ to GPTQModel. This should allow QQQ to share all the GPTQModel-supported models and auxiliary features. For now, only model loading and inference work; quantization is next. Is it possible for you to write a simple QQQ kernel in torch? This would serve as a fallback to support all hardware platforms, not just Ampere+. You can just copy the existing … Feel free to contact me on SGLang slack (I see you are also an active contributor there!) or on X: qubitium.
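(For readers following the thread: a pure-torch fallback would essentially dequantize the int4 weights with their scales and run a plain matmul. The sketch below assumes an already-unpacked, unshuffled weight layout, which is not QQQ's actual on-disk format, so treat the names and shapes as illustrative assumptions rather than the real kernel contract.)

```python
import torch

def qqq_torch_fallback(x, qweight_int4, scales, group_size=128):
    """Illustrative torch-only fallback: dequantize, then matmul.

    x:            (batch, in_features) fp16/bf16 activations
    qweight_int4: (in_features, out_features) int8 tensor holding values in [-8, 7]
                  (assumed already unpacked/unshuffled -- not QQQ's real layout)
    scales:       (in_features // group_size, out_features) per-group scales
    """
    in_features, out_features = qweight_int4.shape
    if group_size == -1:
        # Per-channel scales: one scale row covering the whole input dimension.
        group_size = in_features
    # Broadcast each group's scale across its rows, then dequantize to x's dtype.
    expanded_scales = scales.repeat_interleave(group_size, dim=0)
    w = qweight_int4.to(x.dtype) * expanded_scales
    return x @ w
```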
Signed-off-by: Qubitium <[email protected]>
Signed-off-by: Qubitium <[email protected]>
@Qubitium Thanks for supporting QQQ in GPTQModel. Since QQQ shuffles the weights offline, the shuffled weights cannot be used directly by torch. If we want to run it in torch, we need to convert the weights back to the normal format online, which will cost a lot of time. Do you think that is OK?
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
I see the problem. Can the conversion be a one-time cost in the module init/post_init, or does the conversion need to happen at every forward pass? If only a one-time conversion is required, I think that is an acceptable cost. We can also save the unshuffled state as a different … I think torch is a nice kernel to have so people can use it, in contrast to the Marlin kernel, as a stepping stone to spawn off other kernels for the QQQ format on different hardware. But if there is too much work required, we don't need to have it. No worries. We will work on the quant part to get the quantization plumbing connected first. If you have time and you think it's worth it, that can be done later. I have invited you to the repo collaborators so you can push to this branch or others when you see fit.
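(The "one-time conversion" idea above could look roughly like this: undo the offline shuffle once in `post_init()` and cache the result on the module, so the torch fallback never pays the cost per forward pass. `unshuffle_qqq_weight` is a hypothetical placeholder, not an existing API.)

```python
import torch

def unshuffle_qqq_weight(qweight: torch.Tensor) -> torch.Tensor:
    # Hypothetical placeholder: the real conversion would invert QQQ's offline
    # Marlin-style shuffle. Shown as an identity copy just to keep the sketch runnable.
    return qweight.clone()

class QQQTorchLinear(torch.nn.Module):
    def __init__(self, qweight: torch.Tensor, scales: torch.Tensor):
        super().__init__()
        self.register_buffer("qweight", qweight)  # shuffled layout, as loaded from disk
        self.register_buffer("scales", scales)    # (num_groups, out_features)

    def post_init(self):
        # One-time cost at load time: convert the shuffled on-disk layout back to a
        # plain layout that a naive torch kernel can consume, and cache it.
        self.register_buffer("qweight_plain", unshuffle_qqq_weight(self.qweight))

    def forward(self, x):
        # Every forward pass reuses the cached plain-layout weight; no per-call
        # conversion cost is paid.
        group_size = x.shape[-1] // self.scales.shape[0]
        w = self.qweight_plain.to(x.dtype) * self.scales.repeat_interleave(group_size, dim=0)
        return x @ w
```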
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
@HandH1998 Both the quantization (group_size -1 and 128) and inference code are in a working state. I have two questions:
Thanks.
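(A rough usage sketch for the two group sizes mentioned above, following GPTQModel's usual quantize flow; -1 means per-channel scales, 128 means grouped scales. How QQQ is actually selected in `QuantizeConfig` — presumably via the new `QUANT_METHOD.QQQ` / `FORMAT.QQQ` options — is not shown, since the final spelling was not settled in this thread.)

```python
from gptqmodel import GPTQModel, QuantizeConfig

# Small illustrative calibration set; real runs would use a proper dataset.
calibration_dataset = [
    "GPTQModel is a toolkit for model quantization.",
    "QQQ is a W4A8 quantization method with Marlin-based kernels.",
]

# group_size=128 shown here; use group_size=-1 for per-channel quantization.
quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load("meta-llama/Llama-2-7b-hf", quant_config)
model.quantize(calibration_dataset)
model.save("Llama-2-7b-qqq-4bit")
```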
@Qubitium
If rotation is enabled, does the modeling code in vllm/sglang (for example) need to be modified to run rotated QQQ? I don't see the …
Signed-off-by: ZX-ModelCloud <[email protected]>
@Qubitium We only apply rotation offline, which means we fuse the rotation matrix into the linear weight. So there is no need to change the inference code.
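(The algebra behind "fuse the rotation into the weight" can be checked in a few lines: for an orthogonal Q, rotating the incoming activations and folding Q into the weight offline cancel exactly, so the layer output — and therefore the inference code path — is unchanged. This only demonstrates the core identity, not QQQ's actual rotation placement.)

```python
import torch

torch.manual_seed(0)
x = torch.randn(2, 8)                       # activations
W = torch.randn(16, 8)                      # linear weight (out_features, in_features)
Q, _ = torch.linalg.qr(torch.randn(8, 8))   # random orthogonal rotation

y_ref = x @ W.T               # original layer output
y_rot = (x @ Q) @ (W @ Q).T   # rotated activations, rotation fused into W offline

# (x Q)(W Q)^T = x Q Q^T W^T = x W^T, so the outputs match.
print(torch.allclose(y_ref, y_rot, atol=1e-5))  # True
```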
Signed-off-by: ZX-ModelCloud <[email protected]>
Ok, thanks! We will add rotation for QQQ in our next PR and only enable it for models that have been validated for rotation, such as Llama 2 and Qwen 2.
TODO:
- BACKEND.QQQ
- QUANT_METHOD.QQQ
- FORMAT.QQQ

Ref: https://github.com/HandH1998/QQQ
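(A rough idea of what the enum additions in this TODO amount to; hypothetical sketch only — the real `BACKEND`, `QUANT_METHOD`, and `FORMAT` enums already exist in GPTQModel, and the exact values and casing may differ.)

```python
from enum import Enum

class BACKEND(str, Enum):
    # ...existing backends...
    QQQ = "qqq"

class QUANT_METHOD(str, Enum):
    # ...existing methods...
    QQQ = "qqq"

class FORMAT(str, Enum):
    # ...existing formats...
    QQQ = "qqq"
```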