
Conversation

@Qubitium (Collaborator) commented Mar 9, 2025

TODO:

  • Add QQQ to QuantizeConfig
  • Add BACKEND.QQQ, QUANT_METHOD.QQQ, FORMAT.QQQ
  • Compile QQQ
  • Load QQQ
  • Fix validation: QQQ only supports a limited set of in/out feature shapes
  • Quantize
  • Quantize + Save + Load
  • Rotation

Ref: https://github.com/HandH1998/QQQ
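
Once the config/enum plumbing above lands, I expect the user-facing flow to look roughly like this (the QQQ enum values, import paths, and exact field names are my best guess and may still change while this PR is in draft):

```python
# Rough sketch of the intended flow once QUANT_METHOD.QQQ / FORMAT.QQQ /
# BACKEND.QQQ exist; exact import paths and field names may still change.
from gptqmodel import BACKEND, GPTQModel, QuantizeConfig
from gptqmodel.quantization import FORMAT, QUANT_METHOD

quant_config = QuantizeConfig(
    bits=4,                         # QQQ is a W4A8 scheme: 4-bit weights
    group_size=128,                 # or -1 for per-channel quantization
    quant_method=QUANT_METHOD.QQQ,  # new enum value added by this PR
    format=FORMAT.QQQ,              # new checkpoint format added by this PR
)

model = GPTQModel.load("meta-llama/Llama-2-7b-hf", quant_config)
model.quantize(calibration_dataset)  # calibration_dataset supplied by the user
model.save("Llama-2-7B-QQQ")

# Reload the quantized checkpoint with the QQQ (Marlin-based) kernel backend.
model = GPTQModel.load("Llama-2-7B-QQQ", backend=BACKEND.QQQ)
```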

Qubitium added 2 commits March 9, 2025 11:16
Signed-off-by: Qubitium <[email protected]>
Signed-off-by: Qubitium <[email protected]>
Qubitium marked this pull request as draft March 9, 2025 11:18
@Qubitium (Collaborator, Author) commented Mar 9, 2025

Hi @HandH1998, I am officially adding QQQ to GPTQModel. This should allow QQQ to share all the models and auxiliary features that GPTQModel supports. For now, only model loading and inference work. Will move to quantization next.

Load testing passed: https://github.com/ModelCloud/GPTQModel/pull/1402/files#diff-b62c27281879ee8ee111b40ab6604c8828ec77f5ff13fe87684c705827499bde

Is it possible for you to write a simple QQQ kernel in torch? This would serve as a fallback to support all hardware platforms, not just Ampere+. You could copy the existing TorchQuantLinear and rename it to TorchQQQQuantLinear. Thanks!
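
Not asking for anything fancy; roughly what I have in mind for the fallback is below. Everything in the snippet is an illustrative skeleton (class name, buffer layout, helper), not the real GPTQModel module, which would subclass the existing quant-linear base class instead:

```python
# Illustrative skeleton of a portable torch-only QQQ linear. The real module
# would subclass GPTQModel's quant-linear base; layout details are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TorchQQQQuantLinearSketch(nn.Module):
    def __init__(self, in_features: int, out_features: int, bits: int = 4, group_size: int = 128):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.bits = bits
        self.group_size = group_size if group_size != -1 else in_features
        # Packed int weights and per-group scales as stored in the QQQ checkpoint.
        self.register_buffer("qweight", torch.empty(0, dtype=torch.int32))
        self.register_buffer("scales", torch.empty(0, dtype=torch.float16))

    def dequantize(self) -> torch.Tensor:
        # Placeholder: unpack the int4 values and apply per-group scales to
        # recover an fp16 [out_features, in_features] weight; the details
        # depend on the QQQ packing/shuffle layout.
        raise NotImplementedError

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Slow but portable: full dequantize followed by a dense matmul,
        # so it runs on any device torch supports, not just Ampere+.
        return F.linear(x, self.dequantize())
```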

Feel free to contact me on the SGLang Slack (I see you are an active contributor there too!) or on X at qubitium.

Qubitium added 2 commits March 9, 2025 11:47
Signed-off-by: Qubitium <[email protected]>
Signed-off-by: Qubitium <[email protected]>
@HandH1998 (Collaborator)

@Qubitium Thanks for supporting QQQ in GPTQModel. Since QQQ shuffles the weights offline, the shuffled weights are not something plain torch can use directly. If we want to run it in torch, we need to convert the weights back to the normal format online, which will cost a lot of time. Do you think that is OK?

@Qubitium (Collaborator, Author)

@Qubitium Thanks for supporting QQQ in GPTQModel. Since QQQ shuffles the weights offline, the shuffled weights are not something plain torch can use directly. If we want to run it in torch, we need to convert the weights back to the normal format online, which will cost a lot of time. Do you think that is OK?

I see the problem. Can the conversion be a one-time cost in the module init/post_init, or does the conversion need to happen at every forward pass? If only a one-time conversion is required, I think that is an acceptable cost. We can also save the unshuffled state as a different checkpoint_format so there is no runtime cost, if the cost turns out to be too great.
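
To be concrete about what I mean by a one-time cost, something along these lines (the unshuffle helper below is made up for the sketch; the point is only that the conversion runs once at load, not on every forward):

```python
# Sketch only: pay the unshuffle/dequantize cost once in post_init and cache
# the plain weight, so every forward pass is just a dense matmul.
import torch
import torch.nn.functional as F

def unshuffle_qqq_weights(qweight: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: undo QQQ's offline weight shuffle and dequantize
    # back to an fp16 [out_features, in_features] matrix.
    raise NotImplementedError

class QQQTorchFallback(torch.nn.Module):
    def post_init(self):
        # One-time conversion at load time, cached for the lifetime of the module.
        self.weight_fp16 = unshuffle_qqq_weights(self.qweight, self.scales)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No per-call conversion needed.
        return F.linear(x, self.weight_fp16)
```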

I think a torch kernel is nice to have so people can use it, in contrast to the Marlin kernel, as a stepping stone to spin off other kernels for the QQQ format on different hardware. But if it requires too much work, we don't need to have it.

No worries. We will work on the quant part to get the quantization plumbing connected first. If you have time and you think it's worth it, that can be done later. I have invited you to the repo collaborators so you can push to this branch or others as you see fit.

Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
@Qubitium (Collaborator, Author) commented Mar 10, 2025

@HandH1998 Both the quantization (group_size -1 and 128) and inference code are now in a working state.

I have two questions:

  1. Smoothing preprocess: we did not add this. In the code comments you mentioned that smoothing does not generate better models?

  2. Rotation: also not added. Can you explain this a bit? How does it improve model accuracy?

Thanks.

@HandH1998 (Collaborator)

@Qubitium
Answer for your two questions:

  1. Smoothing improves model performance a little when combined with GPTQ. If adding it would cost a lot of time, I don't think you need to do it now (a minimal sketch of the idea is below).
  2. Rotation works really well on some models, such as the LLaMA-2 and Qwen2 series. Combining rotation with GPTQ is also what QuaRot does. However, rotation will make some models collapse, as in the cases you can find in my QQQ repo. Despite this, I still think you should add it as an option for users.
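
For context, the smoothing step is the usual SmoothQuant-style scale migration between activations and weights (QQQ's variant tunes the details; how alpha and the activation statistics are chosen is an implementation choice). A minimal sketch of the idea, just to show the transform leaves the layer output unchanged:

```python
# Minimal SmoothQuant-style smoothing sketch for a linear layer y = x @ W.T.
import torch

def smooth_linear(W: torch.Tensor, act_absmax: torch.Tensor, alpha: float = 0.5):
    """W: [out_features, in_features]; act_absmax: per-input-channel max|x|."""
    w_absmax = W.abs().amax(dim=0).clamp(min=1e-5)               # per in-channel weight range
    s = (act_absmax.clamp(min=1e-5) ** alpha) / (w_absmax ** (1 - alpha))
    # Activations are divided by s (folded into the preceding op); weights absorb s.
    return W * s.unsqueeze(0), s

# Sanity check: output unchanged when x is divided by s and W is multiplied by s.
x = torch.randn(8, 16)
W = torch.randn(32, 16)
W_s, s = smooth_linear(W, x.abs().amax(dim=0))
assert torch.allclose(x @ W.T, (x / s) @ W_s.T, atol=1e-5)
```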

@Qubitium (Collaborator, Author) commented Mar 11, 2025

@HandH1998

  Rotation works really well on some models, such as the LLaMA-2 and Qwen2 series. Combining rotation with GPTQ is also what QuaRot does. However, rotation will make some models collapse, as in the cases you can find in my QQQ repo. Despite this, I still think you should add it as an option for users.

If rotation is enabled, does the modeling code in vllm/sglang (for example) need to be modified to run rotated QQQ? I don't see a rotation property stored in the post-quantize config, so based on the config alone it looks like rotation requires no modeling-code changes. But I see that rotation at the quantize stage fuses the layer norms, so does that mean the modeling code needs to change for inference too?

@HandH1998 (Collaborator)

@Qubitium We only apply rotation offline, which means we fuse the rotation matrix into the linear weights. So no changes to the inference code are needed.
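
A small sketch of why the offline fusion works, with a random orthogonal matrix standing in for the Hadamard-style rotation used in practice: because R is orthogonal, folding it into the producer's and consumer's weights leaves the layer output numerically equivalent, so the serving code never needs a rotation op.

```python
# Why an offline-fused rotation needs no inference-code changes: an orthogonal R
# folded into the adjacent weight matrices leaves the final output unchanged.
import torch

torch.manual_seed(0)
d = 64
# Random orthogonal matrix as a stand-in for the Hadamard-style rotation.
R, _ = torch.linalg.qr(torch.randn(d, d, dtype=torch.float64))

W_up = torch.randn(d, d, dtype=torch.float64)    # produces the hidden state, weights [out, in]
W_down = torch.randn(d, d, dtype=torch.float64)  # consumes the hidden state, weights [out, in]
x = torch.randn(4, d, dtype=torch.float64)

y_ref = (x @ W_up.T) @ W_down.T                  # original computation

# Offline fusion: rotate the producer's output space and the consumer's input space.
W_up_rot = R.T @ W_up                            # hidden state becomes h @ R
W_down_rot = W_down @ R                          # consumer absorbs the inverse rotation

y_rot = (x @ W_up_rot.T) @ W_down_rot.T
assert torch.allclose(y_ref, y_rot)              # same output, no runtime rotation op
```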

Qubitium marked this pull request as ready for review March 11, 2025 06:18
@Qubitium (Collaborator, Author)

@Qubitium We only apply rotation offline, which means we fuse the rotation matrix into the linear weights. So no changes to the inference code are needed.

Ok. Thanks! We will add rotation for QQQ in our next PR and only enable it for models that have been validated for rotation, such as Llama 2 and Qwen 2.

Qubitium merged commit 26ae13e into main Mar 11, 2025
4 checks passed
Qubitium deleted the qqq branch March 11, 2025 06:19