[Feature] GLM-45-AIR Support Mix Quantization (Dense wfp8afp8 and wint8 triton_moe_backend) #4051
Conversation
Thanks for your contribution!
fastdeploy/worker/worker_process.py (outdated diff):
```python
if quant_config_name == "wint8" and "Glm4Moe" in model_config.architectures[0]:
    quantization_config["dense_quant_type"] = "wfp8afp8"
    quantization_config["moe_quant_type"] = "wint8"
    quantization_config["quantization"] = "mix_quant"
    quant_config_name = "mix_quant"
```
Couldn't this be configured in config.json instead? The earlier ERNIE case was only written this way because it was a special situation.
I don't think config.json should be changed arbitrarily; the quant settings in config should be strongly tied to the model weights. GLM's dense layers can't use wint8 because of their shapes.
A more reasonable approach would be to pass a dict through the server launch argument --quantization to configure mixed online quantization.
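The reviewer's suggestion of accepting a dict via --quantization could be sketched as below. This is a hypothetical parser, not FastDeploy's actual CLI contract: the argument is treated as either a plain algorithm name (e.g. "wint8") or a JSON dict describing a mixed setup; the dict keys mirror the quantization_config fields from the diff above.

```python
import json

def parse_quantization(arg: str) -> dict:
    """Parse a --quantization value that is either a plain algorithm
    name (e.g. "wint8") or a JSON dict describing mixed quantization.
    The dict format here is an illustrative assumption."""
    try:
        cfg = json.loads(arg)
    except json.JSONDecodeError:
        # Not valid JSON: treat it as a single algorithm name.
        return {"quantization": arg}
    if not isinstance(cfg, dict):
        # JSON scalar (e.g. a bare number): fall back to name handling.
        return {"quantization": arg}
    # A dict implies mixed quantization unless stated otherwise.
    cfg.setdefault("quantization", "mix_quant")
    return cfg
```

With this shape, `--quantization '{"dense_quant_type": "wfp8afp8", "moe_quant_type": "wint8"}'` would select mix_quant without any model-name special-casing in worker_process.py, while existing single-name usage keeps working.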
```python
if not self.quant_config.is_checkpoint_bf16:
    return
```
Is returning directly here reasonable? Can this case actually occur?
The cutlass weight_only path also has this check. Currently wint8 is always quantized online, so this branch should not be reached; it would be taken if the weights had been quantized offline.
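The guard being discussed splits two loading paths: checkpoints stored in BF16 are quantized online at load time, while offline-quantized checkpoints skip that step. A minimal NumPy sketch of that split, with symmetric per-channel int8 (wint8-style) online quantization (class and attribute names are illustrative, not FastDeploy's API):

```python
import numpy as np

class WeightOnlyLinearMethod:
    """Illustrative sketch of the is_checkpoint_bf16 guard, not
    FastDeploy's actual implementation."""

    def __init__(self, is_checkpoint_bf16: bool):
        self.is_checkpoint_bf16 = is_checkpoint_bf16

    def process_loaded_weights(self, weight: np.ndarray):
        if not self.is_checkpoint_bf16:
            # Offline-quantized checkpoint: weights are already int8,
            # so there is nothing to do here (the early return above).
            return None
        # Online path: symmetric per-output-channel int8 quantization.
        amax = np.maximum(np.abs(weight).max(axis=0), 1e-8)
        scale = amax / 127.0
        qweight = np.round(weight / scale).astype(np.int8)
        return qweight, scale
```

Dequantizing with `qweight * scale` recovers the BF16 weights up to rounding error, which is the property the online wint8 path relies on.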
…8 triton_moe_backend) (PaddlePaddle#4051)
* [Feature] Support zai-org/GLM-4.5-Air BF16 model (#3928)
* support glm45_air
* [Feature] GLM-45-AIR Support Mix Quantization(Dense wfp8afp8 and wint8 triton_moe_backend) (#4051)
* check
* fix v1 load for mix and wint8
* check --quantizations 'None'
* check
* support RL rollout
* check v1 loader
* check glm rollout_model, change wfp8afp8 per_token_cast_to_fp8 to native impl
* check rollout moe gate begin layer_id
* check rollout e_score_correction_bias
* delete infer_to_train_mapping={}
* code check
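One commit above mentions changing wfp8afp8 per_token_cast_to_fp8 to a native implementation. The idea behind per-token FP8 casting can be sketched in plain NumPy as follows: each token (row) gets its own scale so its values fit the float8 e4m3 range, whose largest finite value is 448. This is a simplified sketch; a real kernel would additionally cast the scaled values to an actual fp8 dtype.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3

def per_token_cast_to_fp8(x: np.ndarray):
    """Row-wise (per-token) FP8 scaling sketch: compute a scale per
    row from its absolute maximum, then scale values into the e4m3
    range. Returns the scaled values and the per-row scales."""
    amax = np.abs(x).max(axis=-1, keepdims=True)
    scale = np.maximum(amax, 1e-4) / FP8_E4M3_MAX  # avoid divide-by-zero
    q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale
```

Multiplying the scaled values back by the per-row scales recovers the original activations (exactly here, since no fp8 rounding is modeled), which is what the dense wfp8afp8 matmul's dequantization step depends on.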
GLM-45-AIR Support wint8
Remove duplicate code about noaux_tc op.
Support WFP8AFP8LinearMethod and TritonWeightOnlyMoEMethod v1 loader