
Conversation

@ckl117 (Collaborator) commented Sep 10, 2025

GLM-4.5-Air supports wint8.
Remove duplicate code for the noaux_tc op.
Support WFP8AFP8LinearMethod and the TritonWeightOnlyMoEMethod v1 loader.

export FD_MOE_BACKEND=triton

python -m fastdeploy.entrypoints.openai.api_server \
    --model ${model_path} \
    --max-model-len 32768 \
    --max-num-seqs 33 \
    --tensor-parallel-size 4 \
    --load_choices "default_v1" \
    --quantization '{"quantization":"mix_quant","dense_quant_type":"wfp8afp8","moe_quant_type":"wint8"}'
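The `--quantization` flag above takes either a plain scheme name or a JSON dict for mixed quantization. A minimal sketch of how such a flag could be parsed (`parse_quantization_arg` is a hypothetical helper for illustration, not FastDeploy's actual parser):

```python
import json

def parse_quantization_arg(raw: str) -> dict:
    """Parse a --quantization value.

    A plain string such as "wint8" selects a single scheme; a JSON
    dict configures mixed quantization (dense vs. MoE layers).
    """
    raw = raw.strip()
    if raw.startswith("{"):
        cfg = json.loads(raw)
        # mix_quant needs both a dense and an MoE quant type.
        if cfg.get("quantization") == "mix_quant":
            assert "dense_quant_type" in cfg and "moe_quant_type" in cfg
        return cfg
    return {"quantization": raw}

# Example: the dict form used in the launch command above.
cfg = parse_quantization_arg(
    '{"quantization":"mix_quant","dense_quant_type":"wfp8afp8","moe_quant_type":"wint8"}'
)
```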

@paddle-bot bot commented Sep 10, 2025

Thanks for your contribution!

@ckl117 ckl117 changed the title [Feature] GLM-45-AIR Support wint8 [Feature] GLM-45-AIR Support wint8(Dense wfp8afp8 and wint8 triton_moe_backend) Sep 10, 2025
Comment on lines 730 to 734
if quant_config_name == "wint8" and "Glm4Moe" in model_config.architectures[0]:
    quantization_config["dense_quant_type"] = "wfp8afp8"
    quantization_config["moe_quant_type"] = "wint8"
    quantization_config["quantization"] = "mix_quant"
    quant_config_name = "mix_quant"
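The snippet promotes a plain "wint8" request to mix_quant for Glm4Moe models. A self-contained sketch of that override, restated as a standalone function (the function name is hypothetical; the dict keys follow the snippet above):

```python
def apply_glm_mix_quant_default(quant_config_name, architectures, quantization_config):
    """Sketch of the override under review: for Glm4Moe models, plain
    "wint8" is promoted to mix_quant, because the dense layers cannot
    use wint8 (shape constraints) and fall back to wfp8afp8 while the
    MoE layers keep wint8."""
    if quant_config_name == "wint8" and "Glm4Moe" in architectures[0]:
        quantization_config["dense_quant_type"] = "wfp8afp8"
        quantization_config["moe_quant_type"] = "wint8"
        quantization_config["quantization"] = "mix_quant"
        quant_config_name = "mix_quant"
    return quant_config_name, quantization_config
```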
Collaborator commented:
Can't this be configured in config.json instead? The earlier ERNIE case was only written this way because it was a special case.

@ckl117 (Author) commented Sep 10, 2025

My understanding is that config.json should not be changed casually; its quant settings should be strongly tied to the model weights. For GLM, the dense layers cannot use wint8 because of shape constraints.
The reasonable approach is to pass a dict via the server launch argument --quantization to configure mixed online quantization.

@ckl117 ckl117 changed the title [Feature] GLM-45-AIR Support wint8(Dense wfp8afp8 and wint8 triton_moe_backend) [Feature] GLM-45-AIR Support Mix Quantization(Dense wfp8afp8 and wint8 triton_moe_backend) Sep 11, 2025
Comment on lines +183 to +184
if not self.quant_config.is_checkpoint_bf16:
return
Collaborator commented:

Is it reasonable to return directly here? Can this case actually occur?

@ckl117 (Author) commented:

The cutlass weight-only path has the same check. Currently wint8 is always quantized online, so this branch should not be reached; it would be taken only if the weights had already been quantized offline.
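The reply above can be sketched as an early-return gate in the quant method's weight-processing hook. This is an illustrative sketch only; the class name and attributes follow the discussion, not FastDeploy's actual classes:

```python
from types import SimpleNamespace

class TritonWeightOnlyMoEMethodSketch:
    """Hypothetical sketch of the early-return gate under discussion."""

    def __init__(self, quant_config):
        self.quant_config = quant_config

    def process_weights_after_loading(self, layer):
        # Offline-quantized checkpoints are not bf16: they already carry
        # quantized weights and scales, so there is nothing to do here.
        if not self.quant_config.is_checkpoint_bf16:
            return
        # Online path: the freshly loaded bf16 weights get quantized now
        # (represented here by a flag instead of a real int8 conversion).
        layer.quantized = True

# Online case: bf16 checkpoint gets quantized on load.
method = TritonWeightOnlyMoEMethodSketch(SimpleNamespace(is_checkpoint_bf16=True))
layer = SimpleNamespace(quantized=False)
method.process_weights_after_loading(layer)
```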

yuanlehome previously approved these changes Sep 11, 2025
@ckl117 ckl117 merged commit 4859f40 into PaddlePaddle:develop Sep 11, 2025
55 of 72 checks passed
ckl117 added a commit to ckl117/FastDeploy that referenced this pull request Sep 11, 2025
This was referenced Sep 11, 2025
qingqing01 pushed a commit that referenced this pull request Sep 15, 2025
* [Feature] Support zai-org/GLM-4.5-Air BF16 model (#3928)

* support glm45_air

* [Feature] GLM-45-AIR Support Mix Quantization(Dense wfp8afp8 and wint8 triton_moe_backend) (#4051)

* check

* fix v1 load for mix and wint8

* check --quantizations 'None'

* check

* support RL rollout

* check v1 loader

* check glm rollout_model, change wfp8afp8 per_token_cast_to_fp8 to native impl

* check rollout moe gate begin layer_id

* check rollout e_score_correction_bias

* delete infer_to_train_mapping={}

* code check
