
Conversation

@ckl117 (Collaborator) commented Sep 10, 2025

GLM-4.5-Air supports wint8.
Remove duplicate code for the noaux_tc op.
Support WFP8AFP8LinearMethod and the TritonWeightOnlyMoEMethod v1 loader.

export FD_MOE_BACKEND=triton

python -m fastdeploy.entrypoints.openai.api_server \
    --model ${model_path} \
    --max-model-len 32768 \
    --max-num-seqs 33 \
    --tensor-parallel-size 4 \
    --load_choices "default_v1" \
    --quantization '{"quantization":"mix_quant","dense_quant_type":"wfp8afp8","moe_quant_type":"wint8"}'
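The `--quantization` flag above takes either a plain scheme name or a JSON dict for mixed quantization. A minimal sketch of how such a flag could be parsed (`parse_quantization_arg` is a hypothetical helper for illustration, not FastDeploy's actual parser):

```python
import json

def parse_quantization_arg(raw: str) -> dict:
    """Parse a --quantization value.

    A plain string such as "wint8" selects a single scheme; a JSON
    dict configures mixed quantization (dense vs. MoE layers).
    """
    raw = raw.strip()
    if raw.startswith("{"):
        cfg = json.loads(raw)
        # mix_quant needs both a dense and an MoE quant type.
        if cfg.get("quantization") == "mix_quant":
            assert "dense_quant_type" in cfg and "moe_quant_type" in cfg
        return cfg
    return {"quantization": raw}

# Example: the dict form used in the launch command above.
cfg = parse_quantization_arg(
    '{"quantization":"mix_quant","dense_quant_type":"wfp8afp8","moe_quant_type":"wint8"}'
)
```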

@paddle-bot bot commented Sep 10, 2025

Thanks for your contribution!

@ckl117 ckl117 changed the title [Feature] GLM-45-AIR Support wint8 [Feature] GLM-45-AIR Support wint8(Dense wfp8afp8 and wint8 triton_moe_backend) Sep 10, 2025
Comment on lines 730 to 734
if quant_config_name == "wint8" and "Glm4Moe" in model_config.architectures[0]:
    quantization_config["dense_quant_type"] = "wfp8afp8"
    quantization_config["moe_quant_type"] = "wint8"
    quantization_config["quantization"] = "mix_quant"
    quant_config_name = "mix_quant"
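The snippet promotes a plain "wint8" request to mix_quant for Glm4Moe models. A self-contained sketch of that override, restated as a standalone function (the function name is hypothetical; the dict keys follow the snippet above):

```python
def apply_glm_mix_quant_default(quant_config_name, architectures, quantization_config):
    """Sketch of the override under review: for Glm4Moe models, plain
    "wint8" is promoted to mix_quant, because the dense layers cannot
    use wint8 (shape constraints) and fall back to wfp8afp8 while the
    MoE layers keep wint8."""
    if quant_config_name == "wint8" and "Glm4Moe" in architectures[0]:
        quantization_config["dense_quant_type"] = "wfp8afp8"
        quantization_config["moe_quant_type"] = "wint8"
        quantization_config["quantization"] = "mix_quant"
        quant_config_name = "mix_quant"
    return quant_config_name, quantization_config
```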
Collaborator commented:
Can't this be configured in config.json instead? The earlier ERNIE case was only written this way because it was a special case.

@ckl117 (Author) commented Sep 10, 2025

My understanding is that config.json should not be changed casually; its quant settings should be strongly tied to the model weights. For GLM, the dense layers cannot use wint8 because of shape constraints.
The reasonable approach is to pass a dict via the server launch argument --quantization to configure mixed online quantization.

@ckl117 ckl117 changed the title [Feature] GLM-45-AIR Support wint8(Dense wfp8afp8 and wint8 triton_moe_backend) [Feature] GLM-45-AIR Support Mix Quantization(Dense wfp8afp8 and wint8 triton_moe_backend) Sep 11, 2025
Comment on lines +183 to +184
if not self.quant_config.is_checkpoint_bf16:
return
Collaborator commented:

Is it reasonable to return directly here? Can this case actually occur?

@ckl117 (Author) commented:

The cutlass weight-only path has the same check. Currently wint8 is always quantized online, so this branch should not be reached; it would be taken only if the weights had already been quantized offline.
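The reply above can be sketched as an early-return gate in the quant method's weight-processing hook. This is an illustrative sketch only; the class name and attributes follow the discussion, not FastDeploy's actual classes:

```python
from types import SimpleNamespace

class TritonWeightOnlyMoEMethodSketch:
    """Hypothetical sketch of the early-return gate under discussion."""

    def __init__(self, quant_config):
        self.quant_config = quant_config

    def process_weights_after_loading(self, layer):
        # Offline-quantized checkpoints are not bf16: they already carry
        # quantized weights and scales, so there is nothing to do here.
        if not self.quant_config.is_checkpoint_bf16:
            return
        # Online path: the freshly loaded bf16 weights get quantized now
        # (represented here by a flag instead of a real int8 conversion).
        layer.quantized = True

# Online case: bf16 checkpoint gets quantized on load.
method = TritonWeightOnlyMoEMethodSketch(SimpleNamespace(is_checkpoint_bf16=True))
layer = SimpleNamespace(quantized=False)
method.process_weights_after_loading(layer)
```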

yuanlehome previously approved these changes Sep 11, 2025
@ckl117 ckl117 merged commit 4859f40 into PaddlePaddle:develop Sep 11, 2025
55 of 72 checks passed
ckl117 added a commit to ckl117/FastDeploy that referenced this pull request Sep 11, 2025
This was referenced Sep 11, 2025
qingqing01 pushed a commit that referenced this pull request Sep 15, 2025
* [Feature] Support zai-org/GLM-4.5-Air BF16 model (#3928)

* support glm45_air

* [Feature] GLM-45-AIR Support Mix Quantization(Dense wfp8afp8 and wint8 triton_moe_backend) (#4051)

* check

* fix v1 load for mix and wint8

* check --quantizations 'None'

* check

* support RL rollout

* check v1 loader

* check glm rollout_model, change wfp8afp8 per_token_cast_to_fp8 to native impl

* check rollout moe gate begin layer_id

* check rollout e_score_correction_bias

* delete infer_to_train_mapping={}

* code check
