@ckl117 ckl117 commented Sep 15, 2025

  • Fix CUDA error 700 (illegal memory access) raised by the noaux_tc op under CUDAGraph (GLM45-Air)
    -- Add a renormalize input to the noaux_tc op that specifies whether top-k normalization is required.
    -- Support stable sort

  • Optimize the per_token_quant_fp8 kernel; performance improves by 50%

  • Support Wfp8Afp8MoEMethod (channel-wise weight quantization)
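
For readers unfamiliar with the two noaux_tc changes, here is a minimal pure-Python sketch of what a `renormalize` switch and a stable sort mean for top-k expert selection. It is not the CUDA op from this PR: the function name `topk_route` and its signature are hypothetical, and expert grouping and bias correction are left out.

```python
def topk_route(scores, k, renormalize=True):
    """Pick the k highest-scoring experts for one token (illustrative only).

    Ties are broken by the lower expert index: Python's sort is stable and
    the candidate indices start in ascending order.
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must be in 1..len(scores)")
    # Stable descending sort: equal scores keep their original index order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    top_ids = order[:k]
    weights = [scores[i] for i in top_ids]
    if renormalize:
        total = sum(weights)
        # Leave the weights untouched if the selection sums to zero.
        if total > 0:
            weights = [w / total for w in weights]
    return top_ids, weights
```

With `renormalize=True` the selected weights sum to 1; with `renormalize=False` the raw scores are returned unchanged. The stable sort makes the result deterministic when several experts share the same score.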

python -m fastdeploy.entrypoints.openai.api_server \
    --model ${model_path} \
    --max-model-len 32768 \
    --max-num-seqs 128 \
    --tensor-parallel-size 1 \
    --load_choices "default_v1" \
    --quantization wfp8afp8
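
As a rough illustration of the scaling behind per-token activation quantization and channel-wise weight quantization, the sketch below computes one scale per row of a matrix. It assumes the FP8 format is e4m3 (largest finite value 448), which this PR does not state. It also assumes a row stands for one token (activations) or one output channel (weights). The helper names are hypothetical, and the final rounding to FP8 is omitted.

```python
FP8_E4M3_MAX = 448.0  # largest finite e4m3 value; the format is an assumption


def quantize_rows(matrix, eps=1e-10):
    """Scale each row into the FP8 range; returns (scaled_rows, scales).

    A row is one token for activations or one output channel for weights.
    Values are only scaled and clipped, not rounded to the FP8 grid.
    """
    scaled, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row)
        # eps keeps the scale non-zero for an all-zero row.
        scale = max(amax, eps) / FP8_E4M3_MAX
        scales.append(scale)
        scaled.append(
            [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in row]
        )
    return scaled, scales


def dequantize_rows(scaled, scales):
    """Undo quantize_rows by multiplying each row by its scale."""
    return [[v * s for v in row] for row, s in zip(scaled, scales)]
```

Because each row carries its own scale, a single outlier token or channel does not reduce the resolution available to the others, which is the usual motivation for per-token and channel-wise schemes over one scale per tensor.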

paddle-bot bot commented Sep 15, 2025

Thanks for your contribution!

@ckl117 ckl117 changed the title from "Optimize per_token_quant_fp8 op" to "Optimize per_token_quant_fp8 OP and Support Wfp8Afp8MoEMethod" Sep 18, 2025
@ckl117 ckl117 changed the title from "Optimize per_token_quant_fp8 OP and Support Wfp8Afp8MoEMethod" to "Fix group_topk cuda Error 700 in CUDAGraph and Add wfp8apf8 moe quant method" Sep 19, 2025
@ckl117 ckl117 changed the title from "Fix group_topk cuda Error 700 in CUDAGraph and Add wfp8apf8 moe quant method" to "Fix noaux_tc cuda Error 700 in CUDAGraph and Add wfp8apf8 moe quant method" Sep 19, 2025
@qingqing01 qingqing01 left a comment

In follow-up work, please strengthen unit-test coverage of the cases where this bug occurred.

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit f38b174 into PaddlePaddle:release/2.2 Sep 22, 2025
23 of 27 checks passed