Conversation

@ckl117 (Collaborator) commented Sep 23, 2025

Cherry-pick from 2.2 PR #4115

  • Optimize the per_token_quant_fp8 kernel, improving its performance by 50%
  • Support Wfp8Afp8MoEMethod (channel-wise weight quantization)
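Channel-wise weight quantization keeps one FP8 scale per output channel (row) of each weight matrix, instead of a single scale for the whole tensor. A minimal numpy sketch of that math, assuming a `[out_channels, in_channels]` weight layout; the function name and shapes are illustrative, not FastDeploy's internal API, and real kernels emit actual FP8 values rather than clipped floats:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def quant_weight_channel_wise(w: np.ndarray, eps: float = 1e-12):
    """Simulate FP8 quantization of a [out_channels, in_channels] weight
    with one scale per output channel (row)."""
    # per-row max-abs, guarded against all-zero rows
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), eps) / FP8_E4M3_MAX
    q = np.clip(w / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale  # dequantize as q * scale

w = np.random.randn(16, 64).astype(np.float32)
q, s = quant_weight_channel_wise(w)
assert np.abs(q).max() <= FP8_E4M3_MAX
```

A per-channel scale tracks the dynamic range of each output channel separately, which loses less precision than a single per-tensor scale when channel magnitudes differ widely.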
python -m fastdeploy.entrypoints.openai.api_server \
    --model ${model_path} \
    --max-model-len 32768 \
    --max-num-seqs 128 \
    --tensor-parallel-size 1 \
    --load_choices "default_v1" \
    --quantization wfp8afp8
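For reference, the per_token_quant_fp8 kernel computes one scale per token (row) of the activation tensor. A hedged numpy emulation of that computation, assuming max-abs scaling to the FP8 E4M3 range; the actual optimized kernel runs on GPU and produces real FP8 outputs, while this only simulates the scaling and clipping:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def per_token_quant_fp8(x: np.ndarray, eps: float = 1e-12):
    """One scale per token: max-abs over the hidden dimension,
    then scale into the FP8 range and clip."""
    scale = np.maximum(np.abs(x).max(axis=-1, keepdims=True), eps) / FP8_E4M3_MAX
    q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale  # dequantize as q * scale

x = np.random.randn(4, 128).astype(np.float32)
q, s = per_token_quant_fp8(x)
assert np.allclose(q * s, x, atol=1e-4)
```

Because every token gets its own scale, outlier tokens do not force the quantization range of the whole batch, which is why per-token quantization is the usual choice for FP8 activations.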

@paddle-bot

paddle-bot bot commented Sep 23, 2025

Thanks for your contribution!

@ckl117 ckl117 changed the title Dev moe wfp8afp8 [OPs] MoE support wfp8afp8(channelwise) and improve per_token_quant_fp8 Sep 24, 2025
@ckl117 ckl117 merged commit 7c1fd19 into PaddlePaddle:develop Sep 24, 2025
26 of 28 checks passed