Move create_parameters to __init__ in FuseMOE for CultassBackend and TritonBackend #3148
Conversation
Thanks for your contribution!
        for name, tensor in name_tensor_map.items():
    -       create_and_set_parameter(layer, name, tensor)
    +       getattr(layer, name).set_value(tensor)
This change is inside the process_prequanted_weights method. Instead of adding a separate load_prequanted_weights method for the offline-quantized path, the original create_and_set_parameter call is replaced with a direct set_value.
The reason for this change: whether quantization is offline or online, the weight shapes are the same, so it is enough to create the weights once at initialization through a single create_weight call.
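The pattern being described (create parameters in __init__, only set_value at load time) can be sketched with a minimal stand-in. The Parameter and layer classes below are simplified placeholders, not FastDeploy or Paddle code; create_weight and process_prequanted_weights mirror the names used in the PR:

```python
# Minimal sketch of "create in __init__, set_value on load".
class Parameter:
    def __init__(self, shape):
        self.shape = tuple(shape)
        self.data = None

    def set_value(self, values):
        # The parameter already exists with its final shape, so loading
        # only fills values; no create_and_set_parameter call is needed.
        assert len(values) == self.shape[0], "shape mismatch at load time"
        self.data = values


class FusedMoELayer:
    def __init__(self, num_experts, hidden_size):
        # create_weight: shapes are identical for offline and online
        # quantization, so a single creation path in __init__ suffices.
        self.moe_ffn1_weight = Parameter((num_experts, hidden_size))

    def process_prequanted_weights(self, name_tensor_map):
        # Offline-quantized load: direct set_value, as in the diff above.
        for name, tensor in name_tensor_map.items():
            getattr(self, name).set_value(tensor)


layer = FusedMoELayer(num_experts=2, hidden_size=4)
layer.process_prequanted_weights({"moe_ffn1_weight": [[1, 2, 3, 4], [5, 6, 7, 8]]})
```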
            )
            self.gate_correction_bias.set_value(gate_correction_bias_tensor)
        else:
            self.gate_correction_bias = None
Is this None assignment redundant?
When testing triton tensor_wise_fp8, this variable has to become None. The create_weight method now pre-creates self.gate_correction_bias, so it must be reset to None here, otherwise that path fails to run.
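A hedged sketch of why the reset is needed: create_weight now eagerly creates gate_correction_bias in __init__ for every quant method, so paths that never use it (here, tensor-wise FP8) must restore it to None. The class and method names below are illustrative only, not the real FastDeploy code:

```python
class MoELayerSketch:
    def __init__(self, uses_gate_bias):
        # create_weight pre-creates the bias parameter for every quant
        # method, even ones that will never use it.
        self.gate_correction_bias = "placeholder-parameter"
        self.uses_gate_bias = uses_gate_bias

    def load_state_dict(self, bias_tensor=None):
        if self.uses_gate_bias:
            self.gate_correction_bias = bias_tensor
        else:
            # tensor_wise_fp8 path: restore None so downstream code that
            # checks `bias is None` still takes the no-bias branch.
            self.gate_correction_bias = None


fp8_layer = MoELayerSketch(uses_gate_bias=False)
fp8_layer.load_state_dict()

biased_layer = MoELayerSketch(uses_gate_bias=True)
biased_layer.load_state_dict([0.1, 0.2])
```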
        if getattr(self.fd_config.quant_config, "is_permuted", True):
            self.quant_method.process_prequanted_weights(self, state_dict)
        else:
            self.gate_correction_bias = None
Same as above.
        if self.use_ep:
    -       self.weight = self.create_parameter(
    +       self.linear = self.create_parameter(
Why was this renamed to linear?
This was renamed following Binhan's suggestion, but I think we could keep weight: when calling set_value, check whether a linear attribute exists; if it does, set linear's value, otherwise set weight's value. cc @bukejiyu
I think it should be changed back to weight. Looking at the current mapping, if it is renamed to linear, the EP path may fail to match the parameter when setting weights.
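The fallback discussed above (keep weight, but probe for linear first when setting values) could look like the sketch below. The helper name set_ep_param and the _Param stand-in are hypothetical, for illustration only:

```python
class _Param:
    def __init__(self):
        self.value = None

    def set_value(self, tensor):
        self.value = tensor


def set_ep_param(layer, tensor):
    # Hypothetical helper: prefer `linear` if the layer defines it,
    # otherwise fall back to `weight`, so the parameter mapping
    # finds a target either way.
    target = "linear" if hasattr(layer, "linear") else "weight"
    getattr(layer, target).set_value(tensor)


class OldLayer:
    def __init__(self):
        self.weight = _Param()


class NewLayer:
    def __init__(self):
        self.linear = _Param()
        self.weight = _Param()  # both present: `linear` wins


old = OldLayer()
set_ep_param(old, [1.0, 2.0])

new = NewLayer()
set_ep_param(new, [3.0])
```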
done
        from fastdeploy.model_executor.layers.moe.fused_moe_cutlass_backend import (
            CutlassMoEMethod,
        )
        from fastdeploy.model_executor.layers.moe.fused_moe_triton_backend import (
            BlockWiseFP8MoEMethod,
            TensorWiseFP8MoEMethod,
            TritonWeightOnlyMoEMethod,
        )
I don't recommend moving these Method classes into moe.py; it introduces hardware coupling. When a non-GPU platform uses moe.py, these imports will break, won't they?
The intent is for this to affect only the triton and cutlass backends. We could temporarily put this detection logic in a separate file and delete that file once all backends are migrated. But moe.py would still have to import that file, so the coupling would still be pulled in...
Couldn't you just use self.quant_method directly here?
That works too; delete it once all backends are migrated, but add a TODO to mark it, and guard these imports with a platform check.
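A sketch of the guarded import plus TODO the reviewer asks for. The try/except ImportError shape and the helper name are assumptions; FastDeploy may instead expose a dedicated platform-check utility:

```python
# TODO(sketch): delete this shim once every backend creates parameters
# in __init__; only the Cutlass/Triton backends need the new path now.
try:
    # This import only succeeds on platforms that ship the GPU backends.
    from fastdeploy.model_executor.layers.moe.fused_moe_cutlass_backend import (
        CutlassMoEMethod,
    )

    _NEW_CREATE_METHODS = (CutlassMoEMethod,)
except ImportError:
    # Non-GPU platforms: no coupling; keep the old weight-creation path.
    _NEW_CREATE_METHODS = ()


def uses_new_weight_creation(quant_method):
    # True only for quant methods that create parameters in __init__.
    return isinstance(quant_method, _NEW_CREATE_METHODS)
```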
OK, I'll rework this after resolving the conflicts.
done
pcard-71500
Currently the MoE layer creates its weights only after the safetensors files have been loaded and sharded. To support the model-loader refactor, weight creation needs to move forward into the MoE layer's initialization phase. Besides the MoE weights themselves, this also covers the parameters created by the various quantization methods.
This PR reworks quantization-parameter initialization for the MoE Cutlass and Triton backends: parameters are created during __init__, and the load phase only calls set_value. For now, only quantization methods running on the Cutlass or Triton backend take the new weight-creation path; other backends keep the original logic. This keeps the framework stable and avoids introducing a large batch of risky changes at once.
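The rollout strategy described above (new path only for Cutlass/Triton, old path everywhere else) can be sketched as a simple dispatch. The backend labels, helper names, and stand-in classes below are illustrative assumptions, not FastDeploy's actual API:

```python
NEW_PATH_BACKENDS = {"cutlass", "triton"}  # assumed labels, for illustration


def load_moe_weights(layer, backend, state_dict):
    if backend in NEW_PATH_BACKENDS:
        # New path: parameters were created in __init__; load only fills them.
        for name, tensor in state_dict.items():
            getattr(layer, name).set_value(tensor)
    else:
        # Old path: parameters are still created at load time.
        for name, tensor in state_dict.items():
            setattr(layer, name, tensor)


class _P:
    def __init__(self):
        self.value = None

    def set_value(self, tensor):
        self.value = tensor


class _Layer:
    def __init__(self):
        self.w = _P()  # pre-created, as the new path expects


new_lay = _Layer()
load_moe_weights(new_lay, "cutlass", {"w": [1, 2]})

old_lay = _Layer()
load_moe_weights(old_lay, "xpu", {"w2": [3]})
```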
Test results:
cutlass:
bf16, wint4, wint8, and w4a8 all match develop.
triton:
wint8, block_wise_fp8, and tensor_wise_fp8 all match develop.
Results after the 8.6 merge: after merging develop, cutlass bf16, wint4, and wint8 match; w4a8 fails to run; triton wint8 matches; both fp8 quantizations are broken (same as develop). After merging the new fix PR on 8.7, cutlass w4a8 matches, and triton block_wise_fp8 and tensor_wise_fp8 also match.
cutlass test output is as follows:
triton test output (launched with FD_MOE_BACKEND="triton" and FD_USE_DEEP_GEMM=0 to ensure the triton backend is used) is as follows:
Testing was done via offline inference (TP4 for w4a8, TP1 for everything else); the test command is: