Conversation

@zeroRains (Contributor) commented Aug 1, 2025

pcard-71500

Currently, the MOE layer creates its weights only after the safetensors files have been loaded and sharded. To support the model-loader refactoring, weight creation needs to move forward to the MOE layer's initialization stage. Besides the MOE weights themselves, this also covers creating the various quantization parameters.

This PR reworks how the MOE Cutlass and Triton backends initialize their quantization parameters: every parameter is created during the init stage, and the load stage only calls set_value. For now, only quantization methods backed by Cutlass or Triton take the new weight-creation path; all other backends keep the original logic, which keeps the framework stable and avoids introducing errors through large-scale changes.
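
A minimal sketch of the new init/load split (the layer name, parameter names, and shapes below are simplified assumptions for illustration, not the actual FusedMoE code):

import paddle
from paddle import nn

class ToyFusedMoE(nn.Layer):
    """Toy stand-in for the FusedMoE layer; names and shapes are made up."""

    def __init__(self, num_experts: int, hidden: int, inter: int):
        super().__init__()
        # init stage: parameters (weights and any quant scales) are created
        # up front with their final shapes, before any checkpoint is read.
        self.ffn1_weight = self.create_parameter(
            shape=[num_experts, hidden, inter * 2], dtype="float32"
        )
        self.ffn1_weight_scale = self.create_parameter(
            shape=[num_experts, inter * 2], dtype="float32"
        )

    def load_weights(self, state_dict: dict):
        # load stage: no parameter creation, only set_value into the
        # parameters that already exist.
        self.ffn1_weight.set_value(state_dict["ffn1_weight"])
        self.ffn1_weight_scale.set_value(state_dict["ffn1_weight_scale"])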

Test results:

cutlass:
bf16, wint4, wint8, and w4a8 all match develop.

triton:
wint8, block_wise_fp8, and tensor_wise_fp8 all match develop.

8/6, after merging develop: cutlass bf16, wint4, and wint8 match; w4a8 fails at runtime; triton wint8 matches, and both fp8 quantizations have problems (same on develop).
8/7, after the new fix PR was merged: cutlass w4a8 matches, and triton block_wise_fp8 and tensor_wise_fp8 also match.

The cutlass test outputs are as follows:

ERNIE-4.5-21B-A3B-Paddled tp1 + various quant
base no-q : ?\n\nI'm a language model AI, created by OpenAI. My purpose is to assist and engage in conversation with users like you
pr   no-q : ?\n\nI'm a language model AI, created by OpenAI. My purpose is to assist and engage in conversation with users like you

base wint8: ?\n\nI'm a language model AI, created by OpenAI. I'm designed to understand and generate human-like text based
pr   wint8: ?\n\nI'm a language model AI, created by OpenAI. I'm designed to understand and generate human-like text based

base wint4: ?\n有吗?\n\n你似乎在询问“你是谁?”或者“你有何身份?”,但在这个上下文中,我们并没有一个具体的
pr   wint4: ?\n有吗?\n\n你似乎在询问“你是谁?”或者“你有何身份?”,但在这个上下文中,我们并没有一个具体的

ERNIE-4.5-21B-A3B-Paddled ep+dp4 + various quant
base wint8: ?\n\nI'm a language model AI, created by OpenAI. My purpose is to assist and engage in conversation with users like you
pr   wint8: ?\n\nI'm a language model AI, created by OpenAI. My purpose is to assist and engage in conversation with users like you
base wint4: ?\n有吗?\n\n你似乎在询问“你是谁?”或者“你有何身份?”,但在这个上下文中,我们并没有一个具体的
pr   wint4: ?\n有吗?\n\n你似乎在询问“你是谁?”或者“你有何身份?”,但在这个上下文中,我们并没有一个具体的

ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle tp4 
base w4a8 : ?\n\nI'm a software engineer and a musician. I've been working in the tech industry for over 15 years,
pr   w4a8 : ?\n\nI'm a software engineer and a musician. I've been working in the tech industry for over 15 years,

The triton test outputs (with FD_MOE_BACKEND="triton" and FD_USE_DEEP_GEMM=0 set at startup to make sure the Triton backend is used) are as follows:

ERNIE-4.5-21B-A3B-Paddled tp1 + various quant
base wint8    : ?\n\nI'm a passionate developer with a background in computer science and a keen interest in artificial intelligence and machine learning. I love exploring
pr   wint8    : ?\n\nI'm a passionate developer with a background in computer science and a keen interest in artificial intelligence and machine learning. I love exploring
base block_fp8: ?\n\nI'm a software engineer with a passion for building scalable and efficient systems. I specialize in backend development, particularly with Python
pr   block_fp8: ?\n\nI'm a software engineer with a passion for building scalable and efficient systems. I specialize in backend development, particularly with Python

ERNIE-45-Turbo-fp8 tp8
base tensor_fp8: ?\n\nI'm a software engineer, a maker, a tinkerer, a photographer, a cyclist, a husband, a father
pr   tensor_fp8: ?\n\nI'm a software engineer, a maker, a tinkerer, a photographer, a cyclist, a husband, a father

Testing was done via offline inference (TP4 for w4a8, TP1 for everything else). The test script:

from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM

model_name_or_path = "/root/paddlejob/workspace/env_run/output/Models/ERNIE-4.5-300B-A47B-W4A8C8-TP4-Paddle" #w4a8 
# model_name_or_path = "/root/paddlejob/workspace/env_run/output/Models/ERNIE-4.5-21B-A3B-Paddled"

sampling_params = SamplingParams(temperature=0.1, max_tokens=30, top_p=0)
#  quantization="wint4" , "block_wise_fp8", "w4a8", "wint8"
llm = LLM(model=model_name_or_path, tensor_parallel_size=4, quantization="w4a8")
output = llm.generate(prompts="who are you",
                      use_tqdm=True,
                      sampling_params=sampling_params)
print(output)
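
For the triton runs, the backend was forced via the two environment variables mentioned above; a minimal setup sketch, set before importing fastdeploy:

import os

# Force the Triton MoE backend and disable DeepGEMM, per the notes above.
os.environ["FD_MOE_BACKEND"] = "triton"
os.environ["FD_USE_DEEP_GEMM"] = "0"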

@paddle-bot commented Aug 1, 2025

Thanks for your contribution!

@zeroRains changed the title from "move create_parameters to __init__ in FuseMOE for CultassBackend" to "Move create_parameters to __init__ in FuseMOE for CultassBackend" Aug 1, 2025
Comment on lines 655 to +656

  for name, tensor in name_tensor_map.items():
-     create_and_set_parameter(layer, name, tensor)
+     getattr(layer, name).set_value(tensor)
zeroRains (Contributor, Author):

This change is inside the process_prequanted_weights method. For this offline quantization path I did not add a separate load_prequanted_weights method; instead, the original create_and_set_parameter call is changed to a direct set_value.

The reasoning: whether quantization is offline or online, the weight shapes are the same, so at initialization it is enough to create all the weights uniformly via the create_weight method.
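
A rough sketch of the before/after at this call site (the body of create_and_set_parameter is an assumption for illustration, not the actual helper):

# Before: each load step created the parameter and filled it in one go.
def create_and_set_parameter(layer, name, tensor):
    setattr(
        layer,
        name,
        layer.create_parameter(shape=tensor.shape, dtype=tensor.dtype),
    )
    getattr(layer, name).set_value(tensor)

# After: create_weight() has already built the parameter in __init__ with
# the same shape, so both offline and online paths only need:
# getattr(layer, name).set_value(tensor)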

@zeroRains changed the title from "Move create_parameters to __init__ in FuseMOE for CultassBackend" to "Move create_parameters to __init__ in FuseMOE for CultassBackend and TritonBackend" Aug 3, 2025
    )
    self.gate_correction_bias.set_value(gate_correction_bias_tensor)
else:
    self.gate_correction_bias = None
Collaborator:

Is this None assignment redundant?

zeroRains (Contributor, Author) commented Aug 6, 2025:

When testing triton tensor_wise_fp8, this variable needs to become None. The create_weight method has already created self.gate_correction_bias in advance, so it must be reset to None here, otherwise the run fails.

if getattr(self.fd_config.quant_config, "is_permuted", True):
    self.quant_method.process_prequanted_weights(self, state_dict)
else:
    self.gate_correction_bias = None
Collaborator:

Same as above.


  if self.use_ep:
-     self.weight = self.create_parameter(
+     self.linear = self.create_parameter(
Collaborator:

Why was this renamed to linear?

zeroRains (Contributor, Author):

This was changed following Binhan's suggestion, but I think weight could be kept: when calling set_value earlier, check whether linear exists; if it does, set linear's value, otherwise set weight's value. cc @bukejiyu
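
A minimal sketch of that fallback (hypothetical, just to illustrate the suggestion):

# Set into `linear` when the layer defines it, otherwise fall back to `weight`.
param = layer.linear if hasattr(layer, "linear") else layer.weight
param.set_value(tensor)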

Collaborator:

I think it should be changed back to weight. I looked at the current mapping: with the name linear, EP may fail to match the parameter when setting weights.

zeroRains (Contributor, Author):

done

Comment on lines 22 to 29
from fastdeploy.model_executor.layers.moe.fused_moe_cutlass_backend import (
CutlassMoEMethod,
)
from fastdeploy.model_executor.layers.moe.fused_moe_triton_backend import (
BlockWiseFP8MoEMethod,
TensorWiseFP8MoEMethod,
TritonWeightOnlyMoEMethod,
)
Collaborator:

I don't recommend moving these Methods into moe.py; it introduces hardware coupling. When a non-GPU hardware environment uses moe.py, won't this be a problem?

zeroRains (Contributor, Author) commented Aug 6, 2025:

This was done so that it only takes effect for the triton and cutlass backends. The check could temporarily live in a separate file, to be deleted once all the backends are migrated. But calling it from moe.py would still require importing that file, so the dependency would still be pulled in...

Collaborator:

Can't self.quant_method be used directly here?

Collaborator:

That works too; delete it once all the backends are migrated. But add a TODO to mark it, and the import here needs to check the platform condition.
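
A minimal sketch of the guarded import being asked for (current_platform is an assumed helper here, not confirmed from this PR):

# TODO: remove once every MoE backend creates weights in __init__.
from fastdeploy.platforms import current_platform  # assumed helper

if current_platform.is_cuda():
    from fastdeploy.model_executor.layers.moe.fused_moe_cutlass_backend import (
        CutlassMoEMethod,
    )
    from fastdeploy.model_executor.layers.moe.fused_moe_triton_backend import (
        BlockWiseFP8MoEMethod,
        TensorWiseFP8MoEMethod,
        TritonWeightOnlyMoEMethod,
    )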

zeroRains (Contributor, Author):

OK, I'll write that after resolving the conflicts.

zeroRains (Contributor, Author):

done

@Jiang-Jia-Jun merged commit ce1f353 into PaddlePaddle:develop Aug 8, 2025 (11 of 14 checks passed)
@zeroRains deleted the create_w branch Aug 8, 2025 09:01