Move create_parameters to __init__ in FuseMOE for CultassBackend and TritonBackend #3148
Conversation
Thanks for your contribution!
        for name, tensor in name_tensor_map.items():
    -       create_and_set_parameter(layer, name, tensor)
    +       getattr(layer, name).set_value(tensor)
This change is inside the process_prequanted_weights method. Instead of adding a separate load_prequanted_weights method for the offline-quantized path, the original create_and_set_parameter call is replaced with a direct set_value.
The reason for this change: whether quantization is offline or online, the weight shapes are the same, so it is enough to create the weights once at initialization through a single create_weight call.
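The pattern being described (create parameters in __init__, only set_value at load time) can be sketched with a minimal stand-in. The Parameter and layer classes below are simplified placeholders, not FastDeploy or Paddle code; create_weight and process_prequanted_weights mirror the names used in the PR:

```python
# Minimal sketch of "create in __init__, set_value on load".
class Parameter:
    def __init__(self, shape):
        self.shape = tuple(shape)
        self.data = None

    def set_value(self, values):
        # The parameter already exists with its final shape, so loading
        # only fills values; no create_and_set_parameter call is needed.
        assert len(values) == self.shape[0], "shape mismatch at load time"
        self.data = values


class FusedMoELayer:
    def __init__(self, num_experts, hidden_size):
        # create_weight: shapes are identical for offline and online
        # quantization, so a single creation path in __init__ suffices.
        self.moe_ffn1_weight = Parameter((num_experts, hidden_size))

    def process_prequanted_weights(self, name_tensor_map):
        # Offline-quantized load: direct set_value, as in the diff above.
        for name, tensor in name_tensor_map.items():
            getattr(self, name).set_value(tensor)


layer = FusedMoELayer(num_experts=2, hidden_size=4)
layer.process_prequanted_weights({"moe_ffn1_weight": [[1, 2, 3, 4], [5, 6, 7, 8]]})
```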
            )
            self.gate_correction_bias.set_value(gate_correction_bias_tensor)
        else:
            self.gate_correction_bias = None
Is this None assignment redundant?
When testing triton tensor_wise_fp8, this variable has to become None. The create_weight method now pre-creates self.gate_correction_bias, so it must be reset to None here, otherwise that path fails to run.
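A hedged sketch of why the reset is needed: create_weight now eagerly creates gate_correction_bias in __init__ for every quant method, so paths that never use it (here, tensor-wise FP8) must restore it to None. The class and method names below are illustrative only, not the real FastDeploy code:

```python
class MoELayerSketch:
    def __init__(self, uses_gate_bias):
        # create_weight pre-creates the bias parameter for every quant
        # method, even ones that will never use it.
        self.gate_correction_bias = "placeholder-parameter"
        self.uses_gate_bias = uses_gate_bias

    def load_state_dict(self, bias_tensor=None):
        if self.uses_gate_bias:
            self.gate_correction_bias = bias_tensor
        else:
            # tensor_wise_fp8 path: restore None so downstream code that
            # checks `bias is None` still takes the no-bias branch.
            self.gate_correction_bias = None


fp8_layer = MoELayerSketch(uses_gate_bias=False)
fp8_layer.load_state_dict()

biased_layer = MoELayerSketch(uses_gate_bias=True)
biased_layer.load_state_dict([0.1, 0.2])
```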
        if getattr(self.fd_config.quant_config, "is_permuted", True):
            self.quant_method.process_prequanted_weights(self, state_dict)
        else:
            self.gate_correction_bias = None
Same as above.
        if self.use_ep:
    -       self.weight = self.create_parameter(
    +       self.linear = self.create_parameter(
Why was this renamed to linear?
This was renamed following Binhan's suggestion, but I think we could keep weight: when calling set_value, check whether a linear attribute exists; if it does, set linear's value, otherwise set weight's value. cc @bukejiyu
I think it should be changed back to weight. Looking at the current mapping, if it is renamed to linear, the EP path may fail to match the parameter when setting weights.
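The fallback discussed above (keep weight, but probe for linear first when setting values) could look like the sketch below. The helper name set_ep_param and the _Param stand-in are hypothetical, for illustration only:

```python
class _Param:
    def __init__(self):
        self.value = None

    def set_value(self, tensor):
        self.value = tensor


def set_ep_param(layer, tensor):
    # Hypothetical helper: prefer `linear` if the layer defines it,
    # otherwise fall back to `weight`, so the parameter mapping
    # finds a target either way.
    target = "linear" if hasattr(layer, "linear") else "weight"
    getattr(layer, target).set_value(tensor)


class OldLayer:
    def __init__(self):
        self.weight = _Param()


class NewLayer:
    def __init__(self):
        self.linear = _Param()
        self.weight = _Param()  # both present: `linear` wins


old = OldLayer()
set_ep_param(old, [1.0, 2.0])

new = NewLayer()
set_ep_param(new, [3.0])
```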
done
        from fastdeploy.model_executor.layers.moe.fused_moe_cutlass_backend import (
            CutlassMoEMethod,
        )
        from fastdeploy.model_executor.layers.moe.fused_moe_triton_backend import (
            BlockWiseFP8MoEMethod,
            TensorWiseFP8MoEMethod,
            TritonWeightOnlyMoEMethod,
        )
I don't recommend moving these Method classes into moe.py; it introduces hardware coupling. When a non-GPU platform uses moe.py, these imports will break, won't they?
The intent is for this to affect only the triton and cutlass backends. We could temporarily put this detection logic in a separate file and delete that file once all backends are migrated. But moe.py would still have to import that file, so the coupling would still be pulled in...
Couldn't you just use self.quant_method directly here?
That works too; delete it once all backends are migrated, but add a TODO to mark it, and guard these imports with a platform check.
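A sketch of the guarded import plus TODO the reviewer asks for. The try/except ImportError shape and the helper name are assumptions; FastDeploy may instead expose a dedicated platform-check utility:

```python
# TODO(sketch): delete this shim once every backend creates parameters
# in __init__; only the Cutlass/Triton backends need the new path now.
try:
    # This import only succeeds on platforms that ship the GPU backends.
    from fastdeploy.model_executor.layers.moe.fused_moe_cutlass_backend import (
        CutlassMoEMethod,
    )

    _NEW_CREATE_METHODS = (CutlassMoEMethod,)
except ImportError:
    # Non-GPU platforms: no coupling; keep the old weight-creation path.
    _NEW_CREATE_METHODS = ()


def uses_new_weight_creation(quant_method):
    # True only for quant methods that create parameters in __init__.
    return isinstance(quant_method, _NEW_CREATE_METHODS)
```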
OK, I'll rework this after resolving the conflicts.
done
pcard-71500
Currently the MoE layer creates its weights only after the safetensors files have been loaded and sharded. To support the model-loader refactor, weight creation needs to move forward into the MoE layer's initialization phase. Besides the MoE weights themselves, this also covers the parameters created by the various quantization methods.
This PR reworks quantization-parameter initialization for the MoE Cutlass and Triton backends: parameters are created during __init__, and the load phase only calls set_value. For now, only quantization methods running on the Cutlass or Triton backend take the new weight-creation path; other backends keep the original logic. This keeps the framework stable and avoids introducing a large batch of risky changes at once.
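The rollout strategy described above (new path only for Cutlass/Triton, old path everywhere else) can be sketched as a simple dispatch. The backend labels, helper names, and stand-in classes below are illustrative assumptions, not FastDeploy's actual API:

```python
NEW_PATH_BACKENDS = {"cutlass", "triton"}  # assumed labels, for illustration


def load_moe_weights(layer, backend, state_dict):
    if backend in NEW_PATH_BACKENDS:
        # New path: parameters were created in __init__; load only fills them.
        for name, tensor in state_dict.items():
            getattr(layer, name).set_value(tensor)
    else:
        # Old path: parameters are still created at load time.
        for name, tensor in state_dict.items():
            setattr(layer, name, tensor)


class _P:
    def __init__(self):
        self.value = None

    def set_value(self, tensor):
        self.value = tensor


class _Layer:
    def __init__(self):
        self.w = _P()  # pre-created, as the new path expects


new_lay = _Layer()
load_moe_weights(new_lay, "cutlass", {"w": [1, 2]})

old_lay = _Layer()
load_moe_weights(old_lay, "xpu", {"w2": [3]})
```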
Test results:
cutlass:
bf16, wint4, wint8, and w4a8 all match develop.
triton:
wint8, block_wise_fp8, and tensor_wise_fp8 all match develop.
Results after the 8.6 merge: after merging develop, cutlass bf16, wint4, and wint8 match; w4a8 fails to run; triton wint8 matches; both fp8 quantizations are broken (same as develop). After merging the new fix PR on 8.7, cutlass w4a8 matches, and triton block_wise_fp8 and tensor_wise_fp8 also match.
cutlass test output is as follows:
triton test output (launched with FD_MOE_BACKEND="triton" and FD_USE_DEEP_GEMM=0 to ensure the triton backend is used) is as follows:
Testing was done via offline inference (TP4 for w4a8, TP1 for everything else); the test command is: