[Iluvatar GPU] Optimize attention and MoE performance #3234
Conversation
Thanks for your contribution!
tianshuo78520a left a comment:
LGTM CI
yongqiangma left a comment:
LGTM
DDDivano left a comment:
LGTM
```diff
-class IluvatarWorker(WorkerBase):
+class IluvatarWorker(GpuWorker):
```
Review comment:

Why does the Iluvatar execution flow have to be coupled with the GPU one?

Reply:

> Why does the Iluvatar execution flow have to be coupled with the GPU one?

When we upgraded from the 0630 version to the latest commit, we found that gpu_model_runner had changed a great deal, while the previously adapted iluvatar_model_runner was a copy of gpu_model_runner: apart from a few imported operators, the two were identical. With this change, future upgrades can reuse the gpu_model_runner flow directly instead of repeating the copy work. If an incompatible flow shows up, we will override the affected member function in iluvatar_model_runner to keep it working.

Reply:

OK. The execution flow has been iterating rapidly since 6.30; once the executor stabilizes, the executors for the different hardware backends will need to be separated again.

Reply:

OK, we'll separate them once it stabilizes.
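A minimal sketch of the reuse-and-override pattern described above; the class and method bodies here are illustrative placeholders, not FastDeploy's actual API:

```python
# Illustrative only: the Iluvatar runner inherits the GPU runner's whole
# execution flow and overrides just the members whose behavior differs.

class GpuModelRunner:
    """Stand-in for the upstream GPU model runner."""

    def execute_model(self):
        # Shared flow: any upstream change here is picked up automatically.
        inputs = self.prepare_inputs()
        return self.forward(inputs)

    def prepare_inputs(self):
        return {"backend": "cuda"}  # placeholder

    def forward(self, inputs):
        return inputs  # placeholder


class IluvatarModelRunner(GpuModelRunner):
    # Only the incompatible member function is overridden; everything else
    # reuses the GPU implementation, avoiding a full copy of the runner.
    def prepare_inputs(self):
        return {"backend": "iluvatar"}  # placeholder


print(IluvatarModelRunner().execute_model())  # -> {'backend': 'iluvatar'}
```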
```python
    def initialize_cache(self, num_gpu_blocks: int) -> None:
        """Propagate the number of GPU cache blocks to the model runner."""
        self.model_runner.update_share_input_block_num(num_gpu_blocks=num_gpu_blocks)


class IluvatarPaddleDisWorkerProc(PaddleDisWorkerProc):
```
Review comment:

The worker proc should not sit at the worker level; please leave a TODO here. The earlier code did not reserve an interface for multiple worker procs; the architecture will be more reasonable once the FastDeploy executor is refactored.

Reply:

OK, I'll update this after the refactor.
```diff
-    ffn1_weight: paddle.Tensor,
-    ffn2_weight: paddle.Tensor,
-    ffn1_bias: Optional[paddle.Tensor],
-    ffn1_scale: Optional[paddle.Tensor],
-    ffn2_scale: Optional[paddle.Tensor],
-    ffn2_in_scale: Optional[paddle.Tensor],
+    up_gate_proj_weight: paddle.Tensor,
+    down_proj_weight: paddle.Tensor,
+    up_gate_proj_bias: Optional[paddle.Tensor],
+    up_gate_proj_scale: Optional[paddle.Tensor],
+    down_proj_scale: Optional[paddle.Tensor],
+    down_proj_in_scale: Optional[paddle.Tensor],
     expert_idx_per_token: Optional[paddle.Tensor],
     quant_method: str,
     used_in_ep_low_latency: bool,
 ):
-    assert ffn1_bias is None
-    assert ffn1_scale is not None
-    assert ffn2_scale is not None
-    assert ffn2_in_scale is None
+    assert up_gate_proj_bias is None
+    assert up_gate_proj_scale is not None
+    assert down_proj_scale is not None
+    assert down_proj_in_scale is None
     assert expert_idx_per_token is None
     assert quant_method in ("weight_only_int8",)  # tuple, so this is a membership test
     assert not used_in_ep_low_latency
     tokens_expert_prefix_sum_cpu = tokens_expert_prefix_sum.to("cpu")
     up_gate_proj_output = paddle.empty(
         [permute_input.shape[0], up_gate_proj_weight.shape[1]],
         dtype=permute_input.dtype,
     )
     group_gemm(
         permute_input,
         tokens_expert_prefix_sum_cpu,
         up_gate_proj_weight,
         up_gate_proj_scale,
         up_gate_proj_output,
     )
     act_out = swiglu(up_gate_proj_output)
     output = paddle.empty([act_out.shape[0], down_proj_weight.shape[1]], dtype=act_out.dtype)
     group_gemm(
         act_out,
         tokens_expert_prefix_sum_cpu,
         down_proj_weight,
         down_proj_scale,
         output,
     )
```
Review comment:

For these, please keep the original naming; the ffn1/ffn2 wording is discouraged in FD.

Reply:

OK.

Reply:

> OK.

Please submit a fix PR.

Reply:

Submitted: #3273
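The grouped-GEMM flow in the hunk above permutes tokens so that each expert's tokens sit in one contiguous slice, runs one up_gate projection GEMM per expert, applies SwiGLU, then runs the down projection the same way. Below is a minimal NumPy sketch of the semantics assumed here; the prefix-sum convention, weight layout, and gate/up split order are assumptions, not the kernel's actual contract:

```python
import numpy as np

def group_gemm_reference(x, tokens_expert_prefix_sum, weight, scale, out):
    """Assumed reference semantics of group_gemm: one GEMM per expert over
    that expert's contiguous slice of the permuted tokens.
    tokens_expert_prefix_sum[e] is taken to be the cumulative token count
    up to and including expert e (convention assumed)."""
    start = 0
    for e in range(weight.shape[0]):
        end = int(tokens_expert_prefix_sum[e])
        # weight-only int8: dequantize with a per-output-channel scale
        w = weight[e].astype(np.float32) * scale[e]
        out[start:end] = x[start:end] @ w
        start = end
    return out

def swiglu_reference(x):
    """SwiGLU over a fused up_gate projection; assumes the first half of the
    last dim is the gate (conventions differ between kernels)."""
    gate, up = np.split(x, 2, axis=-1)
    return up * (gate / (1.0 + np.exp(-gate)))  # silu(gate) * up
```

Note that the kernel copies tokens_expert_prefix_sum to the CPU once up front, presumably so the per-expert launch loop can read slice boundaries on the host without a device sync per expert.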
This is the first round of FD performance optimization on Iluvatar hardware. The specific optimization strategies are:

With this version, running the ERNIE 4.5 300B model on the GSM8K dataset takes about 6.3 h end to end, with an accuracy of 0.964.