[Iluvatar GPU] Optimize attention and MoE performance #3234
Conversation
Thanks for your contribution!
tianshuo78520a left a comment:
LGTM CI
yongqiangma left a comment:
LGTM
DDDivano left a comment:
LGTM
```diff
-class IluvatarWorker(WorkerBase):
+class IluvatarWorker(GpuWorker):
```
Review comment:

Why does the Iluvatar execution flow have to be coupled with the GPU one?

Reply:

> Why does the Iluvatar execution flow have to be coupled with the GPU one?

When we upgraded from the 0630 version to the latest commit, we found that gpu_model_runner had changed a great deal, while the previously adapted iluvatar_model_runner was a copy of gpu_model_runner: apart from a few imported operators, the two were identical. With this change, future upgrades can reuse the gpu_model_runner flow directly instead of repeating the copy work. If an incompatible flow shows up, we will override the affected member function in iluvatar_model_runner to keep it working.

Reply:

OK. The execution flow has been iterating rapidly since 6.30; once the executor stabilizes, the executors for the different hardware backends will need to be separated again.

Reply:

OK, we'll separate them once it stabilizes.
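A minimal sketch of the reuse-and-override pattern described above; the class and method bodies here are illustrative placeholders, not FastDeploy's actual API:

```python
# Illustrative only: the Iluvatar runner inherits the GPU runner's whole
# execution flow and overrides just the members whose behavior differs.

class GpuModelRunner:
    """Stand-in for the upstream GPU model runner."""

    def execute_model(self):
        # Shared flow: any upstream change here is picked up automatically.
        inputs = self.prepare_inputs()
        return self.forward(inputs)

    def prepare_inputs(self):
        return {"backend": "cuda"}  # placeholder

    def forward(self, inputs):
        return inputs  # placeholder


class IluvatarModelRunner(GpuModelRunner):
    # Only the incompatible member function is overridden; everything else
    # reuses the GPU implementation, avoiding a full copy of the runner.
    def prepare_inputs(self):
        return {"backend": "iluvatar"}  # placeholder


print(IluvatarModelRunner().execute_model())  # -> {'backend': 'iluvatar'}
```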
```python
    def initialize_cache(self, num_gpu_blocks: int) -> None:
        """Propagate the number of GPU cache blocks to the model runner."""
        self.model_runner.update_share_input_block_num(num_gpu_blocks=num_gpu_blocks)


class IluvatarPaddleDisWorkerProc(PaddleDisWorkerProc):
```
Review comment:

The worker proc should not sit at the worker level; please leave a TODO here. The earlier code did not reserve an interface for multiple worker procs; the architecture will be more reasonable once the FastDeploy executor is refactored.

Reply:

OK, I'll update this after the refactor.
```diff
-    ffn1_weight: paddle.Tensor,
-    ffn2_weight: paddle.Tensor,
-    ffn1_bias: Optional[paddle.Tensor],
-    ffn1_scale: Optional[paddle.Tensor],
-    ffn2_scale: Optional[paddle.Tensor],
-    ffn2_in_scale: Optional[paddle.Tensor],
+    up_gate_proj_weight: paddle.Tensor,
+    down_proj_weight: paddle.Tensor,
+    up_gate_proj_bias: Optional[paddle.Tensor],
+    up_gate_proj_scale: Optional[paddle.Tensor],
+    down_proj_scale: Optional[paddle.Tensor],
+    down_proj_in_scale: Optional[paddle.Tensor],
     expert_idx_per_token: Optional[paddle.Tensor],
     quant_method: str,
     used_in_ep_low_latency: bool,
 ):
-    assert ffn1_bias is None
-    assert ffn1_scale is not None
-    assert ffn2_scale is not None
-    assert ffn2_in_scale is None
+    assert up_gate_proj_bias is None
+    assert up_gate_proj_scale is not None
+    assert down_proj_scale is not None
+    assert down_proj_in_scale is None
     assert expert_idx_per_token is None
     assert quant_method in ("weight_only_int8",)  # tuple, so this is a membership test
     assert not used_in_ep_low_latency
     tokens_expert_prefix_sum_cpu = tokens_expert_prefix_sum.to("cpu")
     up_gate_proj_output = paddle.empty(
         [permute_input.shape[0], up_gate_proj_weight.shape[1]],
         dtype=permute_input.dtype,
     )
     group_gemm(
         permute_input,
         tokens_expert_prefix_sum_cpu,
         up_gate_proj_weight,
         up_gate_proj_scale,
         up_gate_proj_output,
     )
     act_out = swiglu(up_gate_proj_output)
     output = paddle.empty([act_out.shape[0], down_proj_weight.shape[1]], dtype=act_out.dtype)
     group_gemm(
         act_out,
         tokens_expert_prefix_sum_cpu,
         down_proj_weight,
         down_proj_scale,
         output,
     )
```
Review comment:

For these, please keep the original naming; the ffn1/ffn2 wording is discouraged in FD.

Reply:

OK.

Reply:

> OK.

Please submit a fix PR.

Reply:

Submitted: #3273
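The grouped-GEMM flow in the hunk above permutes tokens so that each expert's tokens sit in one contiguous slice, runs one up_gate projection GEMM per expert, applies SwiGLU, then runs the down projection the same way. Below is a minimal NumPy sketch of the semantics assumed here; the prefix-sum convention, weight layout, and gate/up split order are assumptions, not the kernel's actual contract:

```python
import numpy as np

def group_gemm_reference(x, tokens_expert_prefix_sum, weight, scale, out):
    """Assumed reference semantics of group_gemm: one GEMM per expert over
    that expert's contiguous slice of the permuted tokens.
    tokens_expert_prefix_sum[e] is taken to be the cumulative token count
    up to and including expert e (convention assumed)."""
    start = 0
    for e in range(weight.shape[0]):
        end = int(tokens_expert_prefix_sum[e])
        # weight-only int8: dequantize with a per-output-channel scale
        w = weight[e].astype(np.float32) * scale[e]
        out[start:end] = x[start:end] @ w
        start = end
    return out

def swiglu_reference(x):
    """SwiGLU over a fused up_gate projection; assumes the first half of the
    last dim is the gate (conventions differ between kernels)."""
    gate, up = np.split(x, 2, axis=-1)
    return up * (gate / (1.0 + np.exp(-gate)))  # silu(gate) * up
```

Note that the kernel copies tokens_expert_prefix_sum to the CPU once up front, presumably so the per-expert launch loop can read slice boundaries on the host without a device sync per expert.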
This is the first round of FD performance optimization on Iluvatar hardware. The specific optimization strategies are:

With this version, running the ERNIE 4.5 300B model on the GSM8K dataset takes about 6.3 h end to end, with an accuracy of 0.964.