
Conversation

@wuyujiji (Contributor) commented Aug 6, 2025

First round of FD performance optimizations on Iluvatar hardware. The optimization strategies are:

  1. Added support for decode fused RoPE attention (a conceptual sketch follows below), improving end-to-end performance by 25%;
  2. MoE optimizations, improving end-to-end performance by 51%;
  3. Optimized attention pre- and post-processing, improving end-to-end performance by 1.9x.

With this version, running the ERNIE 4.5 300B model on the GSM8K dataset takes about 6.3 hours in total, with an accuracy of 0.964.
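
For context on item 1, the unfused reference below shows what a decode fused RoPE attention kernel computes in a single launch: apply rotary position embedding to the new query/key, then attend over the KV cache. This is a generic Paddle sketch for illustration only; the function names, tensor layouts, and the real Iluvatar kernel interface are assumptions, not FD code.

import math
import paddle
import paddle.nn.functional as F

def rope(x, pos, base=10000.0):
    # x: [num_heads, head_dim]; rotate channel pairs by position-dependent angles.
    # Channel ordering is simplified; real kernels interleave the rotated halves.
    hd = x.shape[-1]
    inv_freq = paddle.exp(paddle.arange(0, hd, 2, dtype="float32") * (-math.log(base) / hd))
    ang = float(pos) * inv_freq
    cos, sin = paddle.cos(ang), paddle.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return paddle.concat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def decode_step_unfused(q, k_new, v_new, k_cache, v_cache, pos):
    # q, k_new, v_new: [num_heads, head_dim]; caches: [seq_len, num_heads, head_dim].
    q, k_new = rope(q, pos), rope(k_new, pos)
    k = paddle.concat([k_cache, k_new.unsqueeze(0)], axis=0)
    v = paddle.concat([v_cache, v_new.unsqueeze(0)], axis=0)
    scores = paddle.einsum("hd,shd->hs", q, k) / (q.shape[-1] ** 0.5)
    return paddle.einsum("hs,shd->hd", F.softmax(scores, axis=-1), v)

Fusing these steps removes separate kernel launches and the intermediate reads/writes of the rotated q/k, which is the usual motivation for such fused kernels.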

@CLAassistant commented Aug 6, 2025

CLA assistant check
All committers have signed the CLA.

@wuyujiji force-pushed the iluvatar_optim branch 2 times, most recently from 863a582 to 9516b2d on August 6, 2025 03:45
paddle-bot commented Aug 6, 2025

Thanks for your contribution!

paddle-bot added the contributor (External developers) label on Aug 6, 2025
@wuyujiji force-pushed the iluvatar_optim branch 3 times, most recently from a315aa2 to 9748bd1 on August 6, 2025 07:32
@yongqiangma previously approved these changes Aug 6, 2025
@wuyujiji force-pushed the iluvatar_optim branch 3 times, most recently from 1a8a265 to 5cb095b on August 7, 2025 05:01
@tianshuo78520a (Collaborator) left a comment

LGTM CI

@yongqiangma (Collaborator) left a comment

LGTM

@DDDivano (Collaborator) left a comment

LGTM

@Jiang-Jia-Jun merged commit fbdd6b0 into PaddlePaddle:develop on Aug 8, 2025
21 of 28 checks passed


-class IluvatarWorker(WorkerBase):
+class IluvatarWorker(GpuWorker):
Collaborator:

Why does the Iluvatar execution flow need to be coupled with the GPU one?

Contributor Author:

> Why does the Iluvatar execution flow need to be coupled with the GPU one?

When upgrading from the 0630 version to the latest commit, we found that gpu_model_runner had changed a lot, while the previously adapted iluvatar_model_runner was simply a copy of gpu_model_runner: apart from a few imported operators, the two were identical. With this change, future upgrades can reuse the gpu_model_runner flow directly and avoid the repetitive copy work. If an incompatible step appears, we will override that member function in iluvatar_model_runner to keep things working.
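
To illustrate the reuse-and-override approach described above, here is a minimal sketch with purely illustrative names (the real FastDeploy GPUModelRunner interface is larger and different):

class GPUModelRunner:
    """Stand-in for the shared GPU execution flow."""

    def prepare_inputs(self, batch):
        # hardware-agnostic step, inherited unchanged by the Iluvatar runner
        return batch

    def capture_graph(self):
        # hypothetical step that may not apply to every backend
        print("capturing CUDA graph")


class IluvatarModelRunner(GPUModelRunner):
    """Only the steps that differ on Iluvatar are overridden; everything else is
    inherited, so upstream changes to the GPU flow are picked up automatically."""

    def capture_graph(self):
        # e.g. skip or replace graph capture on this backend
        pass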

Collaborator:

OK. The execution flow has been iterating rapidly since 6.30; once the executor stabilizes, the executors for the different hardware backends will need to be separated again.

Contributor Author:

OK, we'll separate them once things stabilize.

    def initialize_cache(self, num_gpu_blocks: int) -> None:
        """Propagate the allocated cache block count to the model runner."""
        self.model_runner.update_share_input_block_num(num_gpu_blocks=num_gpu_blocks)

class IluvatarPaddleDisWorkerProc(PaddleDisWorkerProc):
Collaborator:

The worker proc should not live at the worker level; please leave a TODO here. The earlier code did not reserve an interface for supporting multiple worker procs. The architecture will become more reasonable after the FastDeploy executor refactor.

Contributor Author:

OK, I'll update this after the refactor.

Comment on lines -81 to -118
    up_gate_proj_weight: paddle.Tensor,
    down_proj_weight: paddle.Tensor,
    up_gate_proj_bias: Optional[paddle.Tensor],
    up_gate_proj_scale: Optional[paddle.Tensor],
    down_proj_scale: Optional[paddle.Tensor],
    down_proj_in_scale: Optional[paddle.Tensor],
    ffn1_weight: paddle.Tensor,
    ffn2_weight: paddle.Tensor,
    ffn1_bias: Optional[paddle.Tensor],
    ffn1_scale: Optional[paddle.Tensor],
    ffn2_scale: Optional[paddle.Tensor],
    ffn2_in_scale: Optional[paddle.Tensor],
    expert_idx_per_token: Optional[paddle.Tensor],
    quant_method: str,
    used_in_ep_low_latency: bool,
):
    assert up_gate_proj_bias is None
    assert up_gate_proj_scale is not None
    assert down_proj_scale is not None
    assert down_proj_in_scale is None
    assert ffn1_bias is None
    assert ffn1_scale is not None
    assert ffn2_scale is not None
    assert ffn2_in_scale is None
    assert expert_idx_per_token is None
    assert quant_method in ("weight_only_int8")
    assert not used_in_ep_low_latency
    tokens_expert_prefix_sum_cpu = tokens_expert_prefix_sum.to("cpu")
    up_gate_proj_output = paddle.empty(
        [permute_input.shape[0], up_gate_proj_weight.shape[1]],
        dtype=permute_input.dtype,
    )
    group_gemm(
        permute_input,
        tokens_expert_prefix_sum_cpu,
        up_gate_proj_weight,
        up_gate_proj_scale,
        up_gate_proj_output,
    )
    act_out = swiglu(up_gate_proj_output)
    output = paddle.empty([act_out.shape[0], down_proj_weight.shape[1]], dtype=act_out.dtype)
    group_gemm(
        act_out,
        tokens_expert_prefix_sum_cpu,
        down_proj_weight,
        down_proj_scale,
        output,
    )
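
For readers unfamiliar with the grouped-GEMM path quoted above, the loop below is a plain-Paddle reference for what it computes: each expert multiplies its own contiguous slice of the permuted tokens, applies SwiGLU, and projects back down. It assumes tokens_expert_prefix_sum holds the inclusive prefix sum of per-expert token counts and ignores the weight-only int8 quantization (weights are treated as plain float tensors); the names, layouts, and the [gate, up] split are assumptions, not the actual kernel contract.

import paddle
import paddle.nn.functional as F

def moe_expert_ffn_reference(permute_input, tokens_expert_prefix_sum,
                             up_gate_proj_weight, down_proj_weight):
    # permute_input: [total_tokens, hidden], rows already grouped by expert.
    # up_gate_proj_weight: [num_experts, hidden, 2 * inter]; down_proj_weight: [num_experts, inter, hidden].
    outputs, start = [], 0
    for e, end in enumerate(tokens_expert_prefix_sum.tolist()):
        if end == start:
            continue  # no tokens routed to this expert
        tokens = permute_input[start:end]
        up_gate = paddle.matmul(tokens, up_gate_proj_weight[e])
        gate, up = paddle.chunk(up_gate, 2, axis=-1)   # assumed [gate, up] layout
        act = F.silu(gate) * up                        # SwiGLU activation
        outputs.append(paddle.matmul(act, down_proj_weight[e]))
        start = end
    return paddle.concat(outputs, axis=0)

The group_gemm call batches all of these per-expert GEMMs into one kernel using the prefix-sum offsets instead of launching a separate matmul per expert.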
Collaborator:

Please keep the original names here; the ffn1/ffn2 naming is discouraged in FD.

Contributor Author:

OK.

Collaborator:

Please submit a follow-up PR with the fix.

Contributor Author:

Submitted: #3273
