[Executor] Refactor GetBlockShapeAndSplitKVBlock Kernel #2989
Conversation
Thanks for your contribution!
Force-pushed from d589fb7 to 7226338
fastdeploy/model_executor/layers/attention/append_attn_backend.py
    encoder_block_shape_q: int = -1,
    decoder_block_shape_q: int = -1,
These two parameters are not shared by all backends; wouldn't it be better to put them in init metadata?
    encoder_block_shape_q = 64
    decoder_block_shape_q = 16
It is not convenient to pass these two parameters in here. Is there a better way? @freeliuzc
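A minimal sketch of the suggestion above, assuming hypothetical names (AppendAttentionMetadata and init_attention_metadata are illustrative, not necessarily the real FastDeploy API): the block-shape constants become backend-local defaults populated when the metadata is initialized, so they no longer need to appear in a constructor signature shared by every backend.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AppendAttentionMetadata:
    # Backend-local defaults: only this backend needs the block shapes, so
    # they live in its own metadata rather than in a constructor signature
    # shared by every AttentionBackend (field names are illustrative).
    encoder_block_shape_q: int = 64
    decoder_block_shape_q: int = 16


class AppendAttentionBackend:
    def __init__(self, model_config):
        # No encoder/decoder block-shape arguments in the shared signature.
        self.model_config = model_config
        self.attention_metadata: Optional[AppendAttentionMetadata] = None

    def init_attention_metadata(self, forward_meta):
        # Backend-specific constants are set at metadata-init time instead of
        # being threaded through every backend's constructor.
        self.attention_metadata = AppendAttentionMetadata()
        return self.attention_metadata


backend = AppendAttentionBackend(model_config=None)
print(backend.init_attention_metadata(forward_meta=None))
```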
    })
    .Outputs({
        paddle::Optional("encoder_batch_ids"),
        paddle::Optional("encoder_tile_ids_per_batch"),
If the Optional here doesn't have much impact on performance, how about removing it and returning empty tensors instead?
> If the Optional here doesn't have much impact on performance, how about removing it and returning empty tensors instead?

These three encoder tensors return shape-0 tensors during the pure decode phase; returning non-empty tensors in the mixed or pure prefill phases is expected behavior. Do you also want the shape of the returned tensors to be fixed?
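A small Python-side sketch of the trade-off being discussed (the consuming function is hypothetical, and it assumes a Paddle build that supports 0-size tensors): with paddle::Optional outputs the pure-decode step yields None, while with fixed outputs it yields a shape-[0] tensor, so the only difference for the consumer is the emptiness check.

```python
import paddle


def consume_encoder_outputs(encoder_batch_ids):
    # Under the Optional convention, pure-decode steps hand back None; under
    # the fixed-output convention they hand back a tensor with shape [0].
    # Mixed or pure-prefill steps return a non-empty tensor either way, so
    # the consumer only differs in its emptiness check.
    if encoder_batch_ids is None or encoder_batch_ids.shape[0] == 0:
        return None  # pure decode: nothing for the encoder path this step
    return encoder_batch_ids  # mixed / pure prefill: run the encoder path


# Illustrative inputs: an empty tensor or None both stand for pure decode.
print(consume_encoder_outputs(paddle.zeros([0], dtype="int32")))
print(consume_encoder_outputs(None))
```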
Force-pushed from f076e56 to 4ffa462
    # MLA
    metadata.max_enc_len_this_time = metadata.set_max_lengths[1]
    metadata.max_dec_len_this_time = metadata.set_max_lengths[2]
    forward_meta.max_enc_len_this_time = metadata.set_max_lengths[1]
The DeepSeek model graph uses forward_meta.max_dec_len_this_time to distinguish prefill from decode. Removing it here will cause problems in the model graph.
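A hedged sketch of the concern, with simplified, illustrative logic (the real DeepSeek/MLA graph is more involved): because the graph branches on forward_meta.max_dec_len_this_time, the field must keep being populated each step.

```python
from dataclasses import dataclass


@dataclass
class ForwardMeta:
    # Per-step maxima, populated from set_max_lengths as in the diff above.
    max_enc_len_this_time: int = 0
    max_dec_len_this_time: int = 0


def mla_branch(forward_meta: ForwardMeta) -> str:
    # Simplified stand-in for the DeepSeek/MLA graph logic: the branch reads
    # forward_meta.max_dec_len_this_time, so removing the assignment would
    # leave this field stale and break the prefill/decode decision.
    if forward_meta.max_dec_len_this_time > 0:
        return "decode_path"
    return "prefill_path"


print(mla_branch(ForwardMeta(max_enc_len_this_time=0, max_dec_len_this_time=8)))
```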
Summary
GetBlockShapeAndSplitKVBlock is an operator in pre-processing that calculates the input of certain types of AttentionBackend. In previous implementations, these inputs were not directly managed by the model runner. The root cause is the unreasonable boundary division between ForwardMeta and AttentionMetaData, which resulted in two issues; the second is the Memcpy required to implement the CudaGraph functionality, which leads to low performance. The current PR addresses only this second issue. The scope of modification includes:
- the GetBlockShapeAndSplitKVBlock kernel (a sketch of the intended call pattern follows)
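A rough sketch of the call pattern this refactor aims for, using hypothetical names (pre_allocate_decoder_buffers and the buffer names are illustrative, not the actual FastDeploy symbols): the model runner owns the output buffers and the kernel fills them in place, which is what removes the per-step Memcpy from the CudaGraph path.

```python
import paddle


def pre_allocate_decoder_buffers(max_batch_size: int):
    # Hypothetical one-time allocation done by the model runner, outside the
    # CudaGraph capture region. The refactored kernel would then write its
    # results into these buffers in place, so no per-step Memcpy is needed
    # inside the captured graph.
    return {
        "decoder_batch_ids": paddle.zeros([max_batch_size], dtype="int32"),
        "decoder_tile_ids_per_batch": paddle.zeros([max_batch_size], dtype="int32"),
        "decoder_num_blocks": paddle.zeros([1], dtype="int32"),
    }


# Allocated once and reused every step instead of being re-created and copied.
buffers = pre_allocate_decoder_buffers(max_batch_size=64)
print({name: t.shape for name, t in buffers.items()})
```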
Contrast
When the batch size of the Ernie-21B model is 64, a single step can be reduced by 2 ms in some cases. The larger the batch size, the higher the speedup.