[Executor] Refactor GetBlockShapeAndSplitKVBlock Kernel #2989
Conversation
Thanks for your contribution!
Force-pushed from d589fb7 to 7226338
fastdeploy/model_executor/layers/attention/append_attn_backend.py
    encoder_block_shape_q: int = -1,
    decoder_block_shape_q: int = -1,
These two parameters are not shared by all backends; wouldn't it be better to put them in init metadata?
    encoder_block_shape_q = 64
    decoder_block_shape_q = 16
It is not convenient to pass these two parameters in here. Is there a better way? @freeliuzc
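A minimal sketch of the suggestion above, assuming hypothetical names (AppendAttentionMetadata and init_attention_metadata are illustrative, not necessarily the real FastDeploy API): the block-shape constants become backend-local defaults populated when the metadata is initialized, so they no longer need to appear in a constructor signature shared by every backend.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AppendAttentionMetadata:
    # Backend-local defaults: only this backend needs the block shapes, so
    # they live in its own metadata rather than in a constructor signature
    # shared by every AttentionBackend (field names are illustrative).
    encoder_block_shape_q: int = 64
    decoder_block_shape_q: int = 16


class AppendAttentionBackend:
    def __init__(self, model_config):
        # No encoder/decoder block-shape arguments in the shared signature.
        self.model_config = model_config
        self.attention_metadata: Optional[AppendAttentionMetadata] = None

    def init_attention_metadata(self, forward_meta):
        # Backend-specific constants are set at metadata-init time instead of
        # being threaded through every backend's constructor.
        self.attention_metadata = AppendAttentionMetadata()
        return self.attention_metadata


backend = AppendAttentionBackend(model_config=None)
print(backend.init_attention_metadata(forward_meta=None))
```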
    })
    .Outputs({
        paddle::Optional("encoder_batch_ids"),
        paddle::Optional("encoder_tile_ids_per_batch"),
If the Optional here doesn't have much impact on performance, how about removing it and returning empty tensors instead?
> If the Optional here doesn't have much impact on performance, how about removing it and returning empty tensors instead?

These three encoder tensors return shape-0 tensors during the pure decode phase; returning non-empty tensors in the mixed or pure prefill phases is expected behavior. Do you also want the shape of the returned tensors to be fixed?
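A small Python-side sketch of the trade-off being discussed (the consuming function is hypothetical, and it assumes a Paddle build that supports 0-size tensors): with paddle::Optional outputs the pure-decode step yields None, while with fixed outputs it yields a shape-[0] tensor, so the only difference for the consumer is the emptiness check.

```python
import paddle


def consume_encoder_outputs(encoder_batch_ids):
    # Under the Optional convention, pure-decode steps hand back None; under
    # the fixed-output convention they hand back a tensor with shape [0].
    # Mixed or pure-prefill steps return a non-empty tensor either way, so
    # the consumer only differs in its emptiness check.
    if encoder_batch_ids is None or encoder_batch_ids.shape[0] == 0:
        return None  # pure decode: nothing for the encoder path this step
    return encoder_batch_ids  # mixed / pure prefill: run the encoder path


# Illustrative inputs: an empty tensor or None both stand for pure decode.
print(consume_encoder_outputs(paddle.zeros([0], dtype="int32")))
print(consume_encoder_outputs(None))
```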
Force-pushed from f076e56 to 4ffa462
    # MLA
    metadata.max_enc_len_this_time = metadata.set_max_lengths[1]
    metadata.max_dec_len_this_time = metadata.set_max_lengths[2]
    forward_meta.max_enc_len_this_time = metadata.set_max_lengths[1]
The DeepSeek model graph uses forward_meta.max_dec_len_this_time to distinguish prefill from decode. Removing it here will cause problems in the model graph.
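A hedged sketch of the concern, with simplified, illustrative logic (the real DeepSeek/MLA graph is more involved): because the graph branches on forward_meta.max_dec_len_this_time, the field must keep being populated each step.

```python
from dataclasses import dataclass


@dataclass
class ForwardMeta:
    # Per-step maxima, populated from set_max_lengths as in the diff above.
    max_enc_len_this_time: int = 0
    max_dec_len_this_time: int = 0


def mla_branch(forward_meta: ForwardMeta) -> str:
    # Simplified stand-in for the DeepSeek/MLA graph logic: the branch reads
    # forward_meta.max_dec_len_this_time, so removing the assignment would
    # leave this field stale and break the prefill/decode decision.
    if forward_meta.max_dec_len_this_time > 0:
        return "decode_path"
    return "prefill_path"


print(mla_branch(ForwardMeta(max_enc_len_this_time=0, max_dec_len_this_time=8)))
```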
Summary
GetBlockShapeAndSplitKVBlock is an operator in pre-processing that calculates the input of certain types of AttentionBackend. In previous implementations, these inputs were not directly managed by the model runner. The root cause is the unreasonable boundary division between ForwardMeta and AttentionMetaData, which resulted in two issues; the second is the Memcpy required to implement the CudaGraph functionality, which leads to low performance. The current PR addresses only this second issue. The scope of modification includes:
- the GetBlockShapeAndSplitKVBlock kernel (a sketch of the intended call pattern follows)
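A rough sketch of the call pattern this refactor aims for, using hypothetical names (pre_allocate_decoder_buffers and the buffer names are illustrative, not the actual FastDeploy symbols): the model runner owns the output buffers and the kernel fills them in place, which is what removes the per-step Memcpy from the CudaGraph path.

```python
import paddle


def pre_allocate_decoder_buffers(max_batch_size: int):
    # Hypothetical one-time allocation done by the model runner, outside the
    # CudaGraph capture region. The refactored kernel would then write its
    # results into these buffers in place, so no per-step Memcpy is needed
    # inside the captured graph.
    return {
        "decoder_batch_ids": paddle.zeros([max_batch_size], dtype="int32"),
        "decoder_tile_ids_per_batch": paddle.zeros([max_batch_size], dtype="int32"),
        "decoder_num_blocks": paddle.zeros([1], dtype="int32"),
    }


# Allocated once and reused every step instead of being re-created and copied.
buffers = pre_allocate_decoder_buffers(max_batch_size=64)
print({name: t.shape for name, t in buffers.items()})
```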
Contrast
When the batch size of the Ernie-21B model is 64, a single step can be reduced by 2 ms in some cases. The larger the batch size, the higher the speedup.