[Executor] Fixed the issue of CUDA graph execution failure caused by different branches during decoding #3223
Conversation
Thanks for your contribution!
gongshaotian
left a comment
LGTM
  chunk_size = static_cast<uint32_t>(encoder_max_partition_size);
}
- const int num_chunks = div_up(max_dec_len, chunk_size);
+ const int num_chunks = div_up(encoder_max_partition_size, chunk_size);
What is the reason for this change? If it is meant to fix num_chunks to a constant, it would be better to use max_seq_len.
Won't this cause resource redundancy when launching the kernel? Does performance drop as a result?
Regarding resource redundancy: the allocated GPU memory will indeed be redundant, and that is hard to avoid. As for compute resources, multi_query_append_attention_warp1_4_kernel is launched with num_chunks (computed from the longest seq in the current batch) as a launch parameter, so it does launch some extra blocks. However, the original design already had to handle batches containing requests with different num_chunks_this_seq (computed from each request's own seq), so it already relied on an early exit
if (chunk_idx >= num_chunks_this_seq) {
return;
}
to avoid wasting compute resources (a simplified sketch of this launch pattern appears after this reply).
In merge_multi_chunks_decoder_kernel, the launch parameters are independent of num_chunks, which makes it relatively CUDA-graph friendly; num_chunks_this_seq is handled internally with a loop. The only use of num_chunks inside it is to compute some offsets, so that kernel's resource usage is effectively identical before and after the change.
As for performance: the test results posted earlier show that decoding speed does drop somewhat, and enabling CUDA graph does not fully recover it. However, latency decreased because the achievable concurrency increased. Why concurrency increases still needs further analysis.
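To make the trade-off concrete, here is a minimal CUDA sketch, assuming illustrative names only (it is not the repository's actual kernel): the grid is sized from a fixed worst-case chunk count so the launch configuration never changes between capture and replay, and blocks beyond a sequence's real chunk count return immediately.

#include <cuda_runtime.h>

// Illustrative sketch only; the real multi_query_append_attention_warp1_4_kernel
// is far more involved. It shows a grid sized by a fixed maximum chunk count plus
// a per-sequence early exit, so the redundant blocks cost almost nothing.
__global__ void chunked_decode_attention_sketch(const int* seq_lens,
                                                int chunk_size,
                                                float* partial_out) {
  const int batch_idx = blockIdx.x;
  const int chunk_idx = blockIdx.y;  // ranges over the fixed maximum num_chunks

  // Chunk count derived from this sequence's actual length.
  const int num_chunks_this_seq =
      (seq_lens[batch_idx] + chunk_size - 1) / chunk_size;

  // Early exit: blocks launched beyond this sequence's real chunk count do no work.
  if (chunk_idx >= num_chunks_this_seq) {
    return;
  }

  // ... attention over keys/values in chunk [chunk_idx * chunk_size, ...) ...
  (void)partial_out;
}

// Host side: the grid depends only on startup-time constants, never on the
// longest sequence in the current batch, so a captured graph replays with the
// same launch parameters every time.
void launch_sketch(const int* d_seq_lens, float* d_partial, int batch_size,
                   int max_model_len, int chunk_size, cudaStream_t stream) {
  const int max_num_chunks = (max_model_len + chunk_size - 1) / chunk_size;
  dim3 grid(batch_size, max_num_chunks);
  chunked_decode_attention_sketch<<<grid, 128, 0, stream>>>(
      d_seq_lens, chunk_size, d_partial);
}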
Yes, it is indeed meant to fix num_chunks. encoder_max_partition_size is currently assigned from max_seq_len, so for now the two can be treated as the same thing, but the intended meaning of encoder_max_partition_size may change later, and max_seq_len is easier to understand, so max_seq_len should be used.
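A tiny sketch of the suggested adjustment, assuming the div_up and max_seq_len names used elsewhere in this thread (not the repository's exact code):

inline int div_up(int a, int b) { return (a + b - 1) / b; }

// Suggested form: derive the fixed upper bound from max_seq_len, which currently
// holds the same value as encoder_max_partition_size but states the intent more
// clearly and is unaffected if encoder_max_partition_size is repurposed later.
inline int fixed_num_chunks(int max_seq_len, int chunk_size) {
  return div_up(max_seq_len, chunk_size);
}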
gongshaotian
left a comment
LGTM
…ifferent branches during decoding (#3223) (#3512)
* Thoroughly fix the decode chunking problem
* update C8 and C4 kernel
* fix problem
* fix with pre-commit
* retain branch for mtp
Co-authored-by: Jundong Liu <[email protected]>



The root cause is that the current append attention design for long inputs (longer than max_partition_size) is incompatible with CUDA graph. The issue was presumably never exposed before because CUDA graph had not been used for long-input scenarios.
Problem 1: the nosplit_kv_kernel branch calls only multi_query_append_attention_warp1_4_kernel, while the split_kv_kernel branch calls multi_query_append_attention_warp1_4_kernel followed by merge_multi_chunks_decoder_kernel. If capture and replay take different branches, a CUDA error 700 occurs.
Fix: num_chunks is always greater than or equal to 1, so changing the condition for entering the nosplit_kv_kernel branch to num_chunks <= 0 guarantees that branch is never taken. The branch will be removed later; the current form is simply the minimal, easy-to-understand change.
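A minimal sketch of this dispatch change, using the kernel names from the discussion above but with an illustrative structure rather than the repository's exact code:

// With num_chunks always >= 1, the nosplit path below is unreachable, so graph
// capture and replay always follow the same split path.
void dispatch_append_attention_sketch(int num_chunks) {
  if (num_chunks <= 0) {
    // nosplit_kv path: would call only multi_query_append_attention_warp1_4_kernel.
    // Kept only as the minimal change for now; planned for removal.
  } else {
    // split_kv path: multi_query_append_attention_warp1_4_kernel followed by
    // merge_multi_chunks_decoder_kernel. Capture and replay both land here.
  }
}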
Problem 2: in the original split_kv_kernel branch, the launch parameters of multi_query_append_attention_warp1_4_kernel depend on num_chunks, and the sizes of the temporary buffers (tmp_workspace, temp_p, temp_d) also depend on num_chunks. Because kernel launch parameters and allocation sizes are frozen at CUDA graph capture time, after fixing Problem 1, replaying a graph captured with a small num_chunks on a request that requires a larger num_chunks produces incorrect decoding results.
Fix: instead of computing num_chunks from the longest seq_len in the current batch, launch the kernel and allocate the buffers using the theoretical maximum num_chunks, i.e. div_up(encoder_max_partition_size, chunk_size) (encoder_max_partition_size is in fact the server startup parameter max_model_len). If the meaning of encoder_max_partition_size changes later, this code must be updated accordingly.
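A hedged sketch of how the worst-case sizing might look; the buffer shapes below are placeholders, and the real tmp_workspace/temp_p/temp_d layouts and surrounding allocation code differ:

#include <cstddef>

inline int div_up(int a, int b) { return (a + b - 1) / b; }

struct SplitKvWorkspaceSizes {
  size_t tmp_workspace_bytes;
  size_t temp_p_bytes;  // placeholder for per-chunk partial outputs
  size_t temp_d_bytes;  // placeholder for per-chunk softmax statistics
};

// Sizes and launch grid both derive from the theoretical maximum chunk count,
// so they stay constant for the lifetime of a captured CUDA graph regardless of
// which batch happened to be captured.
SplitKvWorkspaceSizes plan_split_kv_workspace(int encoder_max_partition_size,
                                              int chunk_size, int batch_size,
                                              int num_heads, int head_dim) {
  const int max_num_chunks = div_up(encoder_max_partition_size, chunk_size);
  SplitKvWorkspaceSizes s;
  s.tmp_workspace_bytes = static_cast<size_t>(batch_size) * max_num_chunks *
                          num_heads * head_dim * sizeof(float);
  s.temp_p_bytes = s.tmp_workspace_bytes;  // assumed same shape as the workspace
  s.temp_d_bytes = static_cast<size_t>(batch_size) * max_num_chunks *
                   num_heads * sizeof(float);
  return s;
}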
Follow-up: to keep the code clean, some branches need to be removed, and the C8 and C4 kernels also need to be updated.