[Executor]CUDAGraph support Speculate Decode #3769
Thanks for your contribution!
Merge develop And Enable target model in cuda graph
This reverts commit 8351e83.
[Excutor] Enable only Target model in cudagraph v1.0
Enable Target Model Padding And Draft Model in cudagraph
__shared__ float md_smem[bdy * 2];
for (int qid = blockIdx.x; qid < token_num; qid += gridDim.x) {
  const uint32_t bid = batch_id_per_token[qid];
  if(bid == -1){
Please follow the coding conventions here.
Please follow the coding conventions here.
Could `bid` be changed from `uint32_t` to `int`? Is there any risk from the smaller value range?
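To make the `uint32_t` versus `int` question concrete, here is a minimal host-side C++ sketch of the two ways the `bid == -1` check can be written. It assumes that `-1` marks a padded token slot, which the snippet suggests but does not state. The names `kPaddingBatchId`, `is_padding_unsigned`, `is_padding_signed`, and `count_real_tokens` are hypothetical and not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sentinel for a padded token slot (assumed from `bid == -1`).
constexpr int kPaddingBatchId = -1;

// Variant A: mirrors the reviewed snippet. The id is held as uint32_t, so -1
// wraps to 0xFFFFFFFF. Comparing against -1 still matches because the int
// operand is converted to unsigned first; the cast here makes that explicit.
inline bool is_padding_unsigned(int raw_id) {
  const uint32_t bid = static_cast<uint32_t>(raw_id);
  return bid == static_cast<uint32_t>(kPaddingBatchId);
}

// Variant B: what the reviewer proposes. The id stays signed, so the check
// reads naturally and avoids a signed/unsigned comparison.
inline bool is_padding_signed(int raw_id) {
  const int bid = raw_id;
  return bid == kPaddingBatchId;
}

// Counts the tokens that belong to a real request, skipping padded slots.
inline int count_real_tokens(const std::vector<int>& batch_id_per_token) {
  int real = 0;
  for (int id : batch_id_per_token) {
    if (!is_padding_signed(id)) {
      ++real;
    }
  }
  return real;
}
```

Both variants agree for every id that fits in a 32-bit `int`. Switching to `int` lowers the largest representable id from 2^32 - 1 to 2^31 - 1, which only matters if batch ids could ever exceed that bound.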
const int num_chunks_this_seq = div_up(seq_len_kv, chunk_size);
if (num_chunks_this_seq <= 1) {
  continue;
}else if (!ENABLE_PREFILL){
Same as above.
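For readers unfamiliar with the `div_up` pattern in the hunk above, the following C++ sketch shows the usual ceiling-division idiom and the resulting "one chunk or fewer" test. The PR excerpt does not show how `div_up` is defined, so this definition is an assumption, and `spans_multiple_chunks` is a hypothetical helper added only for illustration.

```cpp
#include <cassert>

// Ceiling division for non-negative a and positive b (assumed definition).
constexpr int div_up(int a, int b) { return (a + b - 1) / b; }

// True when a KV sequence of length seq_len_kv needs more than one chunk,
// i.e. the case the `num_chunks_this_seq <= 1` branch does not skip.
constexpr bool spans_multiple_chunks(int seq_len_kv, int chunk_size) {
  return div_up(seq_len_kv, chunk_size) > 1;
}
```

With a chunk size of 256, a sequence of length 256 fits in a single chunk and is skipped, while a length of 257 needs two chunks and falls through to the later branches.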
yuanlehome
left a comment
LGTM
Deleter-D
left a comment
LGTM
Summary
Adds CUDA Graph support for speculative decoding. Currently, only the N-Gram and MTP speculative decoding algorithms are supported.