[Executor]CUDAGraph support Speculate Decode #3769

gongshaotian · 2025-09-01T08:52:10Z

Summary

Adds CUDAGraph support for speculative decoding. Currently, only the N-Gram and MTP speculative decoding algorithms are supported.

N-Gram: The maximum capture size supported is 256
MTP: The maximum capture size supported is 512
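
As a rough illustration of the two limits above, the sketch below filters a list of candidate capture sizes against the per-method maximum. The function names `max_capture_size` and `filter_capture_sizes` and the method strings `"ngram"` and `"mtp"` are hypothetical and are not taken from FastDeploy; only the values 256 and 512 come from this summary.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: the limits are the ones stated in the PR summary.
// Returns 0 for any method this PR does not cover.
int max_capture_size(const std::string& method) {
  if (method == "ngram") return 256;
  if (method == "mtp") return 512;
  return 0;
}

// Hypothetical helper: drops every candidate size above the method's limit.
std::vector<int> filter_capture_sizes(const std::string& method,
                                      std::vector<int> sizes) {
  const int limit = max_capture_size(method);
  sizes.erase(std::remove_if(sizes.begin(), sizes.end(),
                             [limit](int s) { return s > limit; }),
              sizes.end());
  return sizes;
}
```

With this sketch, a candidate list of 1, 128, 256, 512 would keep all four sizes for MTP and drop 512 for N-Gram.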

This reverts commit 32b3962.

paddle-bot · 2025-09-01T08:52:15Z

Thanks for your contribution!

custom_ops/gpu_ops/rebuild_padding.cu

Merge develop And Enable target model in cuda graph

Mtp

This reverts commit 8351e83.

[Excutor] Enable only Target model in cudagraph v1.0

… mtp

custom_ops/gpu_ops/rebuild_padding.cu

Enable Target Model Padding And Draft Model in cudagraph

fastdeploy/worker/gpu_model_runner.py

custom_ops/gpu_ops/speculate_decoding/speculate_verify.cu

fastdeploy/spec_decode/mtp.py

This reverts commit 834639a.

Solve comments

merge develop

carryyu · 2025-09-24T09:16:03Z

custom_ops/gpu_ops/append_attn/append_attention_func.cuh

  __shared__ float md_smem[bdy * 2];
  for (int qid = blockIdx.x; qid < token_num; qid += gridDim.x) {
    const uint32_t bid = batch_id_per_token[qid];
+    if(bid == -1){


Please mind the coding conventions.

Can bid be switched from uint32_t to int here? Is there any risk from the reduced value range?
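
To make the question about `bid` concrete, here is a minimal host-side sketch, not the kernel itself. It assumes that `-1` in `batch_id_per_token` marks a padded slot, which is what the `bid == -1` check in the diff suggests. With a `uint32_t` variable, the comparison against `-1` still matches because the usual arithmetic conversions turn `-1` into `0xFFFFFFFF`, but it relies on an implicit signed-to-unsigned conversion. With `int`, the check can be written as `bid < 0`, and the representable maximum of 2,147,483,647 is far above any realistic batch index. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the kernel loop, using an unsigned batch id.
// Counts the tokens that are not skipped as padding.
int count_real_tokens_unsigned(const std::vector<int>& batch_id_per_token) {
  int processed = 0;
  for (int v : batch_id_per_token) {
    const uint32_t bid = static_cast<uint32_t>(v);  // -1 wraps to 0xFFFFFFFF
    // Explicit form of `bid == -1`: the -1 is converted to unsigned first.
    if (bid == static_cast<uint32_t>(-1)) continue;
    ++processed;
  }
  return processed;
}

// Same loop with a signed batch id: the sentinel check needs no conversion.
int count_real_tokens_signed(const std::vector<int>& batch_id_per_token) {
  int processed = 0;
  for (const int bid : batch_id_per_token) {
    if (bid < 0) continue;  // padded slot
    ++processed;
  }
  return processed;
}
```

Both variants skip the same tokens; the signed version states the intent directly and avoids sign-compare warnings.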

carryyu · 2025-09-24T09:16:13Z

custom_ops/gpu_ops/append_attn/append_attention_func.cuh

    const int num_chunks_this_seq = div_up(seq_len_kv, chunk_size);
    if (num_chunks_this_seq <= 1) {
      continue;
+    }else if (!ENABLE_PREFILL){
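
For reference, the chunk count in this hunk can be sketched on the host as follows. The definition of `div_up` as ceiling division is an assumption (its real definition is not shown in this excerpt), and `needs_multi_chunk_pass` is a hypothetical name used only to express the `num_chunks_this_seq <= 1` early exit.

```cpp
#include <cassert>

// Assumed definition: ceiling division for non-negative a and positive b.
constexpr int div_up(int a, int b) { return (a + b - 1) / b; }

// Hypothetical helper: true when the sequence's KV length spans more than
// one chunk, i.e. when the loop in the hunk would not `continue`.
bool needs_multi_chunk_pass(int seq_len_kv, int chunk_size) {
  const int num_chunks_this_seq = div_up(seq_len_kv, chunk_size);
  return num_chunks_this_seq > 1;
}
```

Under this assumption, a sequence whose KV length is at most `chunk_size` yields one chunk (or zero for an empty sequence) and is skipped.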


… mtp

fix bug

… mtp

custom_ops/gpu_ops/append_attention.cu

fastdeploy/spec_decode/mtp.py

custom_ops/gpu_ops/speculate_decoding/speculate_get_padding_offset.cu

fastdeploy/config.py

yuanlehome

LGTM

Deleter-D

LGTM

gongshaotian and others added 5 commits August 20, 2025 16:27

success run ngram

8351e83

Revert "[Code Simplification] remove cum_offsets (PaddlePaddle#3410)"

02e8384

This reverts commit 32b3962.

success run ngram5 tp4 42bs

1444ba6

success run ngram5 tp4 42bs

892c0c2

merge develop

18d9823

gongshaotian commented Sep 1, 2025

custom_ops/gpu_ops/rebuild_padding.cu

mtp draft commit

64ea2f7

gongshaotian force-pushed the mtp branch from 3d573d1 to 64ea2f7 on September 2, 2025 11:11

littledgg and others added 4 commits September 8, 2025 11:27

enable target model in cuda graph

3263006

Merge pull request #1 from littledgg/mtp

5b75ade

Merge develop And Enable target model in cuda graph

add decorator for target model

4772a4f

enable draft model in cudagraph v0.5

4a0a6df

gongshaotian force-pushed the mtp branch from 1b9da7b to 5b75ade on September 12, 2025 03:35

littledgg and others added 13 commits September 12, 2025 11:36

revert revrt cum_offset

ec4a2df

Merge pull request #3 from littledgg/mtp

529214c

Mtp

enable target model in cudagraph v0.9 And clean debug code

2dd98da

Revert "success run ngram"

1d3ef67

This reverts commit 8351e83.

add reverted code

349988f

enable target model in cudagraph v0.9

15d3103

solve comment

7f11653

Merge pull request #4 from littledgg/mtp

bb9c911

[Excutor] Enable only Target model in cudagraph v1.0

merge remote mtp

d1115a7

merge develop & solve conflict

77e64ed

fix bid < 0

235b0ba

Enable Target Model Padding And Draft Model in cudagraph

3516be4

Merge branch 'mtp' of https://github.com/gongshaotian/FastDeploy into…

c6cdc17

… mtp

gongshaotian commented Sep 16, 2025

custom_ops/gpu_ops/rebuild_padding.cu (outdated)

littledgg and others added 2 commits September 16, 2025 21:28

solve problem

167fb58

Merge pull request #5 from littledgg/mtp

4c10571

Enable Target Model Padding And Draft Model in cudagraph

littledgg reviewed Sep 24, 2025

fastdeploy/worker/gpu_model_runner.py (outdated)