
Conversation

@littledgg
Contributor

The root cause is that the current append attention design for long sequences (longer than max_partition_size) is incompatible with CUDA graph; the problem did not surface earlier, presumably because CUDA graph had not been used for long-sequence workloads.
First problem: the nosplit_kv_kernel branch only launches multi_query_append_attention_warp1_4_kernel, while the split_kv_kernel branch launches multi_query_append_attention_warp1_4_kernel followed by merge_multi_chunks_decoder_kernel. If capture and replay take different branches, CUDA error 700 is raised.
Fix: num_chunks is always >= 1, so changing the condition for entering the nosplit_kv_kernel branch to num_chunks <= 0 guarantees that branch is never taken. The branch will be deleted later; the current form is just the minimal change that is easy to follow.
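As a minimal host-side sketch (assumed function names, not the actual FastDeploy dispatch code), the guard change makes the path deterministic because the dead branch can never be selected:

#include <cstdio>

// Hypothetical stand-ins for the real kernel launch wrappers.
void launch_nosplit_kv(int num_chunks) { std::printf("nosplit path, num_chunks=%d\n", num_chunks); }
void launch_split_kv(int num_chunks)   { std::printf("split path, num_chunks=%d\n", num_chunks); }

void dispatch(int num_chunks) {
  // Previously the guard depended on the runtime num_chunks, so a graph
  // captured on the nosplit path could be replayed for a request that needs
  // the split path (CUDA error 700). Since num_chunks >= 1 always holds, the
  // condition below can never be true, and every capture and replay launches
  // the same kernel sequence.
  if (num_chunks <= 0) {
    launch_nosplit_kv(num_chunks);  // dead branch, to be removed later
  } else {
    launch_split_kv(num_chunks);
  }
}

int main() {
  dispatch(1);  // short request
  dispatch(8);  // long request: same branch, same kernels
  return 0;
}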
Second problem: in the original split_kv_kernel branch, the launch parameters of multi_query_append_attention_warp1_4_kernel depend on num_chunks, and so do the sizes of the temporary buffers (tmp_workspace, temp_p, temp_d). Kernel launch parameters and allocation sizes are frozen when the CUDA graph is captured, so even after fixing the first problem, replaying a graph captured with a small num_chunks for a request that needs a larger num_chunks produces incorrect decoding output.
Fix: instead of computing num_chunks from the longest seq_len in the current batch, launch the kernel and allocate the buffers with the theoretical maximum num_chunks, i.e. div_up(encoder_max_partition_size, chunk_size), where encoder_max_partition_size is in practice the max_model_len parameter passed when the service is started. If the meaning of encoder_max_partition_size changes later, this also has to be updated.
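A hedged sketch of this sizing change, with illustrative buffer shapes that do not match the real tmp_workspace/temp_p/temp_d layouts:

#include <cstddef>
#include <vector>

inline int div_up(int a, int b) { return (a + b - 1) / b; }

// Illustrative buffers standing in for tmp_workspace / temp_p / temp_d;
// the real shapes and dtypes differ.
struct ChunkWorkspace {
  std::vector<float> tmp_workspace, temp_p, temp_d;
  int num_chunks = 0;
};

ChunkWorkspace make_workspace(int encoder_max_partition_size, int chunk_size,
                              int batch_size, int head_dim) {
  ChunkWorkspace ws;
  // Before: num_chunks = div_up(max_dec_len, chunk_size), with max_dec_len taken
  // from the longest sequence in the captured batch, so the graph froze a value
  // that could be too small when replayed for a longer request.
  // After: size everything for the theoretical maximum, so the captured launch
  // configuration and buffers cover any request the graph may replay.
  ws.num_chunks = div_up(encoder_max_partition_size, chunk_size);
  const std::size_t per_chunk =
      static_cast<std::size_t>(batch_size) * static_cast<std::size_t>(head_dim);
  ws.tmp_workspace.resize(static_cast<std::size_t>(ws.num_chunks) * per_chunk);
  ws.temp_p.resize(static_cast<std::size_t>(ws.num_chunks) * batch_size);
  ws.temp_d.resize(static_cast<std::size_t>(ws.num_chunks) * batch_size);
  return ws;
}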
Follow-up: for code cleanliness some branches need to be removed, and the C8 and C4 kernels need the same change.

@paddle-bot

paddle-bot bot commented Aug 5, 2025

Thanks for your contribution!

@paddle-bot paddle-bot bot added the contributor External developers label Aug 5, 2025
@littledgg
Contributor Author

The previously closed versions of this change:
#3086
#3104

EmmonsCurse previously approved these changes Aug 6, 2025
@littledgg
Contributor Author

littledgg commented Aug 6, 2025

These are the benchmark results for the lite model before and after the kernel change: decoding speed drops slightly, and enabling CUDA graph recovers it. Performance in the MTP scenario is unaffected, and the acceptance rate is unchanged.

[screenshot: benchmark results before and after the kernel change]

The server launch script:

python -m fastdeploy.entrypoints.openai.api_server --model ${model_path} \
    --max-num-seqs 256 --max-model-len 32768 \
    --port 8888 --engine-worker-queue-port 7102 \
    --metrics-port 7203 --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.8 \
    --graph-optimization-config '{"use_cudagraph":true}'

The request configuration and benchmark script:

# benchmarks/yaml/request_yaml/cuda_graph_test.yaml

top_p: 0.8
temperature: 0.8
metadata:
  min_tokens: 1024
max_tokens: 1024
repetition_penalty: 1.0
frequency_penalty: 0
presence_penalty: 0
# benchmarks/benchmark_serving.sh

# save output to infer_log.txt
python benchmark_serving.py \
  --backend openai-chat \
  --model EB45T \
  --endpoint /v1/chat/completions \
  --host 0.0.0.0 \
  --port 8888 \
  --dataset-name EBChat \
  --hyperparameter-path ./yaml/request_yaml/cuda_graph_test.yaml \
  --dataset-path ./filtered_sharedgpt_2000_input_1136_output_200_fd.json \
  --percentile-metrics ttft,tpot,itl,e2el,s_ttft,s_itl,s_e2el,s_decode,input_len,s_input_len,output_len \
  --metric-percentiles 80,95,99,99.9,99.95,99.99 \
  --num-prompts 2000 \
  --max-concurrency 256 \
  --save-result > ./infer_log.txt 2>&1 &

@littledgg
Contributor Author

littledgg commented Aug 6, 2025

For the C8 kernel, the 300B quantized model produces correct results with multi-GPU, chunked prefill, and CUDA graph all enabled; accuracy is fine.
[screenshots: C8 kernel accuracy verification]

There is currently no model available to verify the C4 kernel, so it is updated in the same way for now.

gongshaotian previously approved these changes Aug 6, 2025
Collaborator

@gongshaotian gongshaotian left a comment


LGTM

yuanlehome previously approved these changes Aug 7, 2025
  chunk_size = static_cast<uint32_t>(encoder_max_partition_size);
  }
- const int num_chunks = div_up(max_dec_len, chunk_size);
+ const int num_chunks = div_up(encoder_max_partition_size, chunk_size);
Collaborator


What is the reason for this change? If it is meant to fix num_chunks, I suggest using max_seq_len.

Collaborator

@yuanlehome yuanlehome Aug 7, 2025


Won't this cause resource redundancy in the kernel launch? Does performance drop because of it?

Contributor Author


Regarding resource redundancy: the allocated GPU memory will necessarily be redundant, and that is hard to avoid. As for compute, multi_query_append_attention_warp1_4_kernel takes num_chunks (previously computed from the longest sequence in the batch) as a launch parameter, so more blocks are indeed launched. However, the original design already had to handle batches whose requests have different num_chunks_this_seq (computed per sequence), so it already relies on an early exit

if (chunk_idx >= num_chunks_this_seq) {
  return;
}

to avoid wasting compute on chunks a sequence does not need.
As for merge_multi_chunks_decoder_kernel, its launch parameters do not depend on num_chunks, which is CUDA-graph friendly; num_chunks_this_seq is handled by a loop inside the kernel, and num_chunks is only used to compute some offsets. Its resource usage is therefore identical before and after the change, with no impact.
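A rough illustration with stub kernels (assumed signatures, not the real FastDeploy kernels) of the two behaviours described above:

// Chunk-level attention stub: blocks for chunks a sequence does not need exit
// immediately, like multi_query_append_attention_warp1_4_kernel does.
__global__ void attention_chunk_stub(const int* num_chunks_this_seq, float* partial) {
  const int seq_idx   = blockIdx.y;
  const int chunk_idx = blockIdx.x;  // gridDim.x == fixed maximum num_chunks
  if (chunk_idx >= num_chunks_this_seq[seq_idx]) {
    return;  // early exit: the oversized launch wastes almost no compute
  }
  partial[seq_idx * gridDim.x + chunk_idx] = 1.0f;  // per-chunk partial result
}

// Merge stub: the launch shape is independent of num_chunks; the per-sequence
// chunk count is handled by a loop, and the maximum is only used for offsets.
__global__ void merge_chunks_stub(const int* num_chunks_this_seq,
                                  const float* partial, float* out,
                                  int max_num_chunks) {
  const int seq_idx = blockIdx.x;
  float acc = 0.f;
  for (int c = 0; c < num_chunks_this_seq[seq_idx]; ++c) {
    acc += partial[seq_idx * max_num_chunks + c];
  }
  out[seq_idx] = acc;
}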
On performance: the benchmark results posted earlier show that decoding speed does drop somewhat, and enabling CUDA graph does not fully recover it, but latency goes down because concurrency increases. Why concurrency increases still needs to be analyzed.

Contributor Author

@littledgg littledgg Aug 7, 2025


It is indeed to fix num_chunks. encoder_max_partition_size is currently assigned from max_seq_len, so for now they can be treated as the same thing, but the meaning of encoder_max_partition_size may change later, and max_seq_len is easier to understand, so max_seq_len should be used.

@littledgg littledgg dismissed stale reviews from yuanlehome and gongshaotian via 9cd514c August 7, 2025 06:51
gongshaotian previously approved these changes Aug 7, 2025
Collaborator

@gongshaotian gongshaotian left a comment


LGTM

@gongshaotian gongshaotian merged commit 1e4968e into PaddlePaddle:develop Aug 8, 2025
12 of 14 checks passed
@gongshaotian gongshaotian changed the title [Excutor] Fixed the issue of CUDA graph execution failure caused by different branches during decoding [Executor] Fixed the issue of CUDA graph execution failure caused by different branches during decoding Aug 14, 2025
Jiang-Jia-Jun pushed a commit that referenced this pull request Aug 21, 2025
…ifferent branches during decoding (#3223) (#3512)

* Completely fix the decoding chunking issue

* update C8 and C4 kernel

* fix problem

* fix with pre-commit

* retain branch for mtp

Co-authored-by: Jundong Liu <[email protected]>
@littledgg littledgg deleted the long_seq_cudagraph branch November 28, 2025 07:38
