[SOT] Make custom_op dy&st unified #2733

DrRyanHuang · 2025-07-07T12:07:36Z

PR Description

# File: fastdeploy/model_executor/ops/gpu/fastdeploy_ops.py

@unified
def static_op_extract_text_token_output(max_seq_len,max_seq_len_index,mm_token_num_len,seq_lens_this_time,cu_seqlens_q,score_text):
    # The output variable's dtype use default value 'float32',
    # and the actual dtype of output variable will be inferred in runtime.
    if in_dynamic_or_pir_mode():
        outs = _C_ops._run_custom_op("static_op_extract_text_token_output", max_seq_len,max_seq_len_index,mm_token_num_len,seq_lens_this_time,cu_seqlens_q,score_text)
        res = []
        start_idx = 0
        res.append(outs[start_idx])
        start_idx += 1
        print("static_op_extract_text_token_output op original res is: ", res)
        return res[0] if len(res)==1 else res

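The generated static_op_extract_text_token_output implementation above unpacks its output based on length:
when the result is a single-element list it returns that element directly, otherwise it returns the whole list.
This logic makes the return type of the interface inconsistent between dynamic-graph and static-graph modes.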
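The unpacking rule can be looked at in isolation. The sketch below is a minimal pure-Python illustration: `unpack_single` is a hypothetical helper name (it does not exist in FastDeploy) that copies the `res[0] if len(res)==1 else res` expression from the generated wrapper, and plain strings stand in for tensors.

```python
def unpack_single(res):
    # Same expression as in the generated wrapper:
    # a one-element list is unwrapped, anything else is returned as-is.
    return res[0] if len(res) == 1 else res

one_output = unpack_single(["out0"])           # the element itself
two_outputs = unpack_single(["out0", "out1"])  # still a list

print(type(one_output).__name__, type(two_outputs).__name__)
```

The return type therefore depends on how many outputs the op declares, which is why a caller written against the list-returning C++ extension sees a different type once it goes through this wrapper.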

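During SOT dynamic-to-static execution (including simulation), the return value always goes through the unpacking above;
in dynamic-graph mode, however, the original implementation returns the output of the underlying C++ extension op as-is, so we add a further processing step:

    # File: fastdeploy/import_ops.py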
    @functools.wraps(original_custom_op)
    def unified_op(*args, **kwargs):
        if paddle.in_dynamic_mode():
            res = original_cpp_ext_op(*args, **kwargs)
            if res is None:
                return None
            # TODO(DrRyanHuang): Remove this if when we align the implementation of custom op and C++ extension
            if isinstance(res, list) and len(res) == 1:
                return res[0]
            return res

cc @SigureMo

UPDATE 2025.07.11

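Recording how this bug was tracked down. It took a few detours and was fairly hard to diagnose:

With SOT enabled, the following error is reported: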

Error: /workspace/Paddle/paddle/phi/kernels/gpu/embedding_kernel.cu:41 
Assertion `id < N` failed. Id should smaller than 103424 but received an id value: 12884901890.
Error: /workspace/Paddle/paddle/phi/kernels/gpu/embedding_kernel.cu:41 
Assertion `id < N` failed. Id should smaller than 103424 but received an id value: 11712375818919.

We printed the ids at the Embedding:

ids_remove_padding: Tensor(shape=[969], dtype=int64, place=Place(gpu:0), stop_gradient=True,
       [706           , 12884901890   , 11712375818919, 11712375818919,
        4726599895656434343, 100273        , 3165          , 23            ,
        2969          , 93963         , 50343         , 93919         ,
        4             , 93963         , 101304        , 100295        ,
        100295        , 100295        , 100295        , 100295        ,

A lot of inexplicably large numbers show up.
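One quick sanity check on values like these (not part of the original investigation, just a sketch) is to split each 64-bit value into its 32-bit halves. `split_int64` is a hypothetical helper written for this note.

```python
def split_int64(value):
    """Return the (low, high) 32-bit halves of a non-negative 64-bit value."""
    low = value & 0xFFFFFFFF
    high = (value >> 32) & 0xFFFFFFFF
    return low, high

for v in (12884901890, 11712375818919):
    print(v, split_int64(v))
# 12884901890 (2, 3)
# 11712375818919 (2727, 2727)
```

Both values decompose into pairs of small 32-bit integers, which is at least consistent with an int64 slot that was never written and still holds unrelated data. It is only suggestive and does not prove where the data came from.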

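Comparing with dynamic-graph mode, position 0 of each row, which should have been -1, had turned into these large numbers.

We know that in dynamic-graph mode the first column acts as a mask: a value of -1 means only one request is currently being run. So why would SOT produce a huge number here?

Another observation: with a test script that sends 1 request there is no problem, but with 2 requests the problem appears. At first we suspected a slice issue.

We printed the Embedding result and found that the first pass through the main loop was normal; starting from the second pass, many large numbers appeared (though no error was raised at that point).

The output of the first pass is the input of the second, so something goes wrong in that handoff.

In the custom op update_inputs we found where column 0 is assigned: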

FastDeploy/custom_ops/gpu_ops/update_inputs.cu

Lines 48 to 60 in c08561c

    
    if (thread_idx < bsz) {
        const int seq_len_this_time = seq_lens_this_time[thread_idx];
        const int seq_len_encoder = seq_lens_encoder[thread_idx];
        const int seq_len_decoder = seq_lens_decoder[thread_idx];

        seq_lens_decoder[thread_idx] = stop_flag_now ?
            0 : (seq_len_encoder > 0 ?
            (seq_len_encoder + seq_len_decoder) : seq_len_decoder + 1);

        seq_lens_this_time[thread_idx] = stop_flag_now ? 0 : 1;
        seq_lens_encoder[thread_idx] = 0;
        int64_t *input_ids_now = input_ids + thread_idx * input_ids_stride;
        input_ids_now[0] = next_tokens[thread_idx];

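Line 60 copies next_tokens into column 0 of input_ids_now, so there are two possibilities: either the next_tokens data is wrong, or input_ids_now was never assigned (given the random-looking large numbers seen earlier, we leaned toward the latter).

If it was never assigned, look at the if condition thread_idx < bsz above: that can only happen when bsz=1, in which case only the first row gets assigned even though we pass in multiple requests.

In other words, no matter how many requests there are, only the first one is used, so most likely there is a slice or index operation somewhere that we had not accounted for.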
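The effect of the `thread_idx < bsz` guard can be reproduced with a small pure-Python sketch. This is not the CUDA kernel: `update_first_column` is a hypothetical stand-in, nested lists stand in for the `input_ids` buffer, and `STALE` is a placeholder for whatever the buffer held before the write.

```python
def update_first_column(input_ids, next_tokens, bsz):
    """Mimic the guarded write: only rows with index < bsz are updated."""
    for thread_idx in range(len(input_ids)):  # one "thread" per row
        if thread_idx < bsz:
            input_ids[thread_idx][0] = next_tokens[thread_idx]
    return input_ids

STALE = 12884901890  # placeholder for leftover buffer contents
rows = [[STALE, 7, 8], [STALE, 9, 10]]

# Two requests are present, but the op is told bsz=1.
update_first_column(rows, next_tokens=[706, 100273], bsz=1)
print(rows[0][0], rows[1][0])
```

With `bsz=1` only row 0 receives its next token and row 1 keeps the stale value, which matches the symptom of large numbers appearing in column 0 from the second request onward.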

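At this point @zyfncg suggested enabling FLAGS_print_ir to print the program, and reading it from the end backwards. Unfortunately the dev machine environment has since crashed, so the log can no longer be shown.

The log made it quite clear that there was an operation taking only the first element, namely: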

FastDeploy/fastdeploy/model_executor/models/ernie4_5_vl/ernie4_5_vl_moe.py

Lines 441 to 448 in f6ffbc3

    
    hidden_states = extract_text_token_output(
        max_seq_len,
        max_seq_len_index.cast("int32"),
        image_token_num,
        forward_meta.seq_lens_this_time,
        forward_meta.cu_seqlens_q,
        score_text,
    )[0].cast(self._dtype)

That can't be right. I had previously gone through this custom op's infer_meta together with @xiaoxiaohehe001:

FastDeploy/custom_ops/gpu_ops/extract_text_token_output.cu

Lines 74 to 83 in f6ffbc3

    
    std::vector<std::vector<int64_t>> ExtractTextTokenOutputInferShape(const std::vector<int64_t>& max_seq_len_shape,
                                                                       const std::vector<int64_t>& max_seq_len_index_shape,
                                                                       const std::vector<int64_t>& mm_token_num_len_shape,
                                                                       const std::vector<int64_t>& seq_lens_this_time_shape,
                                                                       const std::vector<int64_t>& cu_seqlens_q_shape,
                                                                       const std::vector<int64_t>& score_text_shape) {
        const int bsz = seq_lens_this_time_shape[0];
        const int hidden_size = score_text_shape[1];
        return {{bsz, hidden_size}};
    }

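It clearly returns a list, doesn't it? Taking element 0 should just pull the Tensor out of that list. How could it end up as an index operation on the Tensor itself?

In FastDeploy, dynamic-graph mode goes through the C++ extension and static-graph mode uses the custom op. The logic is as follows: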

FastDeploy/fastdeploy/import_ops.py

Lines 78 to 81 in f6ffbc3

    
    def unified_op(*args, **kwargs):
        if paddle.in_dynamic_mode():
            return original_cpp_ext_op(*args, **kwargs)
        return original_custom_op(*args, **kwargs)

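Going back to the custom op at the top of the GitHub PR description (the one used in static-graph mode): it performs an unpacking step, return res[0] if len(res)==1 else res, so if the returned list has only one element, that element is returned directly.

In dynamic-graph mode there is no such unpacking, so the call site has to add a [0] index manually.

So when running SOT dynamic-to-static, element 0 is taken twice, which is why the problem above always shows up.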
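The double indexing can be sketched in pure Python. Everything here is a stand-in: `dynamic_path` and `static_path` are hypothetical names for the C++ extension and the generated wrapper, and a nested list plays the role of a [bsz=2, hidden=3] tensor.

```python
hidden = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # stand-in for a [2, 3] tensor

def dynamic_path():
    # C++ extension style: returns a list of outputs.
    return [hidden]

def static_path():
    # Generated wrapper style: a one-element list is already unpacked.
    res = [hidden]
    return res[0] if len(res) == 1 else res

dyn = dynamic_path()[0]  # picks the tensor out of the list: 2 rows
st = static_path()[0]    # indexes the tensor itself: row 0 only

print(len(dyn), len(st))
```

The same `[0]` at the call site means "take the only output" on the dynamic path but "take the first row" on the static path, so every later op sees a batch of one.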

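This PR is only a patch on top of the current op system. Going forward the ops should be standardized so that the C++ layer uniformly returns a vector, rather than a mix of vector<Tensor> and Tensor.

Thanks to @zyfncg and @SigureMo for debugging this with me.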

paddle-bot · 2025-07-07T12:07:41Z

Thanks for your contribution!

SigureMo · 2025-07-07T12:13:28Z

fastdeploy/import_ops.py

    """
    static_op_prefix = "static_op_"
    static_op_names = [k for k in global_ns if k.startswith(static_op_prefix)]
-    enforce_eager = int(os.getenv("FD_ENFORCE_EAGER", "0")) == 1


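The new implementation no longer needs FD_ENFORCE_EAGER; wrapping with wrap_unified_op is enough to pick the most suitable path dynamically and cheaply:

Dynamic-graph mode uses the C++ extension

Static-graph mode uses the custom op

This PR adds to the output path the logic that has always existed in the custom op (unpack if there is only one element), so that dynamic and static behavior are unified and do not diverge.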
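The dispatch-and-normalize idea described above can be sketched without Paddle. This is an assumption-laden illustration, not the code in fastdeploy/import_ops.py: `wrap_unified_op_sketch` is a hypothetical name, and the `in_dynamic_mode` callable is injected in place of `paddle.in_dynamic_mode()` so the example runs standalone.

```python
import functools

def wrap_unified_op_sketch(cpp_ext_op, custom_op, in_dynamic_mode):
    """Pick an implementation per call and normalize the return value."""
    @functools.wraps(custom_op)
    def unified_op(*args, **kwargs):
        if in_dynamic_mode():
            res = cpp_ext_op(*args, **kwargs)
            # Apply the same rule the generated custom-op wrapper applies.
            if isinstance(res, list) and len(res) == 1:
                return res[0]
            return res
        return custom_op(*args, **kwargs)
    return unified_op

mode = {"dynamic": True}  # hypothetical toggle for the demo
op = wrap_unified_op_sketch(
    cpp_ext_op=lambda x: [x * 2],  # extension style: returns a list
    custom_op=lambda x: x * 2,     # wrapper style: already unpacked
    in_dynamic_mode=lambda: mode["dynamic"],
)
dynamic_result = op(3)
mode["dynamic"] = False
static_result = op(3)
```

Because the mode is checked on every call rather than once at import time, a single wrapped function serves both paths, and callers get the same return shape from either one.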

Jiang-Jia-Jun · 2025-07-08T11:42:08Z

Also fixes issue #2739

make_custom_op dy&st unified

e286ce1

Merge branch 'develop' into make_custom_op_unified

f5b1343

SigureMo approved these changes Jul 7, 2025

View reviewed changes

DrRyanHuang mentioned this pull request Jul 7, 2025

[SOT] Enable SOT Dy2St in Multimodal Model #2735

Merged

DrRyanHuang added 3 commits July 8, 2025 12:42

Merge branch 'develop' into make_custom_op_unified

e8a6a77

Merge branch 'develop' into make_custom_op_unified

343d337

add instance judgement

de299fb

ming1753 approved these changes Jul 8, 2025

View reviewed changes

xiaoxiaohehe001 approved these changes Jul 8, 2025

View reviewed changes

ming1753 merged commit f72c4de into PaddlePaddle:develop Jul 8, 2025
3 checks passed

DrRyanHuang deleted the make_custom_op_unified branch July 8, 2025 11:21
