[fix] qwen output inconsistency when top_p=0 #3634
Merged
Problem description
With a deployed Qwen-2-7b-Instruct model, sending the same request twice in a row with top_p=0 produces different outputs.
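Cause
The diff mainly comes from the apply_penalty_multi_scores step: between the two requests, the only input that differs is sampling_metadata.pre_token_ids.
Specifically, pre_token_ids is not reset to -1 when the second request is inferred.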
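The code in custom_ops/gpu_ops/token_penalty_multi_scores.cu does not use cur_len to mask out the invalid values past the current length. Instead, it relies on pre_ids[cur_len:] having been set to negative values (e.g. -1) beforehand to compute correctly.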
The V1 Scheduler, however, has no logic to reset pre_token_ids to -1 when a request is prefilled. V0 does have this logic.
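To illustrate the failure mode, here is a minimal pure-Python sketch. It is not the actual CUDA kernel: `apply_repetition_penalty`, the toy logits, and the buffer contents are all hypothetical, and only a simple repetition penalty is modeled. It shows how a penalty step that skips negative ids but never consults cur_len gives a different result when the buffer still holds tokens from an earlier request.

```python
def apply_repetition_penalty(logits, pre_ids, penalty):
    """Hypothetical, simplified penalty step.

    Every non-negative id found anywhere in pre_ids is penalized.
    There is deliberately no cur_len argument: correctness depends on
    unused entries of pre_ids already being negative (e.g. -1).
    """
    out = list(logits)
    for tok in set(pre_ids):
        if tok < 0:
            continue  # sentinel entry, treated as "no token here"
        # Shrink positive logits, push negative logits further down.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


logits = [1.0, 2.0, -1.0, 0.5]
penalty = 4.0

# First request: the buffer was initialized to -1 everywhere.
fresh_pre_ids = [-1, -1, -1, -1]
# Second identical request: the buffer was never reset, so it still
# holds tokens generated by the first request (made-up values).
stale_pre_ids = [1, 3, 2, -1]

first = apply_repetition_penalty(logits, fresh_pre_ids, penalty)
second = apply_repetition_penalty(logits, stale_pre_ids, penalty)

print(argmax(first), argmax(second))
```

With these toy values the same logits select a different token on the second request, because the stale entries are penalized as if they belonged to the current sequence.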
Solution
Add logic to initialize pre_token_ids in the insert_tasks_v1 method.
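The idea behind the fix can be sketched as follows. This is an illustration under assumptions, not the code from this PR: the `SlotBuffers` class, its `insert_task` and `append_token` methods, and the list-based storage are hypothetical stand-ins for the real buffers touched by insert_tasks_v1.

```python
class SlotBuffers:
    """Hypothetical per-slot history buffer, one row per batch slot."""

    def __init__(self, max_slots, max_len):
        self.max_len = max_len
        # -1 marks "no token generated at this position yet".
        self.pre_token_ids = [[-1] * max_len for _ in range(max_slots)]

    def insert_task(self, slot):
        # Reset the row when a new request is placed into the slot, so
        # tokens left over from the previous request cannot leak into
        # the penalty computation of the new one.
        self.pre_token_ids[slot] = [-1] * self.max_len

    def append_token(self, slot, step, token_id):
        self.pre_token_ids[slot][step] = token_id


buffers = SlotBuffers(max_slots=2, max_len=4)

# First request occupies slot 0 and generates three tokens.
buffers.insert_task(0)
for step, tok in enumerate([1, 3, 2]):
    buffers.append_token(0, step, tok)
after_first_request = list(buffers.pre_token_ids[0])

# Second request reuses slot 0; the reset gives it a clean history.
buffers.insert_task(0)
at_second_request_start = list(buffers.pre_token_ids[0])

print(after_first_request, at_second_request_start)
```

Resetting at insertion time keeps the penalty step unchanged. Masking with cur_len inside the kernel would be an alternative design, but that is not what this PR describes.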