Contributor

@zeroRains zeroRains commented Jul 28, 2025

pcard-71500

Problem description: the EP + DP8 + TP1 scenario with PD disaggregation (prefill/decode separation) fails to run.

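Problem analysis: cudagraph execution depends on kv_cache having been initialized, and the earlier PR #2924 adjusted the service startup logic. In the EP scenario, however, kv_cache must be initialized through a cache_manager that expert_service launches. The earlier PR did not account for launching expert_service, so this PR integrates the expert_service launch into the startup logic.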

The specific changes are as follows:

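  1. This PR adds a synchronization signal, loaded_model_signal, which marks that each worker has finished model loading; expert_service is launched only after that.
  2. It also adds a synchronization signal, launched_expert_service_signal, to ensure that every expert_service has finished launching.
  3. It changes where the synchronization signal launched_cache_manager_signal takes effect (this signal tells the workers that cache_manager has started and that they can begin initializing kv_cache). Previously the signal was set as soon as the prefix-cache cache_manager had started; now it is set only after both the prefix-cache cache_manager and expert_service have finished launching.
  4. It switches the launch approach: a child thread monitors worker startup progress (the two progress bars shown at service startup), while the main thread waits for the model-loaded signal and, once it arrives, launches the other components (Scheduler, CacheManager, ExpertService, etc.).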

With these changes, in the EP scenario every cache_manager is launched before the workers initialize kv_cache.
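The ordering described above can be sketched as follows. This is only an illustration, not the code in this PR: `threading.Event` stands in for whatever signal mechanism the project actually uses, the `launch`, `worker`, `expert_service`, and `record` helpers are hypothetical, and the relative order of the prefix-cache cache_manager and expert_service launches is an assumption. The progress-monitoring child thread is left out.

```python
import threading


def launch(num_workers=2, num_expert_services=2):
    """Run the startup handshake once and return the ordered event trace."""
    # Stand-ins for the three signals named in the PR description.
    loaded_model_signal = [threading.Event() for _ in range(num_workers)]
    launched_expert_service_signal = [
        threading.Event() for _ in range(num_expert_services)
    ]
    launched_cache_manager_signal = threading.Event()

    trace = []
    trace_lock = threading.Lock()

    def record(message):
        with trace_lock:
            trace.append(message)

    def worker(rank):
        # Step 1: the worker reports that model loading is done.
        record(f"worker{rank}: model loaded")
        loaded_model_signal[rank].set()
        # Step 3: kv_cache init is gated on every cache_manager being up.
        launched_cache_manager_signal.wait()
        record(f"worker{rank}: kv_cache initialized")

    def expert_service(index):
        # Step 2: each expert_service reports that it has launched.
        record(f"expert_service{index}: cache_manager started")
        launched_expert_service_signal[index].set()

    workers = [
        threading.Thread(target=worker, args=(rank,)) for rank in range(num_workers)
    ]
    for thread in workers:
        thread.start()

    # Step 4: the main thread waits for all workers to finish loading ...
    for signal in loaded_model_signal:
        signal.wait()

    # ... and only then launches the remaining components.
    record("main: prefix-cache cache_manager started")
    services = [
        threading.Thread(target=expert_service, args=(index,))
        for index in range(num_expert_services)
    ]
    for thread in services:
        thread.start()
    for signal in launched_expert_service_signal:
        signal.wait()

    # Set the gate only after prefix-cache and every expert_service are ready.
    record("main: launched_cache_manager_signal set")
    launched_cache_manager_signal.set()

    for thread in workers + services:
        thread.join()
    return trace
```

Calling `launch()` returns a trace in which every "model loaded" entry precedes every expert_service entry, and every "kv_cache initialized" entry follows the point where launched_cache_manager_signal is set.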

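Implementation progress:

  1. A P node or a D node can be launched on its own.
  2. After launching PD, inference on a single request works.
    2.1 On the 2.1 branch, porting (cp) these changes over makes PD disaggregation + EP + cudagraph run and complete inference on a single request.
    2.2 On the develop branch there are currently bugs coming from other parts; follow-up fix PRs will address them, e.g. [BugFix] num_seqs #3291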

paddle-bot bot commented Jul 28, 2025

Thanks for your contribution!

if self.cfg.scheduler_config.name == "splitwise":
    self.scheduler.start(role, host_ip, disaggregate)

time.sleep(1)
Collaborator

Why is the sleep needed? Is it safe?

Contributor Author

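That was in the earlier code; I did not touch it. I took a look: this scheduler has to start APIScheduler and InferScheduler. Both connect to Redis when they start and then launch a thread to run an event loop. I think the sleep is probably unnecessary: the Redis connection in both functions is made by the main process, and only the event loop runs in a thread, so it should not be affected. @ltd0924, please confirm as well.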

Collaborator

Starting the child processes requires a wait.
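For context on the question above: one common alternative to a fixed `time.sleep(1)` is a bounded wait on an explicit readiness check. The sketch below is not what this PR does (the sleep is kept as is), and `wait_until` and its parameters are hypothetical names used only for illustration.

```python
import threading
import time


def wait_until(is_ready, timeout=10.0, interval=0.05):
    """Poll is_ready() until it returns True or the timeout elapses.

    Returns True if the condition became true in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    # Simulate a component that becomes ready shortly after launch.
    ready = threading.Event()
    threading.Timer(0.2, ready.set).start()
    print("ready" if wait_until(ready.is_set, timeout=2.0) else "timed out")
```

The trade-off is that a readiness check needs something observable to poll (a flag, a port, a shared signal), whereas a fixed sleep needs nothing but can be too short or needlessly long.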

    1,
    self.cfg.parallel_config.data_parallel_size // self.cfg.nnode,
):
    time.sleep(1)
Collaborator

Same as above.

@gongshaotian
Collaborator

Remember to update the PR description~

Contributor Author

zeroRains commented Aug 8, 2025

Remember to update the PR description~

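The "specific changes" part has been updated. The implementation progress section needs to stay for now: so far it is my changes ported (cp) onto the 2.1 branch that run successfully, and whether the develop branch runs still needs @Wanglongzhi2001 to help verify when available~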

yuanlehome previously approved these changes Aug 8, 2025
gongshaotian previously approved these changes Aug 8, 2025
Collaborator

@gongshaotian gongshaotian left a comment

LGTM

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit b23af29 into PaddlePaddle:develop Aug 11, 2025
12 of 15 checks passed
@zeroRains zeroRains deleted the ep_service branch August 11, 2025 11:43
7 participants