Launch expert_service before kv_cache initialization in worker_process #3045
Thanks for your contribution!
if self.cfg.scheduler_config.name == "splitwise":
    self.scheduler.start(role, host_ip, disaggregate)

time.sleep(1)
Why is the sleep needed? Is it safe?
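That was already in the earlier code; I did not touch it. I took a look: this scheduler has to start APIScheduler and InferScheduler. On startup, both first connect to their Redis objects and then start a thread that runs an event loop. I don't think the sleep is necessary: the Redis connection in both functions is made by the main process, and only the event loop runs in a thread, so it should not matter. @ltd0924, please confirm as well.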
Starting the child process requires a wait.
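As an illustration of the trade-off discussed in this thread (not code from this PR): a fixed `time.sleep(1)` only guesses how long a child process needs to start, whereas an explicit readiness signal waits exactly as long as required. The sketch below assumes a hypothetical child that prints `ready` once its setup is done. `CHILD_CODE`, `start_child_and_wait`, and `stop_child` are made-up names, and the real scheduler and expert_service startup may well need a different mechanism.

```python
import subprocess
import sys

# Hypothetical child: does its setup, then announces readiness on stdout.
CHILD_CODE = """
import sys, time
time.sleep(0.2)             # stand-in for real setup work
print("ready", flush=True)  # readiness signal for the parent
sys.stdin.read()            # stay alive until the parent closes stdin
"""


def start_child_and_wait():
    """Start the child and block until it reports that it is ready."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    # readline() returns as soon as the child prints its first line,
    # or returns "" if the child dies before printing anything.
    line = proc.stdout.readline().strip()
    if line != "ready":
        proc.kill()
        proc.wait()
        raise RuntimeError(f"child failed to start, got {line!r}")
    return proc


def stop_child(proc):
    """Close stdin so the child exits, then reap it and return its exit code."""
    proc.stdin.close()
    code = proc.wait(timeout=10)
    proc.stdout.close()
    return code
```

Note that `readline()` has no timeout here, so a child that hangs during setup would block the parent; a production version would need to bound that wait.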
    1,
    self.cfg.parallel_config.data_parallel_size // self.cfg.nnode,
):
    time.sleep(1)
Same as above.
Remember to update the PR description~
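The section on the specific changes has been updated. The implementation progress has to stay for now: so far I have copied (cp) my changes onto the 2.1 branch and got them running there. Whether the develop branch runs still needs to be checked; @Wanglongzhi2001, could you help verify it when you have time?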
gongshaotian
left a comment
LGTM
d66e845 to caca035
gongshaotian
left a comment
LGTM
pcard-71500
Problem description: the PD-disaggregated EP + DP8 + TP1 scenario fails to run.
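Problem analysis: cudagraph execution depends on kv_cache initialization, and the earlier PR #2924 adjusted the service startup logic for that reason. In the EP scenario, however, kv_cache initialization requires expert_service to start the cache_manager. The earlier PR did not account for starting expert_service, so this PR integrates the expert_service launch into the startup logic.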
The specific changes are as follows:
With the changes above, in the EP scenario every cache_manager is started before the worker performs kv_cache initialization.
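The ordering this PR aims for can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the FastDeploy implementation: `CacheManager`, `launch_then_init`, and `num_local_dp` are hypothetical names, and a thread stands in for whatever process the real expert_service launches.

```python
import threading


class CacheManager:
    """Hypothetical stand-in for a cache manager started by an expert service."""

    def __init__(self, dp_rank):
        self.dp_rank = dp_rank
        self.ready = threading.Event()

    def start(self):
        # Real code would launch a process; here a thread just marks it ready.
        threading.Thread(target=self.ready.set, daemon=True).start()


def launch_then_init(num_local_dp, timeout=5.0):
    """Start every cache manager, wait for all of them, then init the KV cache."""
    log = []
    managers = [CacheManager(rank) for rank in range(num_local_dp)]
    for manager in managers:
        manager.start()
        log.append(f"launch cache_manager {manager.dp_rank}")
    for manager in managers:
        # Block until this manager reports ready, or give up after `timeout`.
        if not manager.ready.wait(timeout):
            raise TimeoutError(f"cache_manager {manager.dp_rank} not ready")
    log.append("init kv_cache")  # reached only once every manager is ready
    return log
```

Calling `launch_then_init(3)` returns a log in which the three launch entries precede the single `init kv_cache` entry, which is the invariant the description above states.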
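Implementation progress:
2.1 On the 2.1 branch, with the changes copied over (cp), DP disaggregation + EP + cudagraph runs and a single inference request completes.
2.2 On the develop branch there are currently bugs caused by other parts of the code; follow-up fix PRs will address them, e.g. [BugFix] num_seqs #3291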