feat: Support parallel execution for roles with same dependencies#42
feat: Support parallel execution for roles with same dependencies#42cheyang merged 4 commits intosgl-project:mainfrom tlipoca9:main
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR enhances the role dependency sorting algorithm to support parallel execution of roles with identical dependencies. The refactoring changes the sorting output from a single sequential list to layered lists, enabling roles at the same dependency level to be processed concurrently, improving startup efficiency for distributed workloads.
Key changes:
- Modified dependency sorting to return layered role groups instead of sequential ordering
- Updated interface signatures to support parallel execution patterns
- Enhanced error handling to use error aggregation within execution layers
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| pkg/dependency/interface.go | Updated DependencyManager interface to return layered role specs |
| pkg/dependency/dependency.go | Refactored sorting algorithm from DFS to dependency-depth-based layering |
| pkg/dependency/dependency_test.go | Updated test cases to validate layered sorting and added parallel dependency scenarios |
| internal/controller/workloads/rolebasedgroup_controller.go | Modified reconciler to process role layers with error aggregation |
| internal/controller/workloads/pod_controller.go | Updated pod controller to handle layered role execution |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| for _, roleList := range sortedRoles { | ||
| var joinedError error | ||
|
|
||
| for _, role := range roleList { |
There was a problem hiding this comment.
It seems that we still reconcile roles serially here? Should we switch to using go routines?
There was a problem hiding this comment.
I don't think it's necessary to use goroutines in this context. Considering that the normal reconciliation logic typically executes very quickly, introducing goroutines here seems unnecessary. However, if you believe it's genuinely required, I'm willing to make the necessary modifications.
There was a problem hiding this comment.
I agree with @tlipoca9 because it's eliminating dependency blocking through layered processing rather than using goroutines here.
| want: [][]string{ | ||
| {"c"}, | ||
| {"a", "b"}, | ||
| }, |
There was a problem hiding this comment.
Please add some more complex cases, such as
{
name: "complex dependencies",
dependencies: map[string][]string{
"a": {"c", "b", "d"},
"b": {"c"},
"c": {},
"d": {},
"e": {"f"},
"f": {"d"},
},
want: [][]string{
{"c", "d"},
{"b", "f"},
{"a", "e"},
},
wantErr: false,
},|
|
||
| import ( | ||
| "context" | ||
| stderrors "errors" |
|
Thanks for your contribution! The layered approach you introduced indeed helps reduce the overall readiness time in complex scenarios.
|
|
@tlipoca9 Thank you for optimizing the RBG logic! We truly appreciate your expertise and look forward to your continued contributions to the project. Please don't hesitate to reach out if you have any questions or need support. If possible, Please let me know how to find you by email or slack! |
Thank you so much for your encouraging feedback! I truly appreciate you taking the time to review the layered approach and share those valuable optimization insights. I'm genuinely interested in continuing to contribute to the project, though I'm still learning and might need some guidance along the way. Please feel free to point out any areas where I can improve - I'm always open to feedback! |
@cheyang Thanks! My email and slack is same: [email protected] |
Ⅰ. Motivation
Improve the role dependency sorting algorithm in RoleBasedGroup to support parallel execution of roles at the same dependency level, enhancing startup efficiency for distributed workloads. The original topological sorting algorithm could only execute roles sequentially, unable to leverage parallelism between roles with identical dependencies.
Details
In the current implementation, roles have implicit dependencies based on their definition order. In LLM scenarios, when loading a large model from S3 (typically taking 10-15 minutes), if multiple roles (like prefill role and decode role) need to be started simultaneously, waiting 10-15 minutes to start the second role is inefficient and unreasonable.
Ⅱ. Modifications
dependencyOrderfunction to return layered lists[][]stringinstead of a single sorted list, enabling parallel execution of roles at the same dependency level.DependencyManager.SortRolesinterface to return[][]*workloadsv1alpha1.RoleSpecinstead of[]*workloadsv1alpha1.RoleSpec.PodReconcilerandRoleBasedGroupReconcilerto support layered role execution, using error aggregation instead of immediate error returns.Ⅲ. Does this pull request fix one issue?
NONE
Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.
Ⅴ. Describe how to verify it
go test ./pkg/dependency/... -vVI. Special notes for reviews
stderrors.Jointo aggregate errors within the same level instead of immediate returns.Checklist
make fmt.