Skip to content

feat: Support parallel execution for roles with same dependencies#42

Merged
cheyang merged 4 commits intosgl-project:mainfrom
tlipoca9:main
Oct 8, 2025
Merged

feat: Support parallel execution for roles with same dependencies#42
cheyang merged 4 commits intosgl-project:mainfrom
tlipoca9:main

Conversation

@tlipoca9
Copy link
Copy Markdown
Contributor

Ⅰ. Motivation

Improve the role dependency sorting algorithm in RoleBasedGroup to support parallel execution of roles at the same dependency level, enhancing startup efficiency for distributed workloads. The original topological sorting algorithm could only execute roles sequentially, unable to leverage parallelism between roles with identical dependencies.

Details

In the current implementation, roles have implicit dependencies based on their definition order. In LLM scenarios, when loading a large model from S3 (typically taking 10-15 minutes), if multiple roles (like prefill role and decode role) need to be started simultaneously, waiting 10-15 minutes to start the second role is inefficient and unreasonable.

Ⅱ. Modifications

  1. Dependency Sorting Algorithm Refactor: Modified dependencyOrder function to return layered lists [][]string instead of a single sorted list, enabling parallel execution of roles at the same dependency level.
  2. Interface Changes: Updated DependencyManager.SortRoles interface to return [][]*workloadsv1alpha1.RoleSpec instead of []*workloadsv1alpha1.RoleSpec.
  3. Controller Adaptation: Modified PodReconciler and RoleBasedGroupReconciler to support layered role execution, using error aggregation instead of immediate error returns.
  4. Enhanced Test Cases: Added test cases for identical dependencies to validate parallel execution correctness.

Ⅲ. Does this pull request fix one issue?

NONE

Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

  • Added "same dependency" test cases to verify multiple roles with identical dependencies are correctly assigned to the same execution level.
  • Updated existing test case expected results to align with the new layered sorting algorithm.

Ⅴ. Describe how to verify it

  1. Run unit tests: go test ./pkg/dependency/... -v
  2. Create RoleBasedGroup instances containing multiple roles with identical dependencies, verify roles can start in parallel.
  3. Check controller logs to confirm roles at the same level are processed concurrently.

VI. Special notes for reviews

  • Algorithm changed from depth-first search to dependency-depth-based level assignment.
  • Uses stderrors.Join to aggregate errors within the same level instead of immediate returns.
  • Maintains backward-compatible API design while extending return types.

Checklist

  • Format your code make fmt.
  • Add unit tests or integration tests.

@tlipoca9 tlipoca9 changed the title feat: add order feat: Support parallel execution for roles with same dependencies Sep 25, 2025
@gujingit gujingit requested a review from Copilot September 26, 2025 04:39
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances the role dependency sorting algorithm to support parallel execution of roles with identical dependencies. The refactoring changes the sorting output from a single sequential list to layered lists, enabling roles at the same dependency level to be processed concurrently, improving startup efficiency for distributed workloads.

Key changes:

  • Modified dependency sorting to return layered role groups instead of sequential ordering
  • Updated interface signatures to support parallel execution patterns
  • Enhanced error handling to use error aggregation within execution layers

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
pkg/dependency/interface.go Updated DependencyManager interface to return layered role specs
pkg/dependency/dependency.go Refactored sorting algorithm from DFS to dependency-depth-based layering
pkg/dependency/dependency_test.go Updated test cases to validate layered sorting and added parallel dependency scenarios
internal/controller/workloads/rolebasedgroup_controller.go Modified reconciler to process role layers with error aggregation
internal/controller/workloads/pod_controller.go Updated pod controller to handle layered role execution

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment thread pkg/dependency/dependency.go
for _, roleList := range sortedRoles {
var joinedError error

for _, role := range roleList {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that we still reconcile roles serially here? Should we switch to using go routines?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's necessary to use goroutines in this context. Considering that the normal reconciliation logic typically executes very quickly, introducing goroutines here seems unnecessary. However, if you believe it's genuinely required, I'm willing to make the necessary modifications.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @tlipoca9 because it's eliminating dependency blocking through layered processing rather than using goroutines here.

Comment thread internal/controller/workloads/pod_controller.go Outdated
Comment thread internal/controller/workloads/pod_controller.go Outdated
Comment thread internal/controller/workloads/rolebasedgroup_controller.go Outdated
@tlipoca9 tlipoca9 requested review from bcfre and gujingit September 28, 2025 06:44
want: [][]string{
{"c"},
{"a", "b"},
},
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some more complex cases, such as

{
			name: "complex dependencies",
			dependencies: map[string][]string{
				"a": {"c", "b", "d"},
				"b": {"c"},
				"c": {},
				"d": {},
				"e": {"f"},
				"f": {"d"},
			},
			want: [][]string{
				{"c", "d"},
				{"b", "f"},
				{"a", "e"},
			},
			wantErr: false,
		},


import (
"context"
stderrors "errors"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the rename

@bcfre
Copy link
Copy Markdown
Collaborator

bcfre commented Sep 30, 2025

Thanks for your contribution! The layered approach you introduced indeed helps reduce the overall readiness time in complex scenarios.
Going forward, we might explore a couple of additional optimizations. If you’re interested, we’d be happy to see more contributions from you:

  • If the “not ready” error is only due to a specific Role, we should still update the RBG status for the Roles that are ready.
  • Roles across different layers may not always depend on each other. For example, in a case where a depends on b and c depends on d — if b is ready but d is not, we could still proceed with reconciling a instead of blocking both a and c. A possible implementation reference is in the JobSet controller

@cheyang
Copy link
Copy Markdown
Collaborator

cheyang commented Oct 8, 2025

@tlipoca9 Thank you for optimizing the RBG logic! We truly appreciate your expertise and look forward to your continued contributions to the project. Please don't hesitate to reach out if you have any questions or need support. If possible, Please let me know how to find you by email or slack!

@cheyang cheyang merged commit bfb723c into sgl-project:main Oct 8, 2025
3 checks passed
@cheyang cheyang added this to the v0.5.0 milestone Oct 9, 2025
@tlipoca9
Copy link
Copy Markdown
Contributor Author

tlipoca9 commented Oct 9, 2025

Thanks for your contribution! The layered approach you introduced indeed helps reduce the overall readiness time in complex scenarios. Going forward, we might explore a couple of additional optimizations. If you’re interested, we’d be happy to see more contributions from you:

  • If the “not ready” error is only due to a specific Role, we should still update the RBG status for the Roles that are ready.
  • Roles across different layers may not always depend on each other. For example, in a case where a depends on b and c depends on d — if b is ready but d is not, we could still proceed with reconciling a instead of blocking both a and c. A possible implementation reference is in the JobSet controller

Thank you so much for your encouraging feedback! I truly appreciate you taking the time to review the layered approach and share those valuable optimization insights. I'm genuinely interested in continuing to contribute to the project, though I'm still learning and might need some guidance along the way. Please feel free to point out any areas where I can improve - I'm always open to feedback!

@tlipoca9
Copy link
Copy Markdown
Contributor Author

tlipoca9 commented Oct 9, 2025

@tlipoca9 Thank you for optimizing the RBG logic! We truly appreciate your expertise and look forward to your continued contributions to the project. Please don't hesitate to reach out if you have any questions or need support. If possible, Please let me know how to find you by email or slack!

@cheyang Thanks! My email and slack is same: [email protected]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants