Skip to content

feat: impl rollout coordination for roles in rbg#91

Merged
cheyang merged 1 commit intosgl-project:mainfrom
veophi:impl/coordination
Nov 19, 2025
Merged

feat: impl rollout coordination for roles in rbg#91
cheyang merged 1 commit intosgl-project:mainfrom
veophi:impl/coordination

Conversation

@veophi
Copy link
Copy Markdown
Contributor

@veophi veophi commented Nov 7, 2025

Ⅰ. Motivation

More details in KEP .

A RBG demo using coordination feature:

kind: RoleBasedGroup
spec:
  coordination:
  - name: pd-rollout
    roles:
    - prefill
    - decode
    strategy:
      rollingUpdate:
        maxSkew: 1%
  roles:
  - name: prefill # Statefulset
     replicas: 7
  - name: decode #StatefulSet
     replicas: 3
  • a normal rolling update case:
    asciicast

  • a decode publish exception case. Prefill will automatically pause the upgrade proportionally.
    asciicast

Ⅱ. Modifications

RBG API:

  • rbg.spec.coordination field is added.
  • rbg.status.roleStatuses[x].updatedReplicas is added.
  • rbg.spec.roles[x].minReadySeconds is added.

RBG Controller:

  • calculate coordination partition step before reconciling roles.
  • update status before reconciling roles.
  • watch the changes of more roles' workload status fileds.

Ⅲ. Does this pull request fix one issue?

fixes #45

Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

Ⅴ. Describe how to verify it

VI. Special notes for reviews

Checklist

  • Format your code make fmt.
  • Add unit tests or integration tests.
  • Update the documentation related to the change.

Comment thread api/workloads/v1alpha1/rolebasedgroup_types.go Outdated
Comment thread pkg/reconciler/instanceset/sync/helper.go Outdated
@RongGu RongGu requested a review from Copilot November 10, 2025 02:04
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements a coordination feature for managing rolling updates across multiple roles in a RoleBasedGroup, enabling synchronized rollout strategies with configurable constraints on replica update skew.

  • Adds coordination API types to specify MaxSkew, Partition, and MaxUnavailable parameters
  • Implements coordination calculation logic to determine rolling update targets while respecting skew constraints
  • Extends role status tracking to include UpdatedReplicas field for coordination decisions

Reviewed Changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
api/workloads/v1alpha1/rolebasedgroup_types.go Adds Coordination and CoordinationRolloutStrategy types with MaxSkew, Partition, and MaxUnavailable fields
internal/controller/workloads/rolebasedgroup_controller.go Implements coordination calculation logic including CalculateCoordinationByRole and related helper functions
pkg/utils/utils.go Moves CalculatePartitionReplicas from instanceset/utils for broader reuse
pkg/reconciler/*.go Updates reconcilers to accept coordinationRollout parameter and apply coordination constraints
config/crd/bases/*.yaml Updates CRD schemas to include coordination specifications and updatedReplicas in status
client-go/applyconfiguration/*.go Adds generated apply configurations for coordination types
Comments suppressed due to low confidence (3)

pkg/reconciler/sts_reconciler.go:225

  • The log message says 'case 2' but the comment above says 'Case 3'. This inconsistency makes debugging difficult. Update the log message to 'case 3' to match the comment.
		logger.V(1).Info(fmt.Sprintf("case 2: rolling update started. partition %d, replicas: %d", partition, replicas))

pkg/reconciler/sts_reconciler.go:256

  • The log message says 'case 4' but the comment above says 'Case 5'. This inconsistency makes debugging difficult. Update the log message to 'case 5' to match the comment.
				"case 4: Replicas changed during rolling update. partition %d, replicas: %d", partition, replicas,

pkg/reconciler/sts_reconciler.go:277

  • The log message says 'case 5' but the comment above says 'Case 6'. This inconsistency makes debugging difficult. Update the log message to 'case 6' to match the comment.
			"case 5: Calculating the Partition during rolling update. partition %d, replicas: %d", partition, replicas,

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/controller/workloads/rolebasedgroup_controller.go
Comment thread pkg/reconciler/sts_reconciler.go Outdated
@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 30 out of 31 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/controller/workloads/rolebasedgroup_controller.go Outdated
Comment thread api/workloads/v1alpha1/rolebasedgroup_types.go Outdated
@veophi veophi changed the title [WIP]feat: impl rollout coordination for roles in rbg feat: impl rollout coordination for roles in rbg Nov 13, 2025
@RongGu RongGu requested a review from Copilot November 13, 2025 14:42
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 35 out of 36 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@RongGu RongGu requested a review from Syspretor November 13, 2025 16:59
Comment thread internal/controller/workloads/rolebasedgroup_controller.go
Comment thread internal/controller/workloads/rolebasedgroup_controller.go
Comment thread internal/controller/workloads/rolebasedgroup_controller.go
Comment thread coordination_demo.cast Outdated
Comment thread coordination_show.cast Outdated
Comment thread internal/controller/workloads/instance_controller.go
Comment thread internal/controller/workloads/instanceset_controller.go
Comment thread internal/controller/workloads/rolebasedgroup_controller.go
Comment thread internal/controller/workloads/rolebasedgroup_controller.go
Comment thread internal/controller/workloads/rolebasedgroup_controller.go
Comment thread api/workloads/v1alpha1/rolebasedgroup_types.go
@cheyang cheyang self-requested a review November 19, 2025 01:28
Copy link
Copy Markdown
Collaborator

@cheyang cheyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@cheyang cheyang merged commit e4a2d77 into sgl-project:main Nov 19, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Development Roadmap (v0.5.0)

7 participants