[KEP-30]: Introduce InstanceSet Workload Support in RoleBasedGroup for Improved LLM Orchestration #26

Merged
cheyang merged 3 commits into sgl-project:main from veophi:kep/instanceset
Nov 7, 2025

veophi (Contributor) commented Sep 16, 2025

Ⅰ. Motivation

The current RoleBasedGroup (RBG) relies on multiple external workloads such as Deployment, StatefulSet, and LeaderWorkerSet (LWS).
This dependency limits RBG's extensibility and increases the complexity for users who are not deeply familiar with the Kubernetes workload ecosystem.

To better serve large language model (LLM) inference and training scenarios and reduce user cognitive overhead, this PR proposes introducing InstanceSet as a first-class workload type in RBG.
InstanceSet enables:

  • Orchestration of both single-node and multi-node inference instances.
  • Treating the Instance as the minimal orchestration unit, with richer lifecycle control.
  • Features such as in-place update, gang scheduling, and traffic lifecycle binding at the instance level.

Ⅱ. Modifications

Introduce InstanceSet KEP docs:

  • Added support for a new InstanceSet workload type within RBG.
  • Defined Instance concept and CRD abstraction:
    • An Instance can contain one or more Pods, supporting differentiated templates and replica counts.
    • Pod lifecycle binding within an Instance: synchronized start, stop, and traffic drain.
    • Pod affinity/topology settings and gang scheduling support within an Instance.
  • Implemented upgrade policies:
    • Instance-level and Pod-level in-place/recreate upgrade rules based on changed fields.
  • Added feature comparison table between InstanceSet and LWS.
  • Documented user stories covering in-place updates, rolling updates with MaxSurge, etc.
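
As a rough illustration of the field-change-based upgrade rule listed above, the following Python sketch chooses between an in-place update and a recreate by diffing two Pod template dictionaries. The rule it encodes (only container image changes qualify for an in-place update) is an assumption made for the example, not the rule defined in the KEP. The names `IN_PLACE_FIELDS`, `changed_paths`, and `decide_upgrade` are hypothetical.

```python
from typing import Any, Dict, Set

# ASSUMPTION: only these leaf fields may change for an in-place update.
# The KEP defines the authoritative rule; this set is illustrative only.
IN_PLACE_FIELDS: Set[str] = {"image"}


def changed_paths(old: Any, new: Any, prefix: str = "") -> Set[str]:
    """Return dotted paths of leaf values that differ between old and new."""
    if isinstance(old, dict) and isinstance(new, dict):
        paths: Set[str] = set()
        for key in old.keys() | new.keys():
            paths |= changed_paths(old.get(key), new.get(key), f"{prefix}{key}.")
        return paths
    if isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        paths = set()
        for i, (o, n) in enumerate(zip(old, new)):
            paths |= changed_paths(o, n, f"{prefix}{i}.")
        return paths
    # Leaves, type mismatches, and lists of different length end up here.
    return set() if old == new else {prefix.rstrip(".")}


def decide_upgrade(old_template: Dict[str, Any], new_template: Dict[str, Any]) -> str:
    """Return 'none', 'in-place', or 'recreate' for a template change."""
    diffs = changed_paths(old_template, new_template)
    if not diffs:
        return "none"
    # In-place only if every changed leaf field is in the allowed set.
    if all(path.rsplit(".", 1)[-1] in IN_PLACE_FIELDS for path in diffs):
        return "in-place"
    return "recreate"
```

Under this assumed rule, changing only `containers[0].image` yields `in-place`, while changing a resource request or adding a container yields `recreate`. A real controller would apply the same decision per Pod and per Instance, following the field lists in the KEP.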

Ⅲ. Does this pull request fix one issue?

fixes #3 #21


Ⅳ. List the added test cases

TBD. Unit tests and integration tests will be added for:

  • Instance creation and orchestration
  • In-place upgrade behavior
  • Rolling update behavior
  • Gang scheduling and ready policy
  • Field-change-based upgrade path validation

No backward-compatibility tests are planned for switching existing RBG workloads to InstanceSet, since that is out of scope.

Ⅴ. Describe how to verify it


Ⅵ. Special notes for reviews

  • This PR only supports newly created RBGs using InstanceSet and does not support swapping existing RBG workloads to InstanceSet directly.
  • Consider aligning on which RBG API exposure option (Option 1 or Option 2) should be accepted.

Checklist

  • Format your code with make fmt.
  • Add unit tests or integration tests.
  • Update the documentation related to the change.



#### Option 2: Expose in a Way Compatible with LWS
Keep the RBG API compatible with the existing **LeaderWorkerSet (LWS)** structure and semantics,
Collaborator

I prefer Option 2. I think it can support both the single template (previously backed by StatefulSet) and the LWS template (previously backed by LWS), while still letting end users opt in to InstanceSet capabilities behind the same fields.

components:
  - name: leader
    size: 1
    serviceName: deepseek-r1-master
cheyang (Collaborator) commented Sep 17, 2025

Is this also a headless service? And will it be created automatically by the InstanceSet controller?

yangsoon (Contributor) commented Sep 17, 2025

  1. In the design, serviceName represents the name of the headless service.
  2. In my opinion, since headless services are Kubernetes resources within an LLM application, they are better created by the platform.

Contributor Author

  1. In the design, serviceName represents the name of the headless service.
  2. In my opinion, since headless services are Kubernetes resources within an LLM application, they are better created by the platform.

agree
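
To make the `serviceName` discussion above concrete, here is a minimal Python sketch of the headless Service manifest that such a name would refer to, built as a plain dictionary. What makes a Service headless is `clusterIP: None`. Everything else is an assumption for illustration: the helper name `headless_service`, the namespace, the label key `example.com/component`, and the choice to set `publishNotReadyAddresses`. The thread leaves open whether the platform or the InstanceSet controller creates this object, and the sketch takes no position on that.

```python
from typing import Any, Dict


def headless_service(service_name: str, namespace: str, selector: Dict[str, str]) -> Dict[str, Any]:
    """Build a headless Service manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": service_name, "namespace": namespace},
        "spec": {
            # The literal string "None" is what makes the Service headless.
            "clusterIP": "None",
            # ASSUMPTION: the selector labels are illustrative, not from the KEP.
            "selector": selector,
            # ASSUMPTION: publish DNS records before Pods are Ready so that
            # peers can discover each other during startup.
            "publishNotReadyAddresses": True,
        },
    }


# Example using the serviceName from the snippet under review.
svc = headless_service(
    "deepseek-r1-master",
    "default",
    {"example.com/component": "leader"},
)
```

The resulting dictionary can be serialized to YAML or JSON and applied with any Kubernetes client.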

veophi mentioned this pull request Oct 11, 2025

cheyang (Collaborator) left a comment

/lgtm
/approve

cheyang added this to the v0.5.0 milestone Oct 13, 2025
cheyang closed this in #52 Oct 15, 2025
cheyang reopened this Nov 7, 2025
cheyang merged commit be55d31 into sgl-project:main Nov 7, 2025
6 checks passed
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.

Development

Successfully merging this pull request may close these issues.

Development Roadmap (v0.4.0)

4 participants