Skip to content

Chore(rbg): add v1alpha2 related examples#231

Merged
cheyang merged 6 commits intosgl-project:mainfrom
Syspretor:docs/add-v1alpha2-docs
Mar 31, 2026
Merged

Chore(rbg): add v1alpha2 related examples#231
cheyang merged 6 commits intosgl-project:mainfrom
Syspretor:docs/add-v1alpha2-docs

Conversation

@Syspretor
Copy link
Copy Markdown
Collaborator

Ⅰ. Motivation

Ⅱ. Modifications

Ⅲ. Does this pull request fix one issue?

fixes #XXXX

Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

Ⅴ. Describe how to verify it

VI. Special notes for reviews

Checklist

  • Format your code make fmt.
  • Add unit tests or integration tests.
  • Update the documentation related to the change.

@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly updates the project's examples and API structure to align with the v1alpha2 API version of RoleBasedGroup (RBG). It introduces a new template specification for RoleBasedGroupSets, provides a comprehensive set of new examples demonstrating advanced v1alpha2 features, and reorganizes existing examples into a deprecated section. These changes aim to improve clarity, provide up-to-date usage patterns, and enhance the overall developer experience by showcasing the full capabilities of the latest API.

Highlights

  • API Structure Update: The RoleBasedGroupSetSpec in v1alpha1 and v1alpha2 APIs was refactored to introduce RoleBasedGroupTemplateSpec and rename the Template field to GroupTemplate. This change provides a more structured way to define RoleBasedGroup templates within a RoleBasedGroupSet.
  • Comprehensive v1alpha2 Examples Added: A wide array of new example YAML files were added to showcase the features of the v1alpha2 API version. These examples cover coordinated policies, engine runtime profiles, various RoleBasedGroup patterns (standalone, custom components, leader-worker), restart policies, scaling adapters, and different scheduling options including exclusive topology and gang scheduling with both scheduler-plugins and Volcano.
  • Example Reorganization and Deprecation: All previous examples were moved into a new deprecated directory. This clearly separates v1alpha1 and v1alpha2 examples, making it easier for users to find relevant configurations for the latest API version.
  • Documentation Alignment: Documentation links within doc/TOC.md, doc/features/exclusive-topology.md, and doc/quick_start.md were updated to reflect the new example paths and the reorganized directory structure.
  • Scheduler Configuration Clarification: Comments in the Kubernetes scheduler-plugins and Volcano scheduler manager files were enhanced to provide clearer guidance on default and configurable gang scheduling implementations.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@Syspretor Syspretor force-pushed the docs/add-v1alpha2-docs branch from d278af2 to 33aa574 Compare March 23, 2026 03:05
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a significant number of examples for the v1alpha2 API and refactors the examples directory structure. While the new examples are valuable, I've identified several issues that need attention. These include broken links in the documentation, inconsistencies between comments and code in YAML examples, a duplicated example file, and a potential service discovery issue in the SGLang inference example. Additionally, there's a critical inconsistency between the v1alpha2 type definitions and the auto-generated code that will cause compilation failures. I've provided specific feedback and suggestions to address these points.

I am having trouble creating individual review comments. Click here to see my feedback.

api/workloads/v1alpha2/zz_generated.deepcopy.go (677-681)

critical

There is a critical inconsistency between the generated DeepCopyInto function for RoleBasedGroupTemplateSpec and its definition in api/workloads/v1alpha2/rolebasedgroupset_types.go. The generated code attempts to copy an ObjectMeta field, but the type definition contains separate Labels and Annotations fields. This will cause a compilation error and indicates that make generate was likely run against a different version of the type definition file.

Please ensure the type definitions are consistent and regenerate the deepcopy functions. The v1alpha1 change to use an embedded metav1.ObjectMeta in RoleBasedGroupTemplateSpec is a good pattern to follow for v1alpha2 as well for consistency and correctness.

api/workloads/v1alpha1/rolebasedgroupset_types.go (23-32)

high

While this change to use metav1.ObjectMeta is a good improvement for v1alpha1, it seems inconsistent with the type definition for v1alpha2 as seen in api/workloads/v1alpha2/rolebasedgroupset_types.go. The v1alpha2 version uses separate Labels and Annotations fields. For consistency between API versions, it would be best if v1alpha2 also adopted the embedded metav1.ObjectMeta pattern. This is also related to a compilation issue I've noted on api/workloads/v1alpha2/zz_generated.deepcopy.go.

examples/inference/pd-disagg-standalone.yaml (68-71)

high

The --prefill and --decode arguments for the router container are hardcoded to point to the first pod of each role (e.g., sglang-pd-inference-prefill-0...). This approach is not robust for scaling, as the router will not be aware of any additional replicas created by HPA. Using the Kubernetes service DNS name (e.g., sglang-pd-inference-prefill:8001) would be a more scalable approach, allowing for load balancing across all available pods.

                  - --prefill
                  - "sglang-pd-inference-prefill:8001"
                  - --decode
                  - "sglang-pd-inference-decode:8002"

doc/TOC.md (33-34)

medium

The links to the PD-Disagg examples appear to be broken. The files examples/pd-disagg/sglang/sgl.md and examples/pd-disagg/dynamo/README.md were moved to the examples/deprecated/ directory, but the new links point to a non-existent path under examples/inference/pd-disagg/. Please update these links to point to the correct file locations.

doc/features/exclusive-topology.md (39)

medium

The link to rbgs-exclusive-topology.yaml appears to be broken. The path was updated to ../../examples/workload/rbgs/exclusive-topology.yaml, but based on the new file structure, the correct path seems to be ../../examples/basic/rbgs/rbgs-exclusive-topology.yaml. Please correct the link.

on a topology domain. See example [rbgs-exclusive-topology.yaml](../../examples/basic/rbgs/exclusive-topology.yaml).

examples/basic/coordinated-policy/coordinated-scaling.yaml (131-132)

medium

There is a discrepancy between the comment and the code. The comment on line 131 indicates that OrderReady is used for safe scaling, but the progression field on line 132 is set to OrderScheduled. To align with the comment's intent, please change this to OrderReady.

          # OrderReady: wait for all pods in batch to be ready before next batch
          progression: OrderReady

examples/basic/rbg/base.yaml (1-32)

medium

This file, examples/basic/rbg/base.yaml, has content identical to examples/basic/rbg/patterns/standalone-pattern.yaml. Having duplicate examples can be confusing and increases maintenance overhead. Please consider removing one of them to avoid redundancy.

examples/basic/rbg/role-temlate/rbg-with-roletemplates.yaml (1)

medium

There appears to be a typo in the directory name: role-temlate should be role-template. Please correct the directory name.

examples/basic/rbg/scheduling/scheduler-plugins-gang.yaml (16-18)

medium

The comment on line 9 states that the rbg.workloads.x-k8s.io/group-gang-scheduling: "true" annotation is required to enable gang scheduling, but it is commented out in the example. This is confusing for users. If the annotation is required, it should be uncommented.

  annotations:
    rbg.workloads.x-k8s.io/group-gang-scheduling: "true"
    # rbg.workloads.x-k8s.io/group-gang-scheduling-timeout: "120"

examples/basic/rbg/scheduling/volcano-gang.yaml (18-21)

medium

The comment on line 10 states that the rbg.workloads.x-k8s.io/group-gang-scheduling: "true" annotation is required to enable Volcano gang scheduling, but it is commented out in the example. This is confusing. If the annotation is required, it should be uncommented to provide a working example.

  annotations:
    rbg.workloads.x-k8s.io/group-gang-scheduling: "true"
    # rbg.workloads.x-k8s.io/group-gang-scheduling-volcano-queue: "default"
    # rbg.workloads.x-k8s.io/group-gang-scheduling-volcano-priority: "high-priority"

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR expands the repository’s v1alpha2 documentation and example manifests, particularly around inference deployments and advanced RBG features (gang scheduling, coordinated policies, scaling adapter, patterns), while also adding a set of deprecated v1alpha1-era example manifests for reference.

Changes:

  • Added/updated a broad set of v1alpha2 example YAMLs (RBG/RBGS patterns, update strategies, scheduling, scaling, coordinated policy, engine runtime profiles, inference PD-disagg).
  • Added a deprecated examples tree containing prior v1alpha1 and related manifests (single-node, multi-node, PD-disagg, mooncake, monitoring, model download, InstanceSet examples).
  • Updated docs to reference new/relocated examples and added clarifying package-level comments for scheduler implementations.

Reviewed changes

Copilot reviewed 25 out of 68 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
pkg/scheduler/volcano/manager.go Adds package-level guidance on enabling Volcano gang scheduling via controller/Helm configuration.
pkg/scheduler/k8s-scheduler-plugin/manager.go Adds package-level guidance noting scheduler-plugins as default and how to switch to Volcano.
examples/inference/agg-standalone.yaml Added placeholder (currently empty) aggregated inference example manifest.
examples/inference/agg-leader-worker.yaml Added placeholder (currently empty) aggregated leader/worker inference example manifest.
examples/inference/pd-disagg-standalone.yaml Adds a v1alpha2 PD-disaggregated SGLang standalone-pattern example (router/prefill/decode) + CoordinatedPolicy.
examples/inference/pd-disagg-leader-worker.yaml Adds a v1alpha2 PD-disaggregated leaderWorkerPattern example with roleTemplates.
examples/deprecated/single-node/vllm.yaml Adds deprecated single-node vLLM StatefulSet + Service example.
examples/deprecated/single-node/sglang.yaml Adds deprecated single-node SGLang StatefulSet + Service example.
examples/deprecated/rbgs/rbgs-base.yaml Adds deprecated v1alpha1 RoleBasedGroupSet base example.
examples/deprecated/rbgs/exclusive-topology.yaml Adds deprecated v1alpha1 RoleBasedGroupSet exclusive topology example.
examples/deprecated/pd-disagg/sglang/sglang-pd.yaml Adds deprecated v1alpha1 PD-disagg SGLang example using Patio runtime profile.
examples/deprecated/pd-disagg/sglang/sgl.md Adds deprecated markdown walkthrough for SGLang PD-disagg.
examples/deprecated/pd-disagg/sglang/patio-profile.yaml Adds deprecated ClusterEngineRuntimeProfile example for Patio runtime.
examples/deprecated/pd-disagg/sglang/model.yaml Adds deprecated OSS-backed PV/PVC + secret example for model distribution.
examples/deprecated/pd-disagg/dynamo/nats.yaml Adds deprecated NATS deployment/service dependency for Dynamo PD-disagg example.
examples/deprecated/pd-disagg/dynamo/etcd.yaml Adds deprecated etcd StatefulSet/service dependency for Dynamo PD-disagg example.
examples/deprecated/pd-disagg/dynamo/dynamo.yaml Adds deprecated v1alpha1 Dynamo PD-disagg RoleBasedGroup + Service example.
examples/deprecated/pd-disagg/dynamo/README.md Adds deprecated markdown walkthrough for Dynamo PD-disagg.
examples/deprecated/multi-nodes/vllm.yaml Adds deprecated multi-node vLLM LeaderWorkerSet example.
examples/deprecated/multi-nodes/sglang.yaml Adds deprecated multi-node SGLang LeaderWorkerSet example.
examples/deprecated/mooncake/pd-disaggregated-with-mooncake.yaml Adds deprecated v1alpha1 PD-disagg Mooncake-based example.
examples/deprecated/mooncake/aggregated-with-mooncake.yaml Adds deprecated v1alpha1 aggregated Mooncake-based example.
examples/deprecated/mooncake/Dockerfile.mooncake Adds deprecated Dockerfile to build a Mooncake-enabled base environment.
examples/deprecated/monitoring/podmonitor.yaml Adds deprecated PodMonitor example for Prometheus scraping.
examples/deprecated/model-download/model-download-job.yaml Adds deprecated job + RBAC example for model download/bootstrap.
examples/deprecated/instanceset/instanceset.yaml Adds deprecated v1alpha1 InstanceSet example.
examples/deprecated/instanceset/instanceset-with-template.yaml Adds deprecated v1alpha1 RBG with InstanceSet workload example.
examples/deprecated/instanceset/instanceset-with-lws.yaml Adds deprecated v1alpha1 RBG with InstanceSet + leaderWorkerSet patching example.
examples/deprecated/instanceset/instanceset-with-components.yaml Adds deprecated v1alpha1 RBG components example.
examples/deprecated/instanceset/instance.yaml Adds deprecated v1alpha1 Instance example manifest.
examples/deprecated/coordination/scaling/coordination-scaling-simple.yaml Adds deprecated v1alpha1 inline coordination scaling example.
examples/deprecated/coordination/scaling/coordination-scaling-order-ready.yaml Adds deprecated v1alpha1 inline coordination scaling (OrderReady) example.
examples/deprecated/coordination/scaling/coordination-scaling-fast.yaml Adds deprecated v1alpha1 inline coordination scaling (fast) example.
examples/deprecated/basics/volcano-gang.yaml Adds deprecated v1alpha1 Volcano gang scheduling example.
examples/deprecated/basics/scheduler-plugins-gang.yaml Adds deprecated v1alpha1 scheduler-plugins gang scheduling example.
examples/deprecated/basics/rolling-update.yaml Adds deprecated v1alpha1 rolling update example.
examples/deprecated/basics/rolling-update-with-partition.yaml Adds deprecated v1alpha1 rolling update with partition example.
examples/deprecated/basics/restart-policy.yaml Adds deprecated v1alpha1 restart policy example.
examples/deprecated/basics/rbg-with-roletemplates.yaml Adds deprecated v1alpha1 roleTemplates usage example.
examples/deprecated/basics/rbg-with-lws-workload.yaml Adds deprecated v1alpha1 LWS workload example.
examples/deprecated/basics/rbg-base.yaml Adds deprecated v1alpha1 base multi-role example.
examples/deprecated/basics/exclusive-topology.yaml Adds deprecated v1alpha1 exclusive topology example.
examples/deprecated/basics/coordination.yaml Adds deprecated v1alpha1 inline coordination rolling update example.
examples/basic/rbgs/rbgs-exclusive-topology.yaml Adds v1alpha2 RBGS exclusive topology example using groupTemplate.
examples/basic/rbgs/rbgs-base.yaml Adds v1alpha2 RBGS base example using groupTemplate.
examples/basic/rbg/update-strategy/rolling-update.yaml Adds v1alpha2 rolling update strategy example (RecreatePod/InPlaceIfPossible/InPlaceOnly).
examples/basic/rbg/update-strategy/rolling-update-with-partition.yaml Adds v1alpha2 rolling update partition + paused rollout example.
examples/basic/rbg/scheduling/volcano-gang.yaml Adds v1alpha2 Volcano gang scheduling example + required controller configuration notes.
examples/basic/rbg/scheduling/scheduler-plugins-gang.yaml Adds v1alpha2 scheduler-plugins gang scheduling example + notes.
examples/basic/rbg/scheduling/exclusive-topology.yaml Adds v1alpha2 RBG exclusive topology example.
examples/basic/rbg/scaling/scaling-adapter-with-hpa.yaml Adds v1alpha2 scaling adapter examples with HPA targeting adapters.
examples/basic/rbg/role-temlate/rbg-with-roletemplates.yaml Adds v1alpha2 roleTemplates + templateRef.patch example.
examples/basic/rbg/restart-policy/restart-policy.yaml Adds v1alpha2 restart policy example for multiple roles.
examples/basic/rbg/patterns/standalone-pattern.yaml Adds v1alpha2 standalonePattern example.
examples/basic/rbg/patterns/leader-worker-pattern.yaml Adds v1alpha2 leaderWorkerPattern example (plus a templateRef-based variant).
examples/basic/rbg/patterns/custom-components-pattern.yaml Adds v1alpha2 customComponentsPattern example.
examples/basic/rbg/dependency/role-dependencies.yaml Adds v1alpha2 role dependency examples (router-workers, PD chain, master-workers).
examples/basic/rbg/base.yaml Adds v1alpha2 base multi-role example.
examples/basic/engine-runtime/engine-runtime-profile.yaml Adds v1alpha2 ClusterEngineRuntimeProfile example + injection via engineRuntimes.
examples/basic/coordinated-policy/coordinated-scaling.yaml Adds v1alpha2 coordinated scaling examples via CoordinatedPolicy CRD.
examples/basic/coordinated-policy/coordinated-rolling-update.yaml Adds v1alpha2 coordinated rolling update example via CoordinatedPolicy CRD.
doc/quick_start.md Updates PD-disagg example links.
doc/features/exclusive-topology.md Updates RBGS exclusive topology example link.
doc/TOC.md Updates PD-disagg example links in the TOC.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread examples/inference/pd-disagg-leader-worker.yaml Outdated
Comment thread examples/basic/rbg/patterns/leader-worker-pattern.yaml
Comment thread examples/basic/coordinated-policy/coordinated-scaling.yaml Outdated
Comment thread doc/quick_start.md Outdated
Comment thread doc/TOC.md Outdated
Comment thread doc/features/exclusive-topology.md Outdated
Comment thread examples/inference/pd-disagg-standalone.yaml Outdated
Comment thread examples/inference/pd-disagg-leader-worker.yaml Outdated
Comment thread doc/features/exclusive-topology.md Outdated
@Syspretor Syspretor force-pushed the docs/add-v1alpha2-docs branch from 6648130 to 37ff926 Compare March 24, 2026 06:16
Comment thread examples/inference/agg-standalone.yaml
Comment thread examples/inference/agg-leader-worker.yaml Outdated
Copy link
Copy Markdown
Collaborator

@cheyang cheyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@cheyang cheyang merged commit 76d2b18 into sgl-project:main Mar 31, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants