Chore(rbg): add v1alpha2 related examples#231
Conversation
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly updates the project's examples and API structure to align with the Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
d278af2 to
33aa574
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces a significant number of examples for the v1alpha2 API and refactors the examples directory structure. While the new examples are valuable, I've identified several issues that need attention. These include broken links in the documentation, inconsistencies between comments and code in YAML examples, a duplicated example file, and a potential service discovery issue in the SGLang inference example. Additionally, there's a critical inconsistency between the v1alpha2 type definitions and the auto-generated code that will cause compilation failures. I've provided specific feedback and suggestions to address these points.
I am having trouble creating individual review comments. Click here to see my feedback.
api/workloads/v1alpha2/zz_generated.deepcopy.go (677-681)
There is a critical inconsistency between the generated DeepCopyInto function for RoleBasedGroupTemplateSpec and its definition in api/workloads/v1alpha2/rolebasedgroupset_types.go. The generated code attempts to copy an ObjectMeta field, but the type definition contains separate Labels and Annotations fields. This will cause a compilation error and indicates that make generate was likely run against a different version of the type definition file.
Please ensure the type definitions are consistent and regenerate the deepcopy functions. The v1alpha1 change to use an embedded metav1.ObjectMeta in RoleBasedGroupTemplateSpec is a good pattern to follow for v1alpha2 as well for consistency and correctness.
api/workloads/v1alpha1/rolebasedgroupset_types.go (23-32)
While this change to use metav1.ObjectMeta is a good improvement for v1alpha1, it seems inconsistent with the type definition for v1alpha2 as seen in api/workloads/v1alpha2/rolebasedgroupset_types.go. The v1alpha2 version uses separate Labels and Annotations fields. For consistency between API versions, it would be best if v1alpha2 also adopted the embedded metav1.ObjectMeta pattern. This is also related to a compilation issue I've noted on api/workloads/v1alpha2/zz_generated.deepcopy.go.
examples/inference/pd-disagg-standalone.yaml (68-71)
The --prefill and --decode arguments for the router container are hardcoded to point to the first pod of each role (e.g., sglang-pd-inference-prefill-0...). This approach is not robust for scaling, as the router will not be aware of any additional replicas created by HPA. Using the Kubernetes service DNS name (e.g., sglang-pd-inference-prefill:8001) would be a more scalable approach, allowing for load balancing across all available pods.
- --prefill
- "sglang-pd-inference-prefill:8001"
- --decode
- "sglang-pd-inference-decode:8002"doc/TOC.md (33-34)
The links to the PD-Disagg examples appear to be broken. The files examples/pd-disagg/sglang/sgl.md and examples/pd-disagg/dynamo/README.md were moved to the examples/deprecated/ directory, but the new links point to a non-existent path under examples/inference/pd-disagg/. Please update these links to point to the correct file locations.
doc/features/exclusive-topology.md (39)
The link to rbgs-exclusive-topology.yaml appears to be broken. The path was updated to ../../examples/workload/rbgs/exclusive-topology.yaml, but based on the new file structure, the correct path seems to be ../../examples/basic/rbgs/rbgs-exclusive-topology.yaml. Please correct the link.
on a topology domain. See example [rbgs-exclusive-topology.yaml](../../examples/basic/rbgs/exclusive-topology.yaml).
examples/basic/coordinated-policy/coordinated-scaling.yaml (131-132)
There is a discrepancy between the comment and the code. The comment on line 131 indicates that OrderReady is used for safe scaling, but the progression field on line 132 is set to OrderScheduled. To align with the comment's intent, please change this to OrderReady.
# OrderReady: wait for all pods in batch to be ready before next batch
progression: OrderReadyexamples/basic/rbg/base.yaml (1-32)
This file, examples/basic/rbg/base.yaml, has content identical to examples/basic/rbg/patterns/standalone-pattern.yaml. Having duplicate examples can be confusing and increases maintenance overhead. Please consider removing one of them to avoid redundancy.
examples/basic/rbg/role-temlate/rbg-with-roletemplates.yaml (1)
There appears to be a typo in the directory name: role-temlate should be role-template. Please correct the directory name.
examples/basic/rbg/scheduling/scheduler-plugins-gang.yaml (16-18)
The comment on line 9 states that the rbg.workloads.x-k8s.io/group-gang-scheduling: "true" annotation is required to enable gang scheduling, but it is commented out in the example. This is confusing for users. If the annotation is required, it should be uncommented.
annotations:
rbg.workloads.x-k8s.io/group-gang-scheduling: "true"
# rbg.workloads.x-k8s.io/group-gang-scheduling-timeout: "120"examples/basic/rbg/scheduling/volcano-gang.yaml (18-21)
The comment on line 10 states that the rbg.workloads.x-k8s.io/group-gang-scheduling: "true" annotation is required to enable Volcano gang scheduling, but it is commented out in the example. This is confusing. If the annotation is required, it should be uncommented to provide a working example.
annotations:
rbg.workloads.x-k8s.io/group-gang-scheduling: "true"
# rbg.workloads.x-k8s.io/group-gang-scheduling-volcano-queue: "default"
# rbg.workloads.x-k8s.io/group-gang-scheduling-volcano-priority: "high-priority"There was a problem hiding this comment.
Pull request overview
This PR expands the repository’s v1alpha2 documentation and example manifests, particularly around inference deployments and advanced RBG features (gang scheduling, coordinated policies, scaling adapter, patterns), while also adding a set of deprecated v1alpha1-era example manifests for reference.
Changes:
- Added/updated a broad set of v1alpha2 example YAMLs (RBG/RBGS patterns, update strategies, scheduling, scaling, coordinated policy, engine runtime profiles, inference PD-disagg).
- Added a deprecated examples tree containing prior v1alpha1 and related manifests (single-node, multi-node, PD-disagg, mooncake, monitoring, model download, InstanceSet examples).
- Updated docs to reference new/relocated examples and added clarifying package-level comments for scheduler implementations.
Reviewed changes
Copilot reviewed 25 out of 68 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/scheduler/volcano/manager.go | Adds package-level guidance on enabling Volcano gang scheduling via controller/Helm configuration. |
| pkg/scheduler/k8s-scheduler-plugin/manager.go | Adds package-level guidance noting scheduler-plugins as default and how to switch to Volcano. |
| examples/inference/agg-standalone.yaml | Added placeholder (currently empty) aggregated inference example manifest. |
| examples/inference/agg-leader-worker.yaml | Added placeholder (currently empty) aggregated leader/worker inference example manifest. |
| examples/inference/pd-disagg-standalone.yaml | Adds a v1alpha2 PD-disaggregated SGLang standalone-pattern example (router/prefill/decode) + CoordinatedPolicy. |
| examples/inference/pd-disagg-leader-worker.yaml | Adds a v1alpha2 PD-disaggregated leaderWorkerPattern example with roleTemplates. |
| examples/deprecated/single-node/vllm.yaml | Adds deprecated single-node vLLM StatefulSet + Service example. |
| examples/deprecated/single-node/sglang.yaml | Adds deprecated single-node SGLang StatefulSet + Service example. |
| examples/deprecated/rbgs/rbgs-base.yaml | Adds deprecated v1alpha1 RoleBasedGroupSet base example. |
| examples/deprecated/rbgs/exclusive-topology.yaml | Adds deprecated v1alpha1 RoleBasedGroupSet exclusive topology example. |
| examples/deprecated/pd-disagg/sglang/sglang-pd.yaml | Adds deprecated v1alpha1 PD-disagg SGLang example using Patio runtime profile. |
| examples/deprecated/pd-disagg/sglang/sgl.md | Adds deprecated markdown walkthrough for SGLang PD-disagg. |
| examples/deprecated/pd-disagg/sglang/patio-profile.yaml | Adds deprecated ClusterEngineRuntimeProfile example for Patio runtime. |
| examples/deprecated/pd-disagg/sglang/model.yaml | Adds deprecated OSS-backed PV/PVC + secret example for model distribution. |
| examples/deprecated/pd-disagg/dynamo/nats.yaml | Adds deprecated NATS deployment/service dependency for Dynamo PD-disagg example. |
| examples/deprecated/pd-disagg/dynamo/etcd.yaml | Adds deprecated etcd StatefulSet/service dependency for Dynamo PD-disagg example. |
| examples/deprecated/pd-disagg/dynamo/dynamo.yaml | Adds deprecated v1alpha1 Dynamo PD-disagg RoleBasedGroup + Service example. |
| examples/deprecated/pd-disagg/dynamo/README.md | Adds deprecated markdown walkthrough for Dynamo PD-disagg. |
| examples/deprecated/multi-nodes/vllm.yaml | Adds deprecated multi-node vLLM LeaderWorkerSet example. |
| examples/deprecated/multi-nodes/sglang.yaml | Adds deprecated multi-node SGLang LeaderWorkerSet example. |
| examples/deprecated/mooncake/pd-disaggregated-with-mooncake.yaml | Adds deprecated v1alpha1 PD-disagg Mooncake-based example. |
| examples/deprecated/mooncake/aggregated-with-mooncake.yaml | Adds deprecated v1alpha1 aggregated Mooncake-based example. |
| examples/deprecated/mooncake/Dockerfile.mooncake | Adds deprecated Dockerfile to build a Mooncake-enabled base environment. |
| examples/deprecated/monitoring/podmonitor.yaml | Adds deprecated PodMonitor example for Prometheus scraping. |
| examples/deprecated/model-download/model-download-job.yaml | Adds deprecated job + RBAC example for model download/bootstrap. |
| examples/deprecated/instanceset/instanceset.yaml | Adds deprecated v1alpha1 InstanceSet example. |
| examples/deprecated/instanceset/instanceset-with-template.yaml | Adds deprecated v1alpha1 RBG with InstanceSet workload example. |
| examples/deprecated/instanceset/instanceset-with-lws.yaml | Adds deprecated v1alpha1 RBG with InstanceSet + leaderWorkerSet patching example. |
| examples/deprecated/instanceset/instanceset-with-components.yaml | Adds deprecated v1alpha1 RBG components example. |
| examples/deprecated/instanceset/instance.yaml | Adds deprecated v1alpha1 Instance example manifest. |
| examples/deprecated/coordination/scaling/coordination-scaling-simple.yaml | Adds deprecated v1alpha1 inline coordination scaling example. |
| examples/deprecated/coordination/scaling/coordination-scaling-order-ready.yaml | Adds deprecated v1alpha1 inline coordination scaling (OrderReady) example. |
| examples/deprecated/coordination/scaling/coordination-scaling-fast.yaml | Adds deprecated v1alpha1 inline coordination scaling (fast) example. |
| examples/deprecated/basics/volcano-gang.yaml | Adds deprecated v1alpha1 Volcano gang scheduling example. |
| examples/deprecated/basics/scheduler-plugins-gang.yaml | Adds deprecated v1alpha1 scheduler-plugins gang scheduling example. |
| examples/deprecated/basics/rolling-update.yaml | Adds deprecated v1alpha1 rolling update example. |
| examples/deprecated/basics/rolling-update-with-partition.yaml | Adds deprecated v1alpha1 rolling update with partition example. |
| examples/deprecated/basics/restart-policy.yaml | Adds deprecated v1alpha1 restart policy example. |
| examples/deprecated/basics/rbg-with-roletemplates.yaml | Adds deprecated v1alpha1 roleTemplates usage example. |
| examples/deprecated/basics/rbg-with-lws-workload.yaml | Adds deprecated v1alpha1 LWS workload example. |
| examples/deprecated/basics/rbg-base.yaml | Adds deprecated v1alpha1 base multi-role example. |
| examples/deprecated/basics/exclusive-topology.yaml | Adds deprecated v1alpha1 exclusive topology example. |
| examples/deprecated/basics/coordination.yaml | Adds deprecated v1alpha1 inline coordination rolling update example. |
| examples/basic/rbgs/rbgs-exclusive-topology.yaml | Adds v1alpha2 RBGS exclusive topology example using groupTemplate. |
| examples/basic/rbgs/rbgs-base.yaml | Adds v1alpha2 RBGS base example using groupTemplate. |
| examples/basic/rbg/update-strategy/rolling-update.yaml | Adds v1alpha2 rolling update strategy example (RecreatePod/InPlaceIfPossible/InPlaceOnly). |
| examples/basic/rbg/update-strategy/rolling-update-with-partition.yaml | Adds v1alpha2 rolling update partition + paused rollout example. |
| examples/basic/rbg/scheduling/volcano-gang.yaml | Adds v1alpha2 Volcano gang scheduling example + required controller configuration notes. |
| examples/basic/rbg/scheduling/scheduler-plugins-gang.yaml | Adds v1alpha2 scheduler-plugins gang scheduling example + notes. |
| examples/basic/rbg/scheduling/exclusive-topology.yaml | Adds v1alpha2 RBG exclusive topology example. |
| examples/basic/rbg/scaling/scaling-adapter-with-hpa.yaml | Adds v1alpha2 scaling adapter examples with HPA targeting adapters. |
| examples/basic/rbg/role-temlate/rbg-with-roletemplates.yaml | Adds v1alpha2 roleTemplates + templateRef.patch example. |
| examples/basic/rbg/restart-policy/restart-policy.yaml | Adds v1alpha2 restart policy example for multiple roles. |
| examples/basic/rbg/patterns/standalone-pattern.yaml | Adds v1alpha2 standalonePattern example. |
| examples/basic/rbg/patterns/leader-worker-pattern.yaml | Adds v1alpha2 leaderWorkerPattern example (plus a templateRef-based variant). |
| examples/basic/rbg/patterns/custom-components-pattern.yaml | Adds v1alpha2 customComponentsPattern example. |
| examples/basic/rbg/dependency/role-dependencies.yaml | Adds v1alpha2 role dependency examples (router-workers, PD chain, master-workers). |
| examples/basic/rbg/base.yaml | Adds v1alpha2 base multi-role example. |
| examples/basic/engine-runtime/engine-runtime-profile.yaml | Adds v1alpha2 ClusterEngineRuntimeProfile example + injection via engineRuntimes. |
| examples/basic/coordinated-policy/coordinated-scaling.yaml | Adds v1alpha2 coordinated scaling examples via CoordinatedPolicy CRD. |
| examples/basic/coordinated-policy/coordinated-rolling-update.yaml | Adds v1alpha2 coordinated rolling update example via CoordinatedPolicy CRD. |
| doc/quick_start.md | Updates PD-disagg example links. |
| doc/features/exclusive-topology.md | Updates RBGS exclusive topology example link. |
| doc/TOC.md | Updates PD-disagg example links in the TOC. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
6648130 to
37ff926
Compare
Ⅰ. Motivation
Ⅱ. Modifications
Ⅲ. Does this pull request fix one issue?
fixes #XXXX
Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.
Ⅴ. Describe how to verify it
VI. Special notes for reviews
Checklist
make fmt.