chore(community): add examples for dynamo#255
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces several example YAML configurations for deploying the NVIDIA Dynamo SGLang runtime using the RoleBasedGroup v1alpha2 API, including aggregated and PD-disaggregated architectures for single and multi-node setups. The review feedback recommends improving the resilience of the backend roles by using RecreateRoleInstanceOnPodRestart instead of None for the restartPolicy and suggests pinning the NATS image to a specific version tag for better stability.
There was a problem hiding this comment.
Pull request overview
Adds Kubernetes example manifests under examples/inference/others/dynamo/ to demonstrate running NVIDIA Dynamo SGLang Runtime on the project’s workloads.x-k8s.io/v1alpha2 RoleBasedGroup API (aggregated and prefill/decode-disaggregated), including supporting ETCD/NATS deployments.
Changes:
- Added aggregated and PD-disaggregated RoleBasedGroup example manifests for Dynamo SGLang Runtime.
- Added “multi-node / tensor-parallel” variants intended to use
leaderWorkerPattern. - Added minimal ETCD and NATS manifests required by the Dynamo runtime.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| examples/inference/others/dynamo/agg.yaml | Aggregated (processor + backend) Dynamo inference example using standalonePattern and roleTemplates. |
| examples/inference/others/dynamo/pd-disagg.yaml | PD-disaggregated (processor + prefill + decode) Dynamo inference example using standalonePattern and roleTemplates. |
| examples/inference/others/dynamo/agg-multi-nodes.yaml | Tensor-parallel aggregated example intended to use leaderWorkerPattern. |
| examples/inference/others/dynamo/pd-disagg-multi-nodes.yaml | Tensor-parallel PD-disaggregated example intended to use leaderWorkerPattern. |
| examples/inference/others/dynamo/etcd.yaml | ETCD service + StatefulSet manifest for Dynamo coordination. |
| examples/inference/others/dynamo/nats.yaml | NATS service + deployment manifest for Dynamo coordination. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| leaderWorkerPattern: | ||
| size: 2 # 1 leader + 1 worker per instance | ||
| templateRef: | ||
| name: dynamo-base | ||
| patch: |
There was a problem hiding this comment.
leaderWorkerPattern roles cannot use templateRef/roleTemplates. The controller validation rejects templateRef when leaderWorkerPattern is set; use an inline leaderWorkerPattern.template (and optionally leaderTemplatePatch/workerTemplatePatch) instead of templateRef here, otherwise this example will fail to reconcile.
| leaderWorkerPattern: | ||
| size: 2 # 1 leader + 1 worker per instance | ||
| templateRef: | ||
| name: dynamo-base | ||
| patch: |
There was a problem hiding this comment.
leaderWorkerPattern does not support templateRef/roleTemplates (controller validation rejects it when leaderWorkerPattern is set). Switch this role to an inline leaderWorkerPattern.template (and use leaderTemplatePatch/workerTemplatePatch if needed) so the manifest can be applied successfully.
| leaderWorkerPattern: | ||
| size: 2 # 1 leader + 1 worker per instance | ||
| templateRef: | ||
| name: dynamo-base | ||
| patch: |
There was a problem hiding this comment.
Same issue as above: leaderWorkerPattern roles cannot use templateRef/roleTemplates per controller validation. Replace templateRef with an inline leaderWorkerPattern.template (and patches if needed) to keep the example functional.
| spec: | ||
| containers: | ||
| - name: nats | ||
| image: nats:latest |
There was a problem hiding this comment.
Using the floating nats:latest image tag makes this example non-reproducible and can unexpectedly break as upstream releases change. Pin the NATS image to a specific version (as done in other examples) to keep the manifest stable over time.
| image: nats:latest | |
| image: nats:2.10.14 |
| env: | ||
| - name: ETCDCTL_API | ||
| value: "3" | ||
| - name: ALLOW_NONE_AUTHENTICATION | ||
| value: "yes" | ||
| volumes: |
There was a problem hiding this comment.
This ETCD manifest explicitly enables ALLOW_NONE_AUTHENTICATION=yes, which is an insecure configuration if applied outside a local/dev cluster. At minimum, add a prominent note that this is for demo purposes only (or provide a secured/default-auth alternative manifest).
| spec: | ||
| containers: | ||
| - name: nats | ||
| image: nats:latest |
There was a problem hiding this comment.
Please pin this image to a specific version instead of nats:latest. For examples, a floating tag makes the manifest non-reproducible and can break unexpectedly when upstream changes. The older Dynamo example in this repo already uses a fixed NATS version, so it would be better to stay consistent here too.
Ⅰ. Motivation
Ⅱ. Modifications
Ⅲ. Does this pull request fix one issue?
fixes #XXXX
Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.
Ⅴ. Describe how to verify it
VI. Special notes for reviews
Checklist
make fmt.