Skip to content

chore(community): add examples for dynamo#255

Merged
cheyang merged 3 commits intosgl-project:mainfrom
Syspretor:chore/add-examples-for-dynamo
Apr 3, 2026
Merged

chore(community): add examples for dynamo#255
cheyang merged 3 commits intosgl-project:mainfrom
Syspretor:chore/add-examples-for-dynamo

Conversation

@Syspretor
Copy link
Copy Markdown
Collaborator

Ⅰ. Motivation

Ⅱ. Modifications

Ⅲ. Does this pull request fix one issue?

fixes #XXXX

Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

Ⅴ. Describe how to verify it

VI. Special notes for reviews

Checklist

  • Format your code make fmt.
  • Add unit tests or integration tests.
  • Update the documentation related to the change.

@Syspretor Syspretor changed the title Chore/add examples for dynamo chore(community): add examples for dynamo Apr 1, 2026
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces several example YAML configurations for deploying the NVIDIA Dynamo SGLang runtime using the RoleBasedGroup v1alpha2 API, including aggregated and PD-disaggregated architectures for single and multi-node setups. The review feedback recommends improving the resilience of the backend roles by using RecreateRoleInstanceOnPodRestart instead of None for the restartPolicy and suggests pinning the NATS image to a specific version tag for better stability.

Comment thread examples/inference/ecosystem/dynamo/agg-multi-nodes.yaml
Comment thread examples/inference/others/dynamo/nats.yaml Outdated
Comment thread examples/inference/ecosystem/dynamo/pd-disagg-multi-nodes.yaml
Comment thread examples/inference/ecosystem/dynamo/pd-disagg-multi-nodes.yaml
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds Kubernetes example manifests under examples/inference/others/dynamo/ to demonstrate running NVIDIA Dynamo SGLang Runtime on the project’s workloads.x-k8s.io/v1alpha2 RoleBasedGroup API (aggregated and prefill/decode-disaggregated), including supporting ETCD/NATS deployments.

Changes:

  • Added aggregated and PD-disaggregated RoleBasedGroup example manifests for Dynamo SGLang Runtime.
  • Added “multi-node / tensor-parallel” variants intended to use leaderWorkerPattern.
  • Added minimal ETCD and NATS manifests required by the Dynamo runtime.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
examples/inference/others/dynamo/agg.yaml Aggregated (processor + backend) Dynamo inference example using standalonePattern and roleTemplates.
examples/inference/others/dynamo/pd-disagg.yaml PD-disaggregated (processor + prefill + decode) Dynamo inference example using standalonePattern and roleTemplates.
examples/inference/others/dynamo/agg-multi-nodes.yaml Tensor-parallel aggregated example intended to use leaderWorkerPattern.
examples/inference/others/dynamo/pd-disagg-multi-nodes.yaml Tensor-parallel PD-disaggregated example intended to use leaderWorkerPattern.
examples/inference/others/dynamo/etcd.yaml ETCD service + StatefulSet manifest for Dynamo coordination.
examples/inference/others/dynamo/nats.yaml NATS service + deployment manifest for Dynamo coordination.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +110 to +114
leaderWorkerPattern:
size: 2 # 1 leader + 1 worker per instance
templateRef:
name: dynamo-base
patch:
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leaderWorkerPattern roles cannot use templateRef/roleTemplates. The controller validation rejects templateRef when leaderWorkerPattern is set; use an inline leaderWorkerPattern.template (and optionally leaderTemplatePatch/workerTemplatePatch) instead of templateRef here, otherwise this example will fail to reconcile.

Copilot uses AI. Check for mistakes.
Comment on lines +108 to +112
leaderWorkerPattern:
size: 2 # 1 leader + 1 worker per instance
templateRef:
name: dynamo-base
patch:
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leaderWorkerPattern does not support templateRef/roleTemplates (controller validation rejects it when leaderWorkerPattern is set). Switch this role to an inline leaderWorkerPattern.template (and use leaderTemplatePatch/workerTemplatePatch if needed) so the manifest can be applied successfully.

Copilot uses AI. Check for mistakes.
Comment on lines +173 to +177
leaderWorkerPattern:
size: 2 # 1 leader + 1 worker per instance
templateRef:
name: dynamo-base
patch:
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as above: leaderWorkerPattern roles cannot use templateRef/roleTemplates per controller validation. Replace templateRef with an inline leaderWorkerPattern.template (and patches if needed) to keep the example functional.

Copilot uses AI. Check for mistakes.
spec:
containers:
- name: nats
image: nats:latest
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the floating nats:latest image tag makes this example non-reproducible and can unexpectedly break as upstream releases change. Pin the NATS image to a specific version (as done in other examples) to keep the manifest stable over time.

Suggested change
image: nats:latest
image: nats:2.10.14

Copilot uses AI. Check for mistakes.
Comment on lines +40 to +45
env:
- name: ETCDCTL_API
value: "3"
- name: ALLOW_NONE_AUTHENTICATION
value: "yes"
volumes:
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ETCD manifest explicitly enables ALLOW_NONE_AUTHENTICATION=yes, which is an insecure configuration if applied outside a local/dev cluster. At minimum, add a prominent note that this is for demo purposes only (or provide a secured/default-auth alternative manifest).

Copilot uses AI. Check for mistakes.
spec:
containers:
- name: nats
image: nats:latest
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please pin this image to a specific version instead of nats:latest. For examples, a floating tag makes the manifest non-reproducible and can break unexpectedly when upstream changes. The older Dynamo example in this repo already uses a fixed NATS version, so it would be better to stay consistent here too.

Comment thread examples/inference/ecosystem/dynamo/etcd.yaml
Copy link
Copy Markdown
Collaborator

@cheyang cheyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@cheyang cheyang merged commit a2a0de1 into sgl-project:main Apr 3, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants