chore: add examples for mooncake integration with v1alpha2 api #272
cheyang merged 1 commit into sgl-project:main
Code Review
This pull request introduces two new example configurations for LLM inference with SGLang and the Mooncake KV cache on Kubernetes, built on the RoleBasedGroup v1alpha2 API: one for aggregated deployment and one for Prefill-Decode (PD) disaggregated deployment. The feedback highlights a critical gap in the PD-disaggregated example: the decode role lacks the Mooncake connection settings and transfer-backend flags it needs to participate in the caching mechanism. In addition, the header of the aggregated example mentions a rolloutStrategy that is not actually implemented in the resource specification.
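The decode-role gap described above is the kind of omission a small pre-merge check can catch. The sketch below is hypothetical: it models each role as a plain dict with `args` and `env` (not the RoleBasedGroup schema) and requires that both PD roles set SGLang's `--disaggregation-transfer-backend mooncake` explicitly, as the review asks, and define `MOONCAKE_MASTER`. Treating `MOONCAKE_MASTER` as the required connection setting is an assumption borrowed from the standalone-store example's comments.

```python
# Hypothetical pre-merge check for the PD-disaggregated example.
# Assumption: each role is modelled as {"args": [...], "env": {...}};
# this is NOT the RoleBasedGroup schema, just a convenient stand-in.

TRANSFER_FLAG = "--disaggregation-transfer-backend"  # SGLang server flag
TRANSFER_VALUE = "mooncake"
REQUIRED_ENV = ["MOONCAKE_MASTER"]  # connection setting assumed to be required


def has_flag(args: list[str], flag: str, value: str) -> bool:
    """True if args contain `flag value` or `flag=value`."""
    for i, arg in enumerate(args):
        if arg == f"{flag}={value}":
            return True
        if arg == flag and i + 1 < len(args) and args[i + 1] == value:
            return True
    return False


def missing_mooncake_settings(roles: dict[str, dict]) -> dict[str, list[str]]:
    """Map each PD role (prefill/decode) to the Mooncake settings it lacks."""
    problems: dict[str, list[str]] = {}
    for name in ("prefill", "decode"):
        role = roles.get(name, {})
        issues = []
        if not has_flag(role.get("args", []), TRANSFER_FLAG, TRANSFER_VALUE):
            issues.append(f"missing {TRANSFER_FLAG} {TRANSFER_VALUE}")
        for var in REQUIRED_ENV:
            if var not in role.get("env", {}):
                issues.append(f"missing env {var}")
        if issues:
            problems[name] = issues
    return problems


if __name__ == "__main__":
    roles = {
        "prefill": {
            "args": ["--disaggregation-mode", "prefill", TRANSFER_FLAG, TRANSFER_VALUE],
            "env": {"MOONCAKE_MASTER": "s-example-mooncake-master:50051"},
        },
        # Decode role as described in the review: no backend flag, no connection env.
        "decode": {"args": ["--disaggregation-mode", "decode"], "env": {}},
    }
    print(missing_mooncake_settings(roles))
```

Only roles with problems appear in the result, so a fully configured pair of roles yields an empty dict.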
Pull request overview
Adds v1alpha2 RoleBasedGroup example manifests demonstrating SGLang + Mooncake KV cache deployments, intended to replace/modernize deprecated v1alpha1 Mooncake examples.
Changes:
- Added a v1alpha2 aggregated inference example using Mooncake (master/store/worker).
- Added a v1alpha2 PD-disaggregated inference example using Mooncake (master/store/prefill/decode/router).
- Used `roleTemplates` + `standalonePattern.templateRef.patch` to reduce duplication across roles.
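The template-plus-patch approach keeps one shared role definition and applies a small per-role override on top of it. The exact merge semantics of `standalonePattern.templateRef.patch` are not stated in this PR, so the sketch below uses RFC 7386 JSON Merge Patch purely as a stand-in; a Kubernetes strategic merge patch, for instance, merges some lists (such as containers keyed by name) instead of replacing them. The template fields are illustrative only.

```python
from typing import Any


def merge_patch(target: Any, patch: Any) -> Any:
    """Apply an RFC 7386 JSON Merge Patch and return a new value.

    Objects merge key by key, `None` (JSON null) deletes a key, and any
    non-object patch value (including lists) replaces the target wholesale.
    """
    if not isinstance(patch, dict):
        return patch
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


# Hypothetical shared template and per-role patches (field names are illustrative).
template = {
    "image": "example/sglang:tag",
    "args": ["--port", "8000"],
    "resources": {"gpu": 1, "memory": "64Gi"},
}
prefill = merge_patch(template, {"args": ["--disaggregation-mode", "prefill"]})
decode = merge_patch(template, {"resources": {"gpu": 2}})

print(prefill)
print(decode)
```

Note that the prefill patch replaces `args` entirely rather than appending to it, while the decode patch changes only `gpu` and keeps `memory`; the shared template itself is never mutated.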
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| examples/inference/ecosystem/mooncake/agg-with-mooncake.yaml | New aggregated Mooncake-backed SGLang example (master/store/worker + Service). |
| examples/inference/ecosystem/mooncake/pd-disagg-with-mooncake.yaml | New PD-disaggregated Mooncake-backed SGLang example (master/store/prefill/decode/router + Service). |
Signed-off-by: 玖宇 <[email protected]>
951ab61 to 033f932
Pull request overview
This PR adds new v1alpha2 example manifests demonstrating Mooncake integrations for inference (SGLang / vLLM PD-disaggregation) and for running Mooncake as a reusable KV-cache service.
Changes:
- Add PD-disaggregated inference examples using Mooncake transfer backend (SGLang + vLLM).
- Add v1alpha2 Mooncake KV-cache examples (standalone service, aggregated inference, PD-disaggregated inference with cache reuse).
- Standardize these examples on `RoleBasedGroup` v1alpha2 with `standalonePattern` and `roleTemplates`.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| examples/inference/ecosystem/mooncake/mooncake-transfer-engine/vllm-pd-disgg-with-mooncake-te.yaml | vLLM PD-disaggregation example using Mooncake transfer backend + router Service |
| examples/inference/ecosystem/mooncake/mooncake-transfer-engine/sgl-pd-disgg-with-mooncake-te.yaml | SGLang PD-disaggregation example using Mooncake transfer backend + router Service |
| examples/inference/ecosystem/mooncake/mooncake-store/standalone-mooncake-store.yaml | Standalone Mooncake master/store deployment to be reused by other RBGs |
| examples/inference/ecosystem/mooncake/mooncake-store/pd-disagg-kvcache-reuse-with-mooncake.yaml | PD-disaggregated SGLang example embedding Mooncake master/store for KV-cache reuse |
| examples/inference/ecosystem/mooncake/mooncake-store/agg-kvcache-reuse-with-mooncake.yaml | Aggregated SGLang example embedding Mooncake master/store for KV-cache reuse |
```yaml
# 2. In your inference RBG, configure the following environment variables:
#   - MOONCAKE_MASTER: "s-<rbg-name>-mooncake-master:50051"
#     Example: "s-mooncake-service-mooncake-master:50051"
#   - MOONCAKE_TE_META_DATA_SERVER: "http://s-<rbg-name>-mooncake-master:8080/metadata"
#     Example: "http://s-mooncake-service-mooncake-master:8080/metadata"
```
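Both variables are derived mechanically from the name of the RBG that runs the Mooncake master. The sketch below builds them as Kubernetes-style `name`/`value` env entries, assuming the `s-<rbg-name>-mooncake-master` naming and the 50051 (RPC) and 8080 (HTTP metadata) ports given in the comment. The function name is hypothetical. A short name like this resolves only within the same namespace; a cross-namespace setup would need the fully qualified DNS name instead.

```python
def mooncake_env(rbg_name: str) -> list[dict[str, str]]:
    """Build Kubernetes-style env entries pointing at a standalone Mooncake master.

    Assumes the name pattern `s-<rbg-name>-mooncake-master` and the ports
    50051 (RPC) and 8080 (HTTP metadata) from the example's comments.
    """
    master = f"s-{rbg_name}-mooncake-master"
    return [
        {"name": "MOONCAKE_MASTER", "value": f"{master}:50051"},
        {"name": "MOONCAKE_TE_META_DATA_SERVER", "value": f"http://{master}:8080/metadata"},
    ]


for entry in mooncake_env("mooncake-service"):
    print(f"{entry['name']}={entry['value']}")
```

For `rbg_name="mooncake-service"` this reproduces the two example values in the comment above.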
```yaml
# vLLM runtime with Mooncake as the KV transfer backend.
#
# Architecture:
#   - router: vLLM Router (NIXL-based) for request routing between prefill and decode
#
# Environment variables:
#   - VLLM_MOONCAKE_BOOTSTRAP_PORT: Port for Mooncake bootstrap server (default: 8998)
#     Required only for prefiller instances
```
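Because the bootstrap port is needed only on prefiller instances, the per-role env can be generated conditionally. In this sketch the role names `prefill` and `decode` and the function name are assumptions; the 8998 default comes from the comment above. The port is rendered as a string because Kubernetes env values are strings.

```python
DEFAULT_BOOTSTRAP_PORT = 8998  # default stated in the example's header comment


def vllm_mooncake_env(role: str, bootstrap_port: int = DEFAULT_BOOTSTRAP_PORT) -> list[dict[str, str]]:
    """Return the Mooncake-related env entries for one vLLM PD role.

    Per the example's comment, VLLM_MOONCAKE_BOOTSTRAP_PORT is required only
    on prefiller instances, so other roles get no entry. The role names
    "prefill" and "decode" are assumptions for this sketch.
    """
    if role != "prefill":
        return []
    # Kubernetes env values are strings, so the port is rendered with str().
    return [{"name": "VLLM_MOONCAKE_BOOTSTRAP_PORT", "value": str(bootstrap_port)}]


for role in ("prefill", "decode"):
    print(role, vllm_mooncake_env(role))
```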
```yaml
- --rpc_port
- "50051"
- --http_metadata_server_host
- $(POD_IP)
- --http_metadata_server_port
```
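`$(POD_IP)` in these args is not shell syntax: Kubernetes substitutes `$(VAR)` references in a container's command and args from that container's environment, leaves unresolved references unchanged, and collapses `$$` to a literal `$`. `POD_IP` is typically injected through the downward API (`fieldRef: status.podIP`), though the env section of this example is not shown here. The sketch approximates that substitution; restricting variable names to identifier-like characters is a simplification, and the truncated `--http_metadata_server_port` value is deliberately left out.

```python
import re

# Matches either an escaped "$$" or a "$(NAME)" reference.
_REF = re.compile(r"\$\$|\$\(([A-Za-z_][A-Za-z0-9_]*)\)")


def expand_args(args: list[str], env: dict[str, str]) -> list[str]:
    """Approximate how Kubernetes expands $(VAR) references in container args."""

    def repl(match: re.Match) -> str:
        if match.group(0) == "$$":
            return "$"  # "$$" collapses to a literal "$"
        # Unresolved references are left untouched, as Kubernetes does.
        return env.get(match.group(1), match.group(0))

    return [_REF.sub(repl, arg) for arg in args]


args = ["--rpc_port", "50051", "--http_metadata_server_host", "$(POD_IP)"]
print(expand_args(args, {"POD_IP": "10.0.0.7"}))
```

Binding the metadata server to the pod IP rather than a fixed hostname lets each master replica advertise its own address without a per-pod manifest.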
LGTM
Ⅰ. Motivation
Ⅱ. Modifications
Ⅲ. Does this pull request fix one issue?
fixes #XXXX
Ⅳ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.
Ⅴ. Describe how to verify it
VI. Special notes for reviews
Checklist
- `make fmt`