Releases: llm-d/llm-d
Release v0.4.0
📦 llm-d v0.4.0 Release Notes
This release of the llm-d repo captures the release for the entire project: guides, components, and all.
Release Date: 2025-11-26
🧩 Component Summary
| Component | Version | Previous Version | Type |
|---|---|---|---|
| llm-d/llm-d-inference-scheduler | v0.4.0-rc.1 | v0.3.1 | Image |
| llm-d-incubation/llm-d-modelservice | v0.3.8 | v0.2.10 | Helm Chart |
| llm-d/llm-d-routing-sidecar | v0.4.0-rc.1 | v0.3.1 | Image |
| llm-d/llm-d-cuda | v0.4.0 | v0.3.1 | Image |
| llm-d/llm-d-aws | v0.4.0 | v0.3.1 | Image |
| llm-d/llm-d-xpu | v0.4.0 | v0.3.1 | Image |
| llm-d/llm-d-cpu | v0.4.0 | v0.3.1 | Image (New) |
| llm-d-incubation/llm-d-infra | v1.3.4 | v1.3.3 | Helm Chart |
| kubernetes-sig/gateway-api-inference-extension | v1.2.0-rc.1 | v1.0.1 | Helm Chart |
| llm-d/llm-d-workload-variant-autoscaler | v0.0.8 | NA (new) | Helm Chart + Image |
🔹 llm-d/llm-d-inference-scheduler
- Description: A scheduler that makes optimized routing decisions for inference requests in the llm-d inference framework.
- Diff: v0.3.1 → v0.4.0-rc.1
🔹 llm-d-incubation/llm-d-modelservice
- Description: `modelservice` is a Helm chart that simplifies LLM deployment on llm-d by declaratively managing Kubernetes resources for serving base models. It enables reproducible, scalable, and tunable model deployments through modular presets and clean integration with llm-d ecosystem components (including vLLM, Gateway API Inference Extension, and LeaderWorkerSet).
- Diff: v0.2.10 → v0.3.8
🔹 llm-d/llm-d-routing-sidecar
- Description: A reverse proxy redirecting incoming requests to the prefill worker specified in the `x-prefiller-host-port` HTTP request header (a minimal request sketch follows below).
- Diff: v0.3.1 → v0.4.0-rc.1
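As a rough illustration of the header-based handoff (not taken from the sidecar's documentation; the address, header value, and model name below are placeholders, and in a real deployment the inference scheduler normally injects the header), a request passing through the sidecar might look like this:

```python
import requests

# Hypothetical addresses for illustration only: the sidecar typically sits in
# front of a decode worker, and the scheduler picks the prefill worker and
# injects its host:port via the x-prefiller-host-port header.
SIDECAR_URL = "http://localhost:8000/v1/completions"   # assumed sidecar address
PREFILL_WORKER = "10.0.0.12:8000"                       # assumed prefill worker

resp = requests.post(
    SIDECAR_URL,
    headers={"x-prefiller-host-port": PREFILL_WORKER},
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",    # placeholder model name
        "prompt": "Hello, llm-d!",
        "max_tokens": 16,
    },
    timeout=60,
)
print(resp.status_code, resp.json())
```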
🔹 llm-d/llm-d
- Description: A midstreamed image of `vllm-project/vllm` for inferencing, supporting features such as PD disaggregation, KV cache awareness, and more.
- Diff: v0.3.1 → v0.4.0
- Image Variants:
  - XPU: `ghcr.io/llm-d/llm-d-xpu:v0.4.0`
  - AWS: `ghcr.io/llm-d/llm-d-aws:v0.4.0`
  - CUDA: `ghcr.io/llm-d/llm-d-cuda:v0.4.0`
  - CPU: `ghcr.io/llm-d/llm-d-cpu:v0.4.0`
🔹 llm-d-incubation/llm-d-infra
- Description: A Helm chart for deploying gateway and gateway-related infrastructure assets for llm-d.
- Diff: v1.3.3 → v1.3.4
🔹 kubernetes-sig/gateway-api-inference-extension
- Description: A Helm chart to deploy an InferencePool, a corresponding EndpointPicker (epp) deployment, and any other related assets.
- Diff: v1.0.1 → v1.2.0-rc.1
🔹 llm-d/llm-d-workload-variant-autoscaler (New - Experimental)
- Description: Variant optimization autoscaler for distributed inference workloads
- History (new): v0.0.5
- Note: This is an experimental component being included in this release for early testing and feedback.
For more information on any of the component projects or versions, please check out their repos directly. For information on installing and using the new release, refer to our guides. Thank you to all contributors who helped make this happen. Automated release notes are included below, but note that they only track work in the main repo and do not fully reflect a changelog across the project.
What's Changed
- Add umbrella kv cache offloading well-lit path folder structure by @liu-cong in #401
- Correct wide-ep resource requirements. by @liu-cong in #373
- add information about component testing by @Gregory-Pereira in #361
- doc(guides): Introduce standardized recipes for Gateway, InferencePool, and vLLM by @zetxqx in #444
- Fix a broken link in the cpu prefix cache readme by @smarterclayton in #451
- Add more GKE specific workarounds and known issues by @smarterclayton in #419
- Update SIGs documentation to remove outdated schedule details. by @petecheslock in #431
- Update links to deploying vLLM multi-host in stable docs by @smarterclayton in #436
- fix kutomization error and model flag error in cpu offloading. by @zetxqx in #453
- Add GKE B200 readme notes by @smarterclayton in #454
- doc: enrich the prefix-cache-storage vllm cpu native offloading with benchmark results by @zetxqx in #438
- Add CPU for llm-d Inference Scheduling by @ZhengHongming888 in #428
- Add cpu offloading example for GKE + LMCache by @dannawang0221 in #318
- Add tab format for better UX on the website by @liu-cong in #452
- Rename `prefix-cache-storage` to `tiered-prefix-cache` by @vMaroon in #468
- Remove the dockerfile.gke as it is no longer used by @smarterclayton in #462
- Token credentials fix + vLLM v0.11.1 by @Gregory-Pereira in #456
- guides: Make vLLM log more useful in inference-scheduling by @russellb in #439
- Inference scheduling support for Intel Gaudi accelerator by @poussa in #374
- Add JIT directories and model directories by @smarterclayton in #418
- Use markdown comments for Tabs support on docusaurus by @petecheslock in #474
- Highlight P/D benefits with throughput-interactivity tradeoff by @liu-cong in #472
- add benchmark results lmcache results and tuned epp scorers by @zetxqx in #457
- Add step by step guide for setting up p/d with TPU on GKE by @yangligt2 in #443
- refactor: restructure vllm recipe with base and overlay pattern by @diego-torres in #475
- [Build] Add FI JIT Cache to Image by @robertgshaw2-redhat in #482
- Add instructions to clone git repo and checkout the release by @liu-cong in #477
- Create CPU dockefile for PD and Inference Scheduling by @ZhengHongming888 in #465
- guides/prereq/client-setup/install-deps.sh - increment HELMFILE_VERSION to 1.2.1 by @herbertkb in #492
- docs: Addresses CPU support added in PR #428 by @aneeshkp in #466
- Infra, MS and GAIE bumps + istio change compat by @Gregory-Pereira in #459
- Update release version for cpu offloading guide by @liu-cong in #495
- enable TLS in monitoring for prom by @Gregory-Pereira in #496
- helmfile and supporting artifacts for wva by @clubanderson in #464
- updating LMCACHe to be non fork by @Gregory-Pereira in #501
- component bumps for WVA guide by @Gregory-Pereira in #502
- Build vLLM 0.11.2 + patches for 0.4 by @smarterclayton in #461
- Avoid defining LMCACHE_COMMIT_SHA in multiple places by @terrytangyuan in #503
- WVA guide integration targeting v0.4 by @mamy-CS in #470
- fixing AWS image by @Gregory-Pereira in #506
- remove pre-passing values for VLLM by @Gregory-Pereira in #507
New Contributors
- @zetxqx made their first contribution in #444
- @ZhengHongming888 made their first contribution in #428
- @dannawang0221 made their first contribution in #318
- @russellb made their first contribution in #439
- @poussa made their first contribution in #374
- @yangligt2 made their first contribution in #443
- @diego-torres made their first contribution in #475
- @herbertkb made their first contribution in #492
- @aneeshkp made their first contribution in #466
- @mamy-CS made their first contribution in #470
Full Changelog: v0.3.1...v0.4.0
v0.3.1 Release
Release overview
This release focused on following up on objectives from v0.3.0 that could not make it into that release. A few key stories to highlight:
- ARM support
- Refactor image build process to scripts
- Unifying the GKE image into our core CUDA image
- Adding AKS cloud provider support
Welcome to all our new contributors, and thanks to the team for their hard work.
Component version bumps:
- Inference SIM (`v0.5.1` --> `v0.6.1`)
- llm-d image (`v0.3.0` --> `v0.3.1`, diff encapsulated in the changelog below)
What's Changed
- [bugfix] changing filename to not reference old plugin by @Gregory-Pereira in #353
- Deprecate all InferenceModel in XPU guides by @yankay in #358
- version bump on vllm release v0.11.0 tag move by @Gregory-Pereira in #365
- Update SIGS.md owners by @petecheslock in #363
- Clear /dev/shm before process startup to prevent crashloops by @smarterclayton in #364
- Add hardware and platform support issue template by @Ayobami-00 in #359
- minor typo in the inference guide by @effi-ofer in #377
- docs: introduce AKS as a well-lit infra provider by @chewong in #335
- Correct # of measured output tokens / s by @smarterclayton in #350
- refactor dockerfiles to a set bash scripts by @wseaton in #324
- Add wide ep gke test by @rlakhtakia in #367
- Intel pd workflow + v0.3 lagging updates by @Gregory-Pereira in #310
- Use the vLLM image for xpu by @yuanwu2017 in #357
- Fix markdown-link-checker failed issue by @yuanwu2017 in #385
- feat: Add updated readiness probe for vLLM containers by @rajinator in #330
- Fix queries and load script in monitoring by @Hritik003 in #383
- Arm cuda support (from clean branch) by @wseaton in #382
- Fix deadlink error in markdown-link-check by @yuanwu2017 in #404
- Update bug report template by @Gregory-Pereira in #413
- Fix release image tagging by @Gregory-Pereira in #379
- Update monitoring install for CKS by @Gregory-Pereira in #375
- Patch nvshmem to avoid an uninitialized value passed to RoCE by @smarterclayton in #407
- [Docs] Fix InferencePool version number getting cut off in WideEP guide by @tlrmchlsmth in #414
- Fix broken monitoring dashboard link by @smarterclayton in #422
- Set correct variables for built NVSHMEM by @smarterclayton in #417
- Update DeepEP to a version with a patch for setting NVSHMEM HCA mappings to CUDA device by @smarterclayton in #397
- swap release to tag creation image tagging by @Gregory-Pereira in #423
- pr vs release tag by @Gregory-Pereira in #424
- enable cache busting by @Gregory-Pereira in #425
- release.tag_name does not exist for a tag not part of a release by @Gregory-Pereira in #427
- unify darwin/arm64 with other platforms when install helmfile in install-deps.sh by @yitingdc in #267
- Set LD_LIBRARY_PATH for nvshmem appropriately by @smarterclayton in #429
- Update GKE to align to UBI images by @smarterclayton in #415
- updating to tags for v0.3.1 release by @Gregory-Pereira in #432
- bugfixing by @Gregory-Pereira in #433
New Contributors
- @yankay made their first contribution in #358
- @Ayobami-00 made their first contribution in #359
- @effi-ofer made their first contribution in #377
- @chewong made their first contribution in #335
- @rlakhtakia made their first contribution in #367
- @rajinator made their first contribution in #330
- @Hritik003 made their first contribution in #383
- @yitingdc made their first contribution in #267
Full Changelog: v0.3.0...v0.3.1
v0.3.0
📦 llm-d v0.3.0 Release Notes
This release of the llm-d repo captures the release for the entire project: guides, components, and all.
Release Date: 2025-10-10
Core Objectives
This release had a few key objectives:
- Increase support for specialized hardware backends (TPU, XPU)
- Increase cloud provider support (DOKS)
- Establish a metrics and observability story
- Wide-ep optimizations (EPLB, DBO, Async Scheduling, etc.)
🧩 Component Summary
| Component | Version | Previous Version | Type |
|---|---|---|---|
| llm-d/llm-d-inference-scheduler | v0.3.2 | v0.2.1 | Image |
| llm-d-incubation/llm-d-modelservice | v0.2.10 | v0.2.0 | Helm Chart |
| llm-d/llm-d-routing-sidecar | v0.3.0 | v0.2.0 | Image |
| vllm-project/vllm | v0.11.0 | v0.10.0 | Editable install based on precompiled wheel |
| llm-d/llm-d-cuda | v0.3.0 | v0.2.0 | Image |
| llm-d/llm-d-gke | v0.3.0 | NA (new) | Image |
| llm-d/llm-d-aws | v0.3.0 | NA (new) | Image |
| llm-d/llm-d-xpu | v0.3.0 | NA (new) | Image |
| llm-d/llm-d-inference-sim | v0.5.1 | v0.3.0 | Image |
| llm-d-incubation/llm-d-infra | v1.3.3 | v1.1.1 | Helm Chart |
| kubernetes-sig/gateway-api-inference-extension | v1.0.1 | v0.5.1 | Helm Chart |
| llm-d/llm-d-kv-cache-manager | v0.3.2 | v0.2.0 | Go Package (consumed in inference-scheduler) |
| llm-d/llm-d-benchmark | v0.3.0 | v0.2.0 | Tooling and Image |
NOTE: In the future we want to support compatibility matrices. However, as we are still getting off the ground, we cannot ensure that these components work with legacy versions.
🔹 llm-d/llm-d-inference-scheduler
- Description: A scheduler that makes optimized routing decisions for inference requests in the llm-d inference framework.
- Diff: v0.2.1 → v0.3.2
🔹 llm-d-incubation/llm-d-modelservice
- Description: `modelservice` is a Helm chart that simplifies LLM deployment on llm-d by declaratively managing Kubernetes resources for serving base models. It enables reproducible, scalable, and tunable model deployments through modular presets and clean integration with llm-d ecosystem components (including vLLM, Gateway API Inference Extension, and LeaderWorkerSet).
- Diff: v0.2.0 → v0.2.10
🔹 llm-d/llm-d-routing-sidecar
- Description: A reverse proxy redirecting incoming requests to the prefill worker specified in the x-prefiller-host-port HTTP request header.
- Diff: v0.2.0 → v0.3.0
🔹 vllm-project/vllm (upstream)
- Description: `vLLM` is a fast and easy-to-use library for LLM inference and serving. This project is the inferencing engine that forms the upstream of our `llm-d/llm-d` image.
- Diff: v0.10.0 → v0.11.0
🔹 llm-d/llm-d
- Description: A midstreamed image of `vllm-project/vllm` for inferencing, supporting features such as PD disaggregation, KV cache awareness, and more.
- Diff: v0.2.0 → v0.3.0
- Image Variants:
  - XPU: `ghcr.io/llm-d/llm-d-xpu:v0.3.0`
  - AWS: `ghcr.io/llm-d/llm-d-aws:v0.3.0`
    - Release `v0.3.0` workaround for getting EFA to work
  - CUDA: `ghcr.io/llm-d/llm-d-cuda:v0.3.0`
  - GKE: `ghcr.io/llm-d/llm-d-gke:v0.3.0`
    - Release `v0.3.0` workaround for running wide-ep on GKE with H200s
🔹 llm-d/llm-d-inference-sim
- Description: A lightweight vLLM simulator that emulates responses to the HTTP REST endpoints of vLLM (a minimal request sketch follows below).
- Diff: v0.3.0 → v0.5.1
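As a minimal sketch of how the simulator can be exercised (the endpoint path follows vLLM's OpenAI-compatible API; the address, port, and model name are assumptions for illustration, not values from the simulator's docs):

```python
import requests

# The simulator mimics vLLM's HTTP REST endpoints, so a standard
# completions-style request can be used for testing without GPUs.
SIM_URL = "http://localhost:8000/v1/completions"  # assumed simulator address

resp = requests.post(
    SIM_URL,
    json={"model": "test-model", "prompt": "ping", "max_tokens": 8},  # placeholder values
    timeout=10,
)
print(resp.status_code, resp.json())
```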
🔹 llm-d-incubation/llm-d-infra
- Description: A Helm chart for deploying gateway and gateway-related infrastructure assets for llm-d.
- Diff: v1.1.1 → v1.3.3
🔹 kubernetes-sig/gateway-api-inference-extension
- Description: A Helm chart to deploy an InferencePool, a corresponding EndpointPicker (epp) deployment, and any other related assets.
- Diff: v0.5.1 → v1.0.1
🔹 llm-d/llm-d-kv-cache-manager
- Description: This repository contains the llm-d-kv-cache-manager, a pluggable service designed to enable KV-Cache Aware Routing and lay the foundation for advanced, cross-node cache coordination in vLLM-based serving platforms.
- Diff: v0.2.0 → v0.3.0
🔹 llm-d/llm-d-benchmark
- Description: This repository provides an automated workflow for benchmarking LLM inference using the llm-d stack. It includes tools for deployment, experiment execution, data collection, and teardown across multiple environments and deployment styles.
- Diff: v0.2.0 → v0.3.0
For more information on any of the component projects or versions, please check out their repos directly. For information on installing and using the new release, refer to our guides. Thank you to all contributors who helped make this happen.
v0.2.0
📦 llm-d v0.2.0 Release Notes
For information on installing and using the new release refer to our quickstarts.
Release Date: 2025-07-28
Core Objectives
This release had a few key objectives:
- Migrate from monolithic to composable installs based on community feedback
- Support wide expert parallelism cases of "one rank per node"
- Align with upstream gateway-api-inference-extension helm charts
🧩 Component Summary
| Component | Version | Previous Version | Type |
|---|---|---|---|
| llm-d/llm-d-inference-scheduler | v0.2.1 | 0.0.4 | Image |
| llm-d/llm-d-model-service | NA (Deprecated) | 0.0.10 | Image |
| llm-d-incubation/llm-d-modelservice | v0.2.0 | NA (New) | Helm Chart |
| llm-d/llm-d-routing-sidecar | v0.2.0 | 0.0.6 | Image |
| llm-d/llm-d-deployer | NA (Deprecated) | 1.0.22 | Helm Chart |
| vllm-project/vllm | v0.10.0 | NA (built from fork) | Wheel installed in llm-d |
| llm-d/llm-d | v0.2.0 | 0.0.8 | Image |
| llm-d/llm-d-inference-sim | v0.3.0 | 0.0.4 | Image |
| llm-d-incubation/llm-d-infra | v1.1.1 | NA (New) | Helm Chart |
| kubernetes-sig/gateway-api-inference-extension | v0.5.1 | NA (New - external) | Image |
| llm-d/llm-d-kv-cache-manager | v0.2.0 | v0.1.0 | Go Package (consumed in inference-scheduler) |
| llm-d/llm-d-benchmark | v0.2.0 | v0.0.8 | Tooling and Image |
NOTE: In the future we want to support compatibility matrices. However, as we are still getting off the ground, we cannot ensure that these components work with legacy versions.
🔹 llm-d/llm-d-inference-scheduler
- Description: The inference scheduler makes optimized routing decisions for inference requests to vLLM model servers. This component depends on the upstream gateway-api-inference-extension scheduling framework and includes features specific to vLLM.
- Diff: 0.0.4 → v0.2.1
- Upstream Changelog - since we bumped the upstream version of GIE many changes do not show up in the diff. The following has been pulled from release notes:
🔹 llm-d/llm-d-model-service (Deprecated)
- Description: `ModelService` is a Kubernetes operator (CRD + controller) that enables the creation of vLLM pods and routing resources for a given model.
- Status: This repo is being deprecated as a component of `llm-d` and has been archived.
- Replacement: `llm-d-incubation/llm-d-modelservice`
🔹 llm-d-incubation/llm-d-modelservice (New)
- Description: `modelservice` is a Helm chart that simplifies LLM deployment on llm-d by declaratively managing Kubernetes resources for serving base models. It enables reproducible, scalable, and tunable model deployments through modular presets and clean integration with llm-d ecosystem components (including vLLM, Gateway API Inference Extension, and LeaderWorkerSet).
- History (new): v0.2.0
🔹 llm-d/llm-d-routing-sidecar
- Description: A reverse proxy routing traffic between the `inference-scheduler` and the prefill and decode workers based on the `x-prefiller-host-port` HTTP request header.
- Diff: 0.0.6 → v0.2.0
🔹 llm-d/llm-d-deployer (Deprecated)
- Description: A repo containing examples, Helm charts, and release assets for `llm-d`.
- Status: This repo is being deprecated as a component of `llm-d`; however, the repo will not be archived so that it can support people who want to try the legacy install. It will be minimally maintained.
- Replacement: `llm-d-incubation/llm-d-infra`
🔹 vllm-project/vllm (Upstream)
- Description: `vLLM` is a fast and easy-to-use library for LLM inference and serving. This project is the inferencing engine that forms the upstream of our `llm-d/llm-d` image.
- Release: v0.10.0
🔹 llm-d/llm-d
- Description: A midstreamed image of `vllm-project/vllm` for inferencing, supporting features such as PD disaggregation, KV cache awareness, and more.
- Diff: 0.0.8 → v0.2.0
🔹 llm-d/llm-d-inference-sim
- Description: A lightweight vLLM simulator that emulates responses to the HTTP REST endpoints of vLLM.
- Diff: 0.0.4 → v0.3.0
🔹 llm-d-incubation/llm-d-infra (New)
- Description: This repository includes examples, Helm charts, and release assets for llm-d-infra.
- History (new): v1.1.1
🔹 kubernetes-sig/gateway-api-inference-extension (New - Upstream)
- Note: The upstream project is a dependency of llm-d, and we directly reference the published Helm charts. These release notes will not include all changes in this component from release to release; please consult the upstream release pages.
- Description: A Helm chart to deploy an InferencePool, a corresponding EndpointPicker (epp) deployment, and any other related assets.
- History (new - external): v0.5.1
🔹 llm-d/llm-d-kv-cache-manager
- Description: This repository contains the llm-d-kv-cache-manager, a pluggable service designed to enable KV-Cache Aware Routing and lay the foundation for advanced, cross-node cache coordination in vLLM-based serving platforms.
- Diff: v0.1.0 → v0.2.0
🔹 llm-d/llm-d-benchmark
- Description: This repository provides an automated workflow for benchmarking LLM inference using the llm-d stack. It includes tools for deployment, experiment execution, data collection, and teardown across multiple environments and deployment styles.
- Diff: v0.0.8 → v0.2.0
For more information on any of the component projects or versions, please check out their repos directly. Thank you to all contributors who helped make this happen.