Releases: llm-d/llm-d
Release v0.4.0
📦 llm-d v0.4.0 Release Notes
This release of the llm-d repo captures the release for the entire project: guides, components, and all.
Release Date: 2025-11-26
🧩 Component Summary
| Component | Version | Previous Version | Type |
|---|---|---|---|
| llm-d/llm-d-inference-scheduler | v0.4.0-rc.1 | v0.3.1 | Image |
| llm-d-incubation/llm-d-modelservice | v0.3.8 | v0.2.10 | Helm Chart |
| llm-d/llm-d-routing-sidecar | v0.4.0-rc.1 | v0.3.1 | Image |
| llm-d/llm-d-cuda | v0.4.0 | v0.3.1 | Image |
| llm-d/llm-d-aws | v0.4.0 | v0.3.1 | Image |
| llm-d/llm-d-xpu | v0.4.0 | v0.3.1 | Image |
| llm-d/llm-d-cpu | v0.4.0 | v0.3.1 | Image (New) |
| llm-d-incubation/llm-d-infra | v1.3.4 | v1.3.3 | Helm Chart |
| kubernetes-sig/gateway-api-inference-extension | v1.2.0-rc.1 | v1.0.1 | Helm Chart |
| llm-d/llm-d-workload-variant-autoscaler | v0.0.8 | NA (new) | Helm Chart + Image |
🔹 llm-d/llm-d-inference-scheduler
- Description: A scheduler that makes optimized routing decisions for inference requests in the llm-d inference framework.
- Diff: v0.3.1 → v0.4.0-rc.1
🔹 llm-d-incubation/llm-d-modelservice
- Description: `modelservice` is a Helm chart that simplifies LLM deployment on llm-d by declaratively managing Kubernetes resources for serving base models. It enables reproducible, scalable, and tunable model deployments through modular presets and clean integration with llm-d ecosystem components (including vLLM, Gateway API Inference Extension, and LeaderWorkerSet).
- Diff: v0.2.10 → v0.3.8
🔹 llm-d/llm-d-routing-sidecar
- Description: A reverse proxy redirecting incoming requests to the prefill worker specified in the `x-prefiller-host-port` HTTP request header (a minimal request sketch follows below).
- Diff: v0.3.1 → v0.4.0-rc.1
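As a rough illustration of the header-based handoff (not taken from the sidecar's documentation; the address, header value, and model name below are placeholders, and in a real deployment the inference scheduler normally injects the header), a request passing through the sidecar might look like this:

```python
import requests

# Hypothetical addresses for illustration only: the sidecar typically sits in
# front of a decode worker, and the scheduler picks the prefill worker and
# injects its host:port via the x-prefiller-host-port header.
SIDECAR_URL = "http://localhost:8000/v1/completions"   # assumed sidecar address
PREFILL_WORKER = "10.0.0.12:8000"                       # assumed prefill worker

resp = requests.post(
    SIDECAR_URL,
    headers={"x-prefiller-host-port": PREFILL_WORKER},
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",    # placeholder model name
        "prompt": "Hello, llm-d!",
        "max_tokens": 16,
    },
    timeout=60,
)
print(resp.status_code, resp.json())
```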
🔹 llm-d/llm-d
- Description: A midstreamed image of `vllm-project/vllm` for inferencing, supporting features such as PD disaggregation, KV cache awareness, and more.
- Diff: v0.3.1 → v0.4.0
- Image Variants:
  - XPU: `ghcr.io/llm-d/llm-d-xpu:v0.4.0`
  - AWS: `ghcr.io/llm-d/llm-d-aws:v0.4.0`
  - CUDA: `ghcr.io/llm-d/llm-d-cuda:v0.4.0`
  - CPU: `ghcr.io/llm-d/llm-d-cpu:v0.4.0`
🔹 llm-d-incubation/llm-d-infra
- Description: A Helm chart for deploying gateway and gateway-related infrastructure assets for llm-d.
- Diff: v1.3.3 → v1.3.4
🔹 kubernetes-sig/gateway-api-inference-extension
- Description: A Helm chart to deploy an InferencePool, a corresponding EndpointPicker (epp) deployment, and any other related assets.
- Diff: v1.0.1 → v1.2.0-rc.1
🔹 llm-d/llm-d-workload-variant-autoscaler (New - Experimental)
- Description: Variant optimization autoscaler for distributed inference workloads
- History (new): v0.0.5
- Note: This is an experimental component being included in this release for early testing and feedback.
For more information on any of the component projects or versions, please check out their repos directly. For information on installing and using the new release, refer to our guides. Thank you to all contributors who helped make this happen. Automated release notes are included below, but note that they only track work in the main repo and do not fully reflect a changelog across the project.
What's Changed
- Add umbrella kv cache offloading well-lit path folder structure by @liu-cong in #401
- Correct wide-ep resource requirements. by @liu-cong in #373
- add information about component testing by @Gregory-Pereira in #361
- doc(guides): Introduce standardized recipes for Gateway, InferencePool, and vLLM by @zetxqx in #444
- Fix a broken link in the cpu prefix cache readme by @smarterclayton in #451
- Add more GKE specific workarounds and known issues by @smarterclayton in #419
- Update SIGs documentation to remove outdated schedule details. by @petecheslock in #431
- Update links to deploying vLLM multi-host in stable docs by @smarterclayton in #436
- fix kutomization error and model flag error in cpu offloading. by @zetxqx in #453
- Add GKE B200 readme notes by @smarterclayton in #454
- doc: enrich the prefix-cache-storage vllm cpu native offloading with benchmark results by @zetxqx in #438
- Add CPU for llm-d Inference Scheduling by @ZhengHongming888 in #428
- Add cpu offloading example for GKE + LMCache by @dannawang0221 in #318
- Add tab format for better UX on the website by @liu-cong in #452
- Rename `prefix-cache-storage` to `tiered-prefix-cache` by @vMaroon in #468
- Remove the dockerfile.gke as it is no longer used by @smarterclayton in #462
- Token credentials fix + vLLM v0.11.1 by @Gregory-Pereira in #456
- guides: Make vLLM log more useful in inference-scheduling by @russellb in #439
- Inference scheduling support for Intel Gaudi accelerator by @poussa in #374
- Add JIT directories and model directories by @smarterclayton in #418
- Use markdown comments for Tabs support on docusaurus by @petecheslock in #474
- Highlight P/D benefits with throughput-interactivity tradeoff by @liu-cong in #472
- add benchmark results lmcache results and tuned epp scorers by @zetxqx in #457
- Add step by step guide for setting up p/d with TPU on GKE by @yangligt2 in #443
- refactor: restructure vllm recipe with base and overlay pattern by @diego-torres in #475
- [Build] Add FI JIT Cache to Image by @robertgshaw2-redhat in #482
- Add instructions to clone git repo and checkout the release by @liu-cong in #477
- Create CPU dockefile for PD and Inference Scheduling by @ZhengHongming888 in #465
- guides/prereq/client-setup/install-deps.sh - increment HELMFILE_VERSION to 1.2.1 by @herbertkb in #492
- docs: Addresses CPU support added in PR #428 by @aneeshkp in #466
- Infra, MS and GAIE bumps + istio change compat by @Gregory-Pereira in #459
- Update release version for cpu offloading guide by @liu-cong in #495
- enable TLS in monitoring for prom by @Gregory-Pereira in #496
- helmfile and supporting artifacts for wva by @clubanderson in #464
- updating LMCACHe to be non fork by @Gregory-Pereira in #501
- component bumps for WVA guide by @Gregory-Pereira in #502
- Build vLLM 0.11.2 + patches for 0.4 by @smarterclayton in #461
- Avoid defining LMCACHE_COMMIT_SHA in multiple places by @terrytangyuan in #503
- WVA guide integration targeting v0.4 by @mamy-CS in #470
- fixing AWS image by @Gregory-Pereira in #506
- remove pre-passing values for VLLM by @Gregory-Pereira in #507
New Contributors
- @zetxqx made their first contribution in #444
- @ZhengHongming888 made their first contribution in #428
- @dannawang0221 made their first contribution in #318
- @russellb made their first contribution in #439
- @poussa made their first contribution in #374
- @yangligt2 made their first contribution in #443
- @diego-torres made their first contribution in #475
- @herbertkb made their first contribution in #492
- @aneeshkp made their first contribution in #466
- @mamy-CS made their first contribution in #470
Full Changelog: v0.3.1...v0.4.0
v0.3.1 Release
Release overview
This release focused on following up on objectives from v0.3.0 that could not make it into that release. A few key stories to highlight:
- ARM support
- Refactor image build process to scripts
- Unifying the GKE image into our core CUDA image
- Adding AKS cloud provider support
Welcome to all our new contributors, and thanks to the team for their hard work.
Component version bumps:
- Inference SIM (`v0.5.1` --> `v0.6.1`)
- llm-d image (`v0.3.0` --> `v0.3.1`, diff encapsulated in the changelog below)
What's Changed
- [bugfix] changing filename to not reference old plugin by @Gregory-Pereira in #353
- Deprecate all InferenceModel in XPU guides by @yankay in #358
- version bump on vllm release v0.11.0 tag move by @Gregory-Pereira in #365
- Update SIGS.md owners by @petecheslock in #363
- Clear /dev/shm before process startup to prevent crashloops by @smarterclayton in #364
- Add hardware and platform support issue template by @Ayobami-00 in #359
- minor typo in the inference guide by @effi-ofer in #377
- docs: introduce AKS as a well-lit infra provider by @chewong in #335
- Correct # of measured output tokens / s by @smarterclayton in #350
- refactor dockerfiles to a set bash scripts by @wseaton in #324
- Add wide ep gke test by @rlakhtakia in #367
- Intel pd workflow + v0.3 lagging updates by @Gregory-Pereira in #310
- Use the vLLM image for xpu by @yuanwu2017 in #357
- Fix markdown-link-checker failed issue by @yuanwu2017 in #385
- feat: Add updated readiness probe for vLLM containers by @rajinator in #330
- Fix queries and load script in monitoring by @Hritik003 in #383
- Arm cuda support (from clean branch) by @wseaton in #382
- Fix deadlink error in markdown-link-check by @yuanwu2017 in #404
- Update bug report template by @Gregory-Pereira in #413
- Fix release image tagging by @Gregory-Pereira in #379
- Update monitoring install for CKS by @Gregory-Pereira in #375
- Patch nvshmem to avoid an uninitialized value passed to RoCE by @smarterclayton in #407
- [Docs] Fix InferencePool version number getting cut off in WideEP guide by @tlrmchlsmth in #414
- Fix broken monitoring dashboard link by @smarterclayton in #422
- Set correct variables for built NVSHMEM by @smarterclayton in #417
- Update DeepEP to a version with a patch for setting NVSHMEM HCA mappings to CUDA device by @smarterclayton in #397
- swap release to tag creation image tagging by @Gregory-Pereira in #423
- pr vs release tag by @Gregory-Pereira in #424
- enable cache busting by @Gregory-Pereira in #425
- release.tag_name does not exist for a tag not part of a release by @Gregory-Pereira in #427
- unify darwin/arm64 with other platforms when install helmfile in install-deps.sh by @yitingdc in #267
- Set LD_LIBRARY_PATH for nvshmem appropriately by @smarterclayton in #429
- Update GKE to align to UBI images by @smarterclayton in #415
- updating to tags for v0.3.1 release by @Gregory-Pereira in #432
- bugfixing by @Gregory-Pereira in #433
New Contributors
- @yankay made their first contribution in #358
- @Ayobami-00 made their first contribution in #359
- @effi-ofer made their first contribution in #377
- @chewong made their first contribution in #335
- @rlakhtakia made their first contribution in #367
- @rajinator made their first contribution in #330
- @Hritik003 made their first contribution in #383
- @yitingdc made their first contribution in #267
Full Changelog: v0.3.0...v0.3.1
v0.3.0
📦 llm-d v0.3.0 Release Notes
This release of the llm-d repo captures the release for the entire project: guides, components, and all.
Release Date: 2025-10-10
Core Objectives
This release had a few key objectives:
- Increase support for specialized hardware backends (TPU, XPU)
- Increase cloud provider support (DOKS)
- Establish a metrics and observability story
- Wide-ep optimizations (EPLB, DBO, Async Scheduling, etc.)
🧩 Component Summary
| Component | Version | Previous Version | Type |
|---|---|---|---|
| llm-d/llm-d-inference-scheduler | v0.3.2 | v0.2.1 | Image |
| llm-d-incubation/llm-d-modelservice | v0.2.10 | v0.2.0 | Helm Chart |
| llm-d/llm-d-routing-sidecar | v0.3.0 | v0.2.0 | Image |
| vllm-project/vllm | v0.11.0 | v0.10.0 | Editable install based on precompiled wheel |
| llm-d/llm-d-cuda | v0.3.0 | v0.2.0 | Image |
| llm-d/llm-d-gke | v0.3.0 | NA (new) | Image |
| llm-d/llm-d-aws | v0.3.0 | NA (new) | Image |
| llm-d/llm-d-xpu | v0.3.0 | NA (new) | Image |
| llm-d/llm-d-inference-sim | v0.5.1 | v0.3.0 | Image |
| llm-d-incubation/llm-d-infra | v1.3.3 | v1.1.1 | Helm Chart |
| kubernetes-sig/gateway-api-inference-extension | v1.0.1 | v0.5.1 | Helm Chart |
| llm-d/llm-d-kv-cache-manager | v0.3.2 | v0.2.0 | Go Package (consumed in inference-scheduler) |
| llm-d/llm-d-benchmark | v0.3.0 | v0.2.0 | Tooling and Image |
NOTE: In the future we want to support compatibility matrices. However, as we are still getting off the ground, we cannot ensure that these components work with legacy versions.
🔹 llm-d/llm-d-inference-scheduler
- Description: A scheduler that makes optimized routing decisions for inference requests in the llm-d inference framework.
- Diff: v0.2.1 → v0.3.2
🔹 llm-d-incubation/llm-d-modelservice
- Description: `modelservice` is a Helm chart that simplifies LLM deployment on llm-d by declaratively managing Kubernetes resources for serving base models. It enables reproducible, scalable, and tunable model deployments through modular presets and clean integration with llm-d ecosystem components (including vLLM, Gateway API Inference Extension, and LeaderWorkerSet).
- Diff: v0.2.0 → v0.2.10
🔹 llm-d/llm-d-routing-sidecar
- Description: A reverse proxy redirecting incoming requests to the prefill worker specified in the x-prefiller-host-port HTTP request header.
- Diff: v0.2.0 → v0.3.0
🔹 vllm-project/vllm (upstream)
- Description: `vLLM` is a fast and easy-to-use library for LLM inference and serving. This project is the inferencing engine that forms the upstream of our `llm-d/llm-d` image.
- Diff: v0.10.0 → v0.11.0
🔹 llm-d/llm-d
- Description: A midstreamed image of `vllm-project/vllm` for inferencing, supporting features such as PD disaggregation, KV cache awareness, and more.
- Diff: v0.2.0 → v0.3.0
- Image Variants:
  - XPU: `ghcr.io/llm-d/llm-d-xpu:v0.3.0`
  - AWS: `ghcr.io/llm-d/llm-d-aws:v0.3.0`
    - Release `v0.3.0` workaround for getting EFA to work
  - CUDA: `ghcr.io/llm-d/llm-d-cuda:v0.3.0`
  - GKE: `ghcr.io/llm-d/llm-d-gke:v0.3.0`
    - Release `v0.3.0` workaround for running wide-ep on GKE with H200s
🔹 llm-d/llm-d-inference-sim
- Description: A lightweight vLLM simulator that emulates responses to the HTTP REST endpoints of vLLM (a minimal request sketch follows below).
- Diff: v0.3.0 → v0.5.1
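As a minimal sketch of how the simulator can be exercised (the endpoint path follows vLLM's OpenAI-compatible API; the address, port, and model name are assumptions for illustration, not values from the simulator's docs):

```python
import requests

# The simulator mimics vLLM's HTTP REST endpoints, so a standard
# completions-style request can be used for testing without GPUs.
SIM_URL = "http://localhost:8000/v1/completions"  # assumed simulator address

resp = requests.post(
    SIM_URL,
    json={"model": "test-model", "prompt": "ping", "max_tokens": 8},  # placeholder values
    timeout=10,
)
print(resp.status_code, resp.json())
```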
🔹 llm-d-incubation/llm-d-infra
- Description: A Helm chart for deploying gateway and gateway-related infrastructure assets for llm-d.
- Diff: v1.1.1 → v1.3.3
🔹 kubernetes-sig/gateway-api-inference-extension
- Description: A Helm chart to deploy an InferencePool, a corresponding EndpointPicker (epp) deployment, and any other related assets.
- Diff: v0.5.1 → v1.0.1
🔹 llm-d/llm-d-kv-cache-manager
- Description: This repository contains the llm-d-kv-cache-manager, a pluggable service designed to enable KV-Cache Aware Routing and lay the foundation for advanced, cross-node cache coordination in vLLM-based serving platforms.
- Diff: v0.2.0 → v0.3.0
🔹 llm-d/llm-d-benchmark
- Description: This repository provides an automated workflow for benchmarking LLM inference using the llm-d stack. It includes tools for deployment, experiment execution, data collection, and teardown across multiple environments and deployment styles.
- Diff: v0.2.0 → v0.3.0
For more information on any of the component projects or versions, please check out their repos directly. For information on installing and using the new release, refer to our guides. Thank you to all contributors who helped make this happen.
v0.2.0
📦 llm-d v0.2.0 Release Notes
For information on installing and using the new release refer to our quickstarts.
Release Date: 2025-07-28
Core Objectives
This release had a few key objectives:
- Migrate from monolithic to composable installs based on community feedback
- Support wide expert parallelism cases of "one rank per node"
- Align with upstream gateway-api-inference-extension helm charts
🧩 Component Summary
| Component | Version | Previous Version | Type |
|---|---|---|---|
| llm-d/llm-d-inference-scheduler | v0.2.1 | 0.0.4 | Image |
| llm-d/llm-d-model-service | NA (Deprecated) | 0.0.10 | Image |
| llm-d-incubation/llm-d-modelservice | v0.2.0 | NA (New) | Helm Chart |
| llm-d/llm-d-routing-sidecar | v0.2.0 | 0.0.6 | Image |
| llm-d/llm-d-deployer | NA (Deprecated) | 1.0.22 | Helm Chart |
| vllm-project/vllm | v0.10.0 | NA (built from fork) | Wheel installed in llm-d |
| llm-d/llm-d | v0.2.0 | 0.0.8 | Image |
| llm-d/llm-d-inference-sim | v0.3.0 | 0.0.4 | Image |
| llm-d-incubation/llm-d-infra | v1.1.1 | NA (New) | Helm Chart |
| kubernetes-sig/gateway-api-inference-extension | v0.5.1 | NA (New - external) | Image |
| llm-d/llm-d-kv-cache-manager | v0.2.0 | v0.1.0 | Go Package (consumed in inference-scheduler) |
| llm-d/llm-d-benchmark | v0.2.0 | v0.0.8 | Tooling and Image |
NOTE: In the future we want to support compatibility matrices. However, as we are still getting off the ground, we cannot ensure that these components work with legacy versions.
🔹 llm-d/llm-d-inference-scheduler
- Description: The inference scheduler makes optimized routing decisions for inference requests to vLLM model servers. This component depends on the upstream gateway-api-inference-extension scheduling framework and includes features specific to vLLM.
- Diff: 0.0.4 → v0.2.1
- Upstream Changelog - since we bumped the upstream version of GIE many changes do not show up in the diff. The following has been pulled from release notes:
🔹 llm-d/llm-d-model-service (Deprecated)
- Description: `ModelService` is a Kubernetes operator (CRD + controller) that enables the creation of vLLM pods and routing resources for a given model.
- Status: This repo is being deprecated as a component of `llm-d` and has been archived.
- Replacement: `llm-d-incubation/llm-d-modelservice`
🔹 llm-d-incubation/llm-d-modelservice (New)
- Description: `modelservice` is a Helm chart that simplifies LLM deployment on llm-d by declaratively managing Kubernetes resources for serving base models. It enables reproducible, scalable, and tunable model deployments through modular presets and clean integration with llm-d ecosystem components (including vLLM, Gateway API Inference Extension, and LeaderWorkerSet).
- History (new): v0.2.0
🔹 llm-d/llm-d-routing-sidecar
- Description: A reverse proxy routing traffic between the `inference-scheduler` and the prefill and decode workers based on the `x-prefiller-host-port` HTTP request header.
- Diff: 0.0.6 → v0.2.0
🔹 llm-d/llm-d-deployer (Deprecated)
- Description: A repo containing examples, Helm charts, and release assets for `llm-d`.
- Status: This repo is being deprecated as a component of `llm-d`; however, the repo will not be archived so that it can support people who want to try the legacy install. It will be minimally maintained.
- Replacement: `llm-d-incubation/llm-d-infra`
🔹 vllm-project/vllm (Upstream)
- Description: `vLLM` is a fast and easy-to-use library for LLM inference and serving. This project is the inferencing engine that forms the upstream of our `llm-d/llm-d` image.
- Release: v0.10.0
🔹 llm-d/llm-d
- Description: A midstreamed image of `vllm-project/vllm` for inferencing, supporting features such as PD disaggregation, KV cache awareness, and more.
- Diff: 0.0.8 → v0.2.0
🔹 llm-d/llm-d-inference-sim
- Description: A lightweight vLLM simulator that emulates responses to the HTTP REST endpoints of vLLM.
- Diff: 0.0.4 → v0.3.0
🔹 llm-d-incubation/llm-d-infra (New)
- Description: This repository includes examples, Helm charts, and release assets for llm-d-infra.
- History (new): v1.1.1
🔹 kubernetes-sig/gateway-api-inference-extension (New - Upstream)
- Note: The upstream project is a dependency of llm-d, and we directly reference the published Helm charts. These release notes will not include all changes in this component from release to release; please consult the upstream release pages.
- Description: A Helm chart to deploy an InferencePool, a corresponding EndpointPicker (epp) deployment, and any other related assets.
- History (new - external): v0.5.1
🔹 llm-d/llm-d-kv-cache-manager
- Description: This repository contains the llm-d-kv-cache-manager, a pluggable service designed to enable KV-Cache Aware Routing and lay the foundation for advanced, cross-node cache coordination in vLLM-based serving platforms.
- Diff: v0.1.0 → v0.2.0
🔹 llm-d/llm-d-benchmark
- Description: This repository provides an automated workflow for benchmarking LLM inference using the llm-d stack. It includes tools for deployment, experiment execution, data collection, and teardown across multiple environments and deployment styles.
- Diff: v0.0.8 → v0.2.0
For more information on any of the component projects or versions, please check out their repos directly. Thank you to all contributors who helped make this happen.