
📦 llm-d v0.4.0 Release Notes

This release of the llm-d repo captures the release of the project in its entirety: guides, components, and all.

Release Date: 2025-11-26


🧩 Component Summary

| Component | Version | Previous Version | Type |
| --- | --- | --- | --- |
| llm-d/llm-d-inference-scheduler | v0.4.0-rc.1 | v0.3.1 | Image |
| llm-d-incubation/llm-d-modelservice | v0.3.8 | v0.2.10 | Helm Chart |
| llm-d/llm-d-routing-sidecar | v0.4.0-rc.1 | v0.3.1 | Image |
| llm-d/llm-d-cuda | v0.4.0 | v0.3.1 | Image |
| llm-d/llm-d-aws | v0.4.0 | v0.3.1 | Image |
| llm-d/llm-d-xpu | v0.4.0 | v0.3.1 | Image |
| llm-d/llm-d-cpu | v0.4.0 | v0.3.1 | Image (New) |
| llm-d-incubation/llm-d-infra | v1.3.4 | v1.3.3 | Helm Chart |
| kubernetes-sig/gateway-api-inference-extension | v1.2.0-rc.1 | v1.0.1 | Helm Chart |
| llm-d/llm-d-workload-variant-autoscaler | v0.0.8 | NA (new) | Helm Chart + Image |

🔹 llm-d/llm-d-inference-scheduler

  • Description: The scheduler that makes optimized routing decisions for inference requests to the llm-d inference framework (a conceptual scoring sketch follows this list).
  • Diff: v0.3.1 → v0.4.0-rc.1
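
To make the routing idea concrete, here is a minimal conceptual sketch in Python of score-based endpoint picking. It is not the scheduler's actual plugin API (the real scheduler builds on the gateway-api-inference-extension framework); the scorer inputs, weights, and endpoint names are illustrative assumptions.

```python
# Conceptual illustration only: the real scheduler is built as plugins on the
# gateway-api-inference-extension framework. The scorer inputs, weights, and
# endpoint names below are hypothetical, not the project's actual API.
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    queue_depth: int          # outstanding requests on this vLLM pod
    prefix_cache_hit: float   # fraction of the prompt already cached (0.0-1.0)

def pick_endpoint(endpoints: list[Endpoint]) -> Endpoint:
    """Score each candidate endpoint and route to the highest-scoring one."""
    def score(ep: Endpoint) -> float:
        load_score = 1.0 / (1.0 + ep.queue_depth)  # prefer lightly loaded pods
        return 0.6 * ep.prefix_cache_hit + 0.4 * load_score
    return max(endpoints, key=score)

pods = [
    Endpoint("decode-0", queue_depth=4, prefix_cache_hit=0.9),
    Endpoint("decode-1", queue_depth=1, prefix_cache_hit=0.1),
]
print(pick_endpoint(pods).name)  # -> decode-0 (cache affinity outweighs load here)
```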

🔹 llm-d-incubation/llm-d-modelservice

  • Description: modelservice is a Helm chart that simplifies LLM deployment on llm-d by declaratively managing Kubernetes resources for serving base models. It enables reproducible, scalable, and tunable model deployments through modular presets, and clean integration with llm-d ecosystem components (including vLLM, Gateway API Inference Extension, LeaderWorkerSet).
  • Diff: v0.2.10 → v0.3.8

🔹 llm-d/llm-d-routing-sidecar

  • Description: A reverse proxy that redirects incoming requests to the prefill worker specified in the x-prefiller-host-port HTTP request header (illustrated in the sketch after this list).
  • Diff: v0.3.1 → v0.4.0-rc.1
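
As a minimal sketch of how a caller might target the sidecar, the snippet below sends a request carrying the x-prefiller-host-port header named above. The sidecar URL, prefill worker address, and request payload shape are assumptions for illustration.

```python
# Minimal sketch of addressing the sidecar. The header name comes from the release
# notes above; the sidecar URL, prefill worker address, and payload shape are
# illustrative assumptions, not documented defaults.
import requests

SIDECAR_URL = "http://localhost:8000/v1/completions"  # hypothetical sidecar address

resp = requests.post(
    SIDECAR_URL,
    headers={"x-prefiller-host-port": "prefill-worker-0:8000"},  # prefill target
    json={"model": "my-model", "prompt": "Hello, llm-d!", "max_tokens": 16},
    timeout=30,
)
print(resp.status_code, resp.json())
```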

🔹 llm-d/llm-d

  • Description: A midstreamed image of vllm-project/vllm for inferencing, supporting features such as PD disaggregation, KV cache awareness and more.
  • Diff: v0.3.1 → v0.4.0
  • Image Variants:
    • XPU: ghcr.io/llm-d/llm-d-xpu:v0.4.0
    • AWS: ghcr.io/llm-d/llm-d-aws:v0.4.0
    • CUDA: ghcr.io/llm-d/llm-d-cuda:v0.4.0
    • CPU: ghcr.io/llm-d/llm-d-cpu:v0.4.0

🔹 llm-d-incubation/llm-d-infra

  • Description: A Helm chart for deploying the gateway and gateway-related infrastructure assets for llm-d.
  • Diff: v1.3.3 → v1.3.4

🔹 kubernetes-sig/gateway-api-inference-extension

  • Description: A Helm chart to deploy an InferencePool, a corresponding EndpointPicker (epp) deployment, and any other related assets.
  • Diff: v1.0.1 → v1.2.0-rc.1

🔹 llm-d/llm-d-workload-variant-autoscaler (New - Experimental)

  • Description: A variant-optimization autoscaler for distributed inference workloads.
  • History (new): v0.0.5
  • Note: This is an experimental component being included in this release for early testing and feedback.

For more information on any of the component projects or versions, please check out their repos directly. For information on installing and using the new release, refer to our guides. Thank you to all contributors who helped make this happen. Automated release notes are included below, but note that they only track work in the main repo and do not fully reflect a changelog across the project.

What's Changed

New Contributors

Full Changelog: v0.3.1...v0.4.0

v0.3.1 Release

Release Date: 2025-11-06

Release overview

This release focused on following up on objectives from v0.3.0 that could not make it into that release. A few key stories to highlight:

  • ARM support
  • Refactor image build process to scripts
  • Unifying the GKE image into our core CUDA image
  • Adding AKS cloud provider support

Welcome to all our new contributors, and thanks to the team for their hard work.

Component version bumps:

  • Inference SIM (v0.5.1 → v0.6.1)
  • llm-d image (v0.3.0 → v0.3.1, diff encapsulated in the changelog below)

What's Changed

New Contributors

Full Changelog: v0.3.0...v0.3.1


📦 llm-d v0.3.0 Release Notes

This release of the llm-d repo captures the release of the project in its entirety: guides, components, and all.

Release Date: 2025-10-10


Core Objectives

This release had a few key objectives:

  • Increase support for specialized hardware backends (TPU, XPU)
  • Increase cloud provider support (DOKS)
  • Establish a metrics and observability story
  • Wide-ep optimizations (EPLB, DBO, Async Scheduling, etc.)

🧩 Component Summary

| Component | Version | Previous Version | Type |
| --- | --- | --- | --- |
| llm-d/llm-d-inference-scheduler | v0.3.2 | v0.2.1 | Image |
| llm-d-incubation/llm-d-modelservice | v0.2.10 | v0.2.0 | Helm Chart |
| llm-d/llm-d-routing-sidecar | v0.3.0 | v0.2.0 | Image |
| vllm-project/vllm | v0.11.0 | v0.10.0 | Editable install based on precompiled wheel |
| llm-d/llm-d-cuda | v0.3.0 | v0.2.0 | Image |
| llm-d/llm-d-gke | v0.3.0 | NA (new) | Image |
| llm-d/llm-d-aws | v0.3.0 | NA (new) | Image |
| llm-d/llm-d-xpu | v0.3.0 | NA (new) | Image |
| llm-d/llm-d-inference-sim | v0.5.1 | v0.3.0 | Image |
| llm-d-incubation/llm-d-infra | v1.3.3 | v1.1.1 | Helm Chart |
| kubernetes-sig/gateway-api-inference-extension | v1.0.1 | v0.5.1 | Helm Chart |
| llm-d/llm-d-kv-cache-manager | v0.3.2 | v0.2.0 | Go Package (consumed in inference-scheduler) |
| llm-d/llm-d-benchmark | v0.3.0 | v0.2.0 | Tooling and Image |

NOTE: In the future we want to support compatibility matrices. However, as we are still getting off the ground, we cannot ensure that these components work with legacy versions.


🔹 llm-d/llm-d-inference-scheduler

  • Description: The scheduler that makes optimized routing decisions for inference requests to the llm-d inference framework.
  • Diff: v0.2.1 → v0.3.2

🔹 llm-d-incubation/llm-d-modelservice

  • Description: modelservice is a Helm chart that simplifies LLM deployment on llm-d by declaratively managing Kubernetes resources for serving base models. It enables reproducible, scalable, and tunable model deployments through modular presets, and clean integration with llm-d ecosystem components (including vLLM, Gateway API Inference Extension, LeaderWorkerSet).
  • Diff: v0.2.0 → v0.2.10

🔹 llm-d/llm-d-routing-sidecar

  • Description: A reverse proxy redirecting incoming requests to the prefill worker specified in the x-prefiller-host-port HTTP request header.
  • Diff: v0.2.0 → v0.3.0

🔹 vllm-project/vllm (upstream)

  • Description: vLLM is a fast and easy-to-use library for LLM inference and serving. This project is the inferencing engine that forms the upstream of our llm-d/llm-d image.
  • Diff: v0.10.0 → v0.11.0

🔹 llm-d/llm-d

  • Description: A midstreamed image of vllm-project/vllm for inferencing, supporting features such as PD disaggregation, KV cache awareness and more.
  • Diff: v0.2.0 → v0.3.0
    • Image Variants:
    • XPU: ghcr.io/llm-d/llm-d-xpu:v0.3.0
    • AWS: ghcr.io/llm-d/llm-d-aws:v0.3.0
      • Release v0.3.0 workaround for getting EFA to work
    • CUDA: ghcr.io/llm-d/llm-d-cuda:v0.3.0
    • GKE: ghcr.io/llm-d/llm-d-gke:v0.3.0
      • Release v0.3.0 workaround for running wide-ep on GKE with H200s

🔹 llm-d/llm-d-inference-sim

  • Description: A lightweight vLLM simulator that emulates responses to the HTTP REST endpoints of vLLM (see the request sketch after this list).
  • Diff: v0.3.0 → v0.5.1
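
As a quick illustration of how the simulator can stand in for vLLM, the sketch below queries a locally running instance over vLLM's OpenAI-compatible REST endpoints. The port and model name are assumptions; use whatever model name the simulator was started with.

```python
# Assumes the simulator is running locally and emulating vLLM's OpenAI-compatible
# REST endpoints; the port and model name are assumptions.
import requests

BASE = "http://localhost:8000"

print(requests.get(f"{BASE}/v1/models", timeout=10).json())

completion = requests.post(
    f"{BASE}/v1/chat/completions",
    json={
        "model": "my-model",  # whatever name the simulator was started with
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 8,
    },
    timeout=10,
)
print(completion.json())
```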

🔹 llm-d-incubation/llm-d-infra

  • Description: A Helm chart for deploying the gateway and gateway-related infrastructure assets for llm-d.
  • Diff: v1.1.1 → v1.3.3

🔹 kubernetes-sig/gateway-api-inference-extension

  • Description: A Helm chart to deploy an InferencePool, a corresponding EndpointPicker (epp) deployment, and any other related assets.
  • Diff: v0.5.1 → v1.0.1

🔹 llm-d/llm-d-kv-cache-manager

  • Description: This repository contains the llm-d-kv-cache-manager, a pluggable service designed to enable KV-Cache Aware Routing and lay the foundation for advanced, cross-node cache coordination in vLLM-based serving platforms. A conceptual scoring sketch follows this list.
  • Diff: v0.2.0 → v0.3.0
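
For intuition on what KV-cache-aware routing involves, here is a conceptual Python sketch (not the package's actual Go API): prompts are split into fixed-size token blocks, block hashes are chained so each hash identifies the whole prefix up to that block, and a pod is scored by how many leading blocks it already holds. The block size and index structure are assumptions.

```python
# Conceptual sketch only; the real component is a Go package with its own API.
# Block size and the shape of the per-pod index are assumptions.
import hashlib

BLOCK_SIZE = 16  # tokens per KV block (illustrative)

def prefix_block_hashes(token_ids: list[int]) -> list[str]:
    """Chained hashes of the prompt's full token blocks."""
    hashes, running = [], ""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        block = str(token_ids[i:i + BLOCK_SIZE])
        running = hashlib.sha256((running + block).encode()).hexdigest()
        hashes.append(running)
    return hashes

def cache_score(token_ids: list[int], pod_index: set[str]) -> float:
    """Fraction of the prompt's leading blocks already cached on a pod."""
    hashes = prefix_block_hashes(token_ids)
    hit = 0
    for h in hashes:
        if h not in pod_index:
            break
        hit += 1
    return hit / len(hashes) if hashes else 0.0
```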

🔹 llm-d/llm-d-benchmark

  • Description: This repository provides an automated workflow for benchmarking LLM inference using the llm-d stack. It includes tools for deployment, experiment execution, data collection, and teardown across multiple environments and deployment styles. A minimal measurement sketch follows this list.
  • Diff: v0.2.0 → v0.3.0
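
As a rough illustration of the kind of measurement the benchmark automates (it is not the llm-d-benchmark tool itself), the sketch below drives an OpenAI-compatible completions endpoint and reports p50 latency and request throughput. The endpoint URL, model name, and request count are assumptions.

```python
# Illustrative only: this is not the llm-d-benchmark tool, just the kind of
# measurement it automates. Endpoint URL, model name, and request count are assumptions.
import time
import requests

URL = "http://localhost:8000/v1/completions"
PAYLOAD = {"model": "my-model", "prompt": "Summarize llm-d in one sentence.", "max_tokens": 64}

latencies = []
start = time.perf_counter()
for _ in range(20):
    t0 = time.perf_counter()
    requests.post(URL, json=PAYLOAD, timeout=60).raise_for_status()
    latencies.append(time.perf_counter() - t0)
elapsed = time.perf_counter() - start

latencies.sort()
print(f"p50 latency: {latencies[len(latencies) // 2]:.3f}s")
print(f"throughput:  {len(latencies) / elapsed:.2f} req/s")
```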

For more information on any of the component projects or versions, please check out their repos directly. For information on installing and using the new release, refer to our guides. Thank you to all contributors who helped make this happen.


📦 llm-d v0.2.0 Release Notes

For information on installing and using the new release refer to our quickstarts.

Release Date: 2025-07-28


Core Objectives

This release had a few key objectives:

  • Migrate from monolithic to composable installs based on community feedback
  • Support wide expert parallelism cases of "one rank per node"
  • Align with upstream gateway-api-inference-extension helm charts

🧩 Component Summary

| Component | Version | Previous Version | Type |
| --- | --- | --- | --- |
| llm-d/llm-d-inference-scheduler | v0.2.1 | 0.0.4 | Image |
| llm-d/llm-d-model-service | NA (Deprecated) | 0.0.10 | Image |
| llm-d-incubation/llm-d-modelservice | v0.2.0 | NA (New) | Helm Chart |
| llm-d/llm-d-routing-sidecar | v0.2.0 | 0.0.6 | Image |
| llm-d/llm-d-deployer | NA (Deprecated) | 1.0.22 | Helm Chart |
| vllm-project/vllm | v0.10.0 | NA (built from fork) | Wheel installed in llm-d |
| llm-d/llm-d | v0.2.0 | 0.0.8 | Image |
| llm-d/llm-d-inference-sim | v0.3.0 | 0.0.4 | Image |
| llm-d-incubation/llm-d-infra | v1.1.1 | NA (New) | Helm Chart |
| kubernetes-sig/gateway-api-inference-extension | v0.5.1 | NA (New - external) | Image |
| llm-d/llm-d-kv-cache-manager | v0.2.0 | v0.1.0 | Go Package (consumed in inference-scheduler) |
| llm-d/llm-d-benchmark | v0.2.0 | v0.0.8 | Tooling and Image |

NOTE: In the future we want to support compatibility matrices. However, as we are still getting off the ground, we cannot ensure that these components work with legacy versions.


🔹 llm-d/llm-d-inference-scheduler

  • Description: The inference scheduler makes optimized routing decisions for inference requests to vLLM model servers. This component depends on the upstream gateway-api-inference-extension scheduling framework and includes features specific to vLLM.
  • Diff: 0.0.4 → v0.2.1
  • Upstream Changelog: since we bumped the upstream version of GIE, many changes do not show up in the diff. The following has been pulled from its release notes:

🔹 llm-d/llm-d-model-service (Deprecated)

  • Description: ModelService is a Kubernetes operator (CRD + controller) that enables the creation of vLLM pods and routing resources for a given model.
  • Status: This repo is being deprecated as a component of llm-d and has been archived.
  • Replacement: llm-d-incubation/llm-d-modelservice

🔹 llm-d-incubation/llm-d-modelservice (New)

  • Description: modelservice is a Helm chart that simplifies LLM deployment on llm-d by declaratively managing Kubernetes resources for serving base models. It enables reproducible, scalable, and tunable model deployments through modular presets, and clean integration with llm-d ecosystem components (including vLLM, Gateway API Inference Extension, LeaderWorkerSet).
  • History (new): v0.2.0

🔹 llm-d/llm-d-routing-sidecar

  • Description: A reverse proxy routing traffic between the inference-scheduler and the prefill and decode workers based on the x-prefiller-host-port HTTP request header.
  • Diff: 0.0.6 → v0.2.0

🔹 llm-d/llm-d-deployer (Deprecated)

  • Description: A repo containing examples, Helm charts, and release assets for llm-d.
  • Status: This repo is being deprecated as a component of llm-d; however, it will not be archived, so it can continue to support people who want to try the legacy install. It will be minimally maintained.
  • Replacement: llm-d-incubation/llm-d-infra

🔹 vllm-project/vllm (Upstream)

  • Description: vLLM is a fast and easy-to-use library for LLM inference and serving. This project is the inferencing engine that forms the upstream of our llm-d/llm-d image.
  • Release: v0.10.0

🔹 llm-d/llm-d

  • Description: A midstreamed image of vllm-project/vllm for inferencing, supporting features such as PD disaggregation, KV cache awareness and more.
  • Diff: 0.0.8 → v0.2.0

🔹 llm-d/llm-d-inference-sim

  • Description: A lightweight vLLM simulator that emulates responses to the HTTP REST endpoints of vLLM.
  • Diff: 0.0.4 → v0.3.0

🔹 llm-d-incubation/llm-d-infra (New)

  • Description: This repository includes examples, Helm charts, and release assets for llm-d-infra.
  • History (new): v1.1.1

🔹 kubernetes-sig/gateway-api-inference-extension (New - Upstream)

  • Note: The upstream project is a dependency of llm-d and we directly reference the published Helm charts. The release notes will not include all changes in this component from release to release; please consult the upstream release pages.
  • Description: A Helm chart to deploy an InferencePool, a corresponding EndpointPicker (epp) deployment, and any other related assets.
  • History (new - external): v0.5.1

🔹 llm-d/llm-d-kv-cache-manager

  • Description: This repository contains the llm-d-kv-cache-manager, a pluggable service designed to enable KV-Cache Aware Routing and lay the foundation for advanced, cross-node cache coordination in vLLM-based serving platforms.
  • Diff: v0.1.0 → v0.2.0

🔹 llm-d/llm-d-benchmark

  • Description: This repository provides an automated workflow for benchmarking LLM inference using the llm-d stack. It includes tools for deployment, experiment execution, data collection, and teardown across multiple environments and deployment styles.
  • Diff: v0.0.8 → v0.2.0

For more information on any of the component projects or versions, please check out their repos directly. Thank you to all contributors who helped make this happen.