
Support Mooncake migration backend for PD disaggregation #3620

Merged: lvhan028 merged 14 commits into InternLM:main from Risc-lt:feat/mooncake on Jun 20, 2025

Risc-lt commented Jun 7, 2025 (Contributor)

PD-Disaggregated KVCache Transfer Pipeline with Mooncake

This PR introduces a new implementation of the Prefill-Decode (PD) disaggregated KVCache transfer pipeline in LMDeploy, using the native Mooncake Transfer Engine components as an alternative migration backend to dlslime. The goal is to enable disaggregated prefill/decode workloads across nodes for large-scale LLM inference, inspired by lmdeploy-distserve (see #3304 (comment)).


Architecture Overview

Interfaces

The Mooncake migration backend exposes the following interfaces:

  • p2p_initialize: Notify the Prefill and Decode engines to initialize a migration backend instance of the Mooncake Transfer Engine.
  • register_memory_region: Register a memory region for the connection.
  • endpoint_info: Return the local memory pool and endpoint configuration info.
  • p2p_connect: Receive endpoint information from the peer node on the other side of the connection.
  • p2p_migrate: Set up the connection between the prefill and decode nodes and transfer the KV cache synchronously in read mode.
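To make the interface list concrete, here is a minimal sketch of how such a backend contract could be expressed in Python, together with an in-process fake that mimics read-mode migration (the decode side pulls bytes from the prefill side's registered memory). The class names, method signatures, `MigrationAssignment`, and the `endpoint_info` fields are hypothetical and are not LMDeploy's actual `MooncakeBackend` API; the real backend moves data over RDMA through the Mooncake Transfer Engine instead of copying Python buffers.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class MigrationAssignment:
    """Hypothetical description of one KV-cache copy, in bytes."""
    remote_offset: int
    local_offset: int
    length: int


class MigrationBackend(ABC):
    """Hypothetical contract mirroring the five interfaces listed above."""

    @abstractmethod
    def p2p_initialize(self, link_id: str) -> None: ...

    @abstractmethod
    def register_memory_region(self, link_id: str, buffer: bytearray) -> None: ...

    @abstractmethod
    def endpoint_info(self, link_id: str) -> dict: ...

    @abstractmethod
    def p2p_connect(self, link_id: str, peer_info: dict) -> None: ...

    @abstractmethod
    def p2p_migrate(self, link_id: str, assignments: list) -> None: ...


# Stand-in for the network: endpoint name -> backend instance.
_FAKE_NETWORK = {}


class LoopbackBackend(MigrationBackend):
    """In-process fake: 'transfers' by copying Python buffers, not via RDMA."""

    def __init__(self, name: str):
        self.name = name
        self.regions = {}  # link_id -> registered buffer
        self.peers = {}    # link_id -> peer endpoint name
        _FAKE_NETWORK[name] = self

    def p2p_initialize(self, link_id):
        self.regions[link_id] = bytearray()

    def register_memory_region(self, link_id, buffer):
        self.regions[link_id] = buffer

    def endpoint_info(self, link_id):
        return {'endpoint': self.name, 'size': len(self.regions[link_id])}

    def p2p_connect(self, link_id, peer_info):
        self.peers[link_id] = peer_info['endpoint']

    def p2p_migrate(self, link_id, assignments):
        # Read mode: the local (decode) side pulls from the remote (prefill) region.
        remote = _FAKE_NETWORK[self.peers[link_id]].regions[link_id]
        local = self.regions[link_id]
        for a in assignments:
            src = remote[a.remote_offset:a.remote_offset + a.length]
            local[a.local_offset:a.local_offset + a.length] = src


# Usage: prefill holds the KV bytes; decode pulls them after the handshake.
prefill, decode = LoopbackBackend('prefill'), LoopbackBackend('decode')
prefill.p2p_initialize('link0')
decode.p2p_initialize('link0')
prefill.register_memory_region('link0', bytearray(b'KVCACHE!'))
decode.register_memory_region('link0', bytearray(8))
decode.p2p_connect('link0', prefill.endpoint_info('link0'))
decode.p2p_migrate('link0', [MigrationAssignment(remote_offset=0, local_offset=0, length=8)])
print(bytes(decode.regions['link0']))  # b'KVCACHE!'
```

Putting the pull logic on the decode side matches the "read mode" wording above: the prefill engine only has to register its memory and publish its endpoint info.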

Control Plane

[Figure: control-plane diagram (lmdeploy drawio)]

The proxy server first sends the endpoint info in a FastAPI POST request, notifying the prefill and decode servers to send their local endpoint info to each other over a TCP socket. Once the P2P connection is established, the Mooncake migration backend starts transferring the KV cache over the RDMA link.
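The step where each server sends its local endpoint info to the other over a TCP socket can be sketched with the standard library as a symmetric, length-prefixed JSON exchange. The framing (a 4-byte big-endian length followed by JSON), the helper names, and the `role`/`addr` fields are assumptions for illustration; the PR does not describe LMDeploy's actual wire format.

```python
import json
import socket
import struct
import threading


def _recv_exact(sock, n):
    """Read exactly n bytes, raising if the peer closes early."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError('peer closed the connection during the handshake')
        chunks.append(chunk)
        n -= len(chunk)
    return b''.join(chunks)


def exchange_endpoint_info(sock, local_info):
    """Send our endpoint info as length-prefixed JSON, then return the peer's."""
    payload = json.dumps(local_info).encode('utf-8')
    sock.sendall(struct.pack('!I', len(payload)) + payload)
    (length,) = struct.unpack('!I', _recv_exact(sock, 4))
    return json.loads(_recv_exact(sock, length).decode('utf-8'))


# Usage: a connected socket pair stands in for the prefill <-> decode TCP link.
prefill_sock, decode_sock = socket.socketpair()
received = {}  # side name -> endpoint info that side received from its peer
peer = threading.Thread(target=lambda: received.update(
    decode=exchange_endpoint_info(decode_sock, {'role': 'Decode', 'addr': 'decode-node:12345'})))
peer.start()
received['prefill'] = exchange_endpoint_info(
    prefill_sock, {'role': 'Prefill', 'addr': 'prefill-node:12345'})
peer.join()
prefill_sock.close()
decode_sock.close()
```

Both sides send before receiving, which is safe here because the messages are small enough to fit in the socket buffers; larger payloads would call for a non-blocking or role-ordered exchange.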

Workflow

[Figure: workflow diagram (lmdeploy2 drawio)]
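The PD workflow runs initialize, connect, prefill, migrate, and decode in that order (the sequence listed under Current Status). The guard below is a hypothetical illustration of enforcing that ordering; `Stage` and `PDWorkflow` are made-up names, not LMDeploy types, and the sketch simplifies by treating the whole sequence as one linear pipeline, whereas initialize and connect are per link and can be shared by many requests.

```python
from enum import IntEnum


class Stage(IntEnum):
    """PD stages, numbered in the order the workflow runs them."""
    INITIALIZE = 0
    CONNECT = 1
    PREFILL = 2
    MIGRATE = 3
    DECODE = 4


class PDWorkflow:
    """Records completed stages and rejects out-of-order steps."""

    def __init__(self, name):
        self.name = name
        self.completed = []

    def advance(self, stage):
        if self.done:
            raise RuntimeError(f'{self.name}: workflow already finished')
        expected = Stage(len(self.completed))
        if stage != expected:
            raise RuntimeError(f'{self.name}: expected {expected.name}, got {stage.name}')
        self.completed.append(stage)

    @property
    def done(self):
        return len(self.completed) == len(Stage)


wf = PDWorkflow('req-0')
for stage in Stage:  # iterates in definition order
    wf.advance(stage)
print(wf.done)  # True
```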


Current Status

  • Functionally validated on A10 GPUs with eRDMA providing RoCEv2 support.
  • All basic PD workflows (initialize $\Rightarrow$ connect $\Rightarrow$ prefill $\Rightarrow$ migrate $\Rightarrow$ decode) work as well as they did with the previous dlslime backend.

Next Steps

  • Check migration addresses to validate the quality of output tokens.
  • Improve KV cache transfer efficiency to surpass the dlslime backend.
  • Remove unnecessary testing logs.

How to Build

pip install mooncake-transfer-engine
pip install -v -e .

How to Run

Start Proxy

lmdeploy serve proxy \
    --server-name <proxy-ip-address> \
    --server-port 8000 \
    --routing-strategy "min_expected_latency" \
    --serving-strategy DistServe \
    --log-level INFO

Start Prefill Engine

CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server Qwen/Qwen2.5-7B-Instruct \
    --server-name <server-ip-address> \
    --server-port 23333 \
    --role Prefill \
    --proxy-url <proxy-ip-address:port> \
    --backend pytorch \
    --migration-backend Mooncake

Start Decode Engine

CUDA_VISIBLE_DEVICES=1 lmdeploy serve api_server Qwen/Qwen2.5-7B-Instruct \
    --server-name <server-ip-address> \
    --server-port 23334 \
    --role Decode \
    --proxy-url <proxy-ip-address:port> \
    --backend pytorch \
    --migration-backend Mooncake
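The prefill and decode commands differ only in GPU, port, and role. A hypothetical helper that assembles both argument lists from the same flags (for example, to hand to `subprocess.Popen`) makes that symmetry explicit. The helper `api_server_cmd` is an assumption for illustration; every flag and value is copied from the commands above.

```python
import os


def api_server_cmd(model, server_name, port, role, proxy_url, gpu):
    """Hypothetical helper: (argv, env) for one PD engine, using the flags above."""
    argv = [
        'lmdeploy', 'serve', 'api_server', model,
        '--server-name', server_name,
        '--server-port', str(port),
        '--role', role,
        '--proxy-url', proxy_url,
        '--backend', 'pytorch',
        '--migration-backend', 'Mooncake',
    ]
    # Pin each engine to its own GPU, as CUDA_VISIBLE_DEVICES=0/1 does above.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    return argv, env


model = 'Qwen/Qwen2.5-7B-Instruct'
prefill = api_server_cmd(model, '<server-ip-address>', 23333, 'Prefill',
                         '<proxy-ip-address:port>', gpu=0)
decode = api_server_cmd(model, '<server-ip-address>', 23334, 'Decode',
                        '<proxy-ip-address:port>', gpu=1)
# With real addresses filled in, each engine could then be started with
# subprocess.Popen(argv, env=env).
```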

Client Side

curl -X POST "<proxy-ip-address:port>/v1/completions" \
    -H "Content-Type: application/json" \
    -d '{"model":"Qwen/Qwen2.5-7B-Instruct","temperature":0,"prompt":"Shanghai is a city that ","max_tokens":16,"stream":false}'
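For Python clients, the same request can be built with the standard library. This is a sketch assuming the proxy is reachable over plain HTTP at the placeholder address; the payload fields are copied from the curl command, while `build_completion_request` is a hypothetical helper name and the response parsing in the comments assumes an OpenAI-style completions response.

```python
import json
import urllib.request


def build_completion_request(proxy, prompt, max_tokens=16):
    """Build the same POST as the curl example; `proxy` is 'host:port'."""
    body = {
        'model': 'Qwen/Qwen2.5-7B-Instruct',
        'temperature': 0,
        'prompt': prompt,
        'max_tokens': max_tokens,
        'stream': False,
    }
    return urllib.request.Request(
        url=f'http://{proxy}/v1/completions',
        data=json.dumps(body).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
        method='POST',
    )


req = build_completion_request('127.0.0.1:8000', 'Shanghai is a city that ')
# Sending requires the proxy plus the prefill and decode engines to be running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)['choices'][0]['text'])
```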

stmatengss commented

@Risc-lt You can use tools like ruff or yapf to automatically fix code formatting and linting issues.

lvhan028 commented Jun 8, 2025 (Collaborator)

The linting issues can be resolved as follows:

pip install pre-commit
cd lmdeploy # the root dir of lmdeploy repo
pre-commit run --all-files

Make sure that the Python version is 3.10.
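Since the hooks are expected to run under Python 3.10, a small pre-flight check such as the one below (hypothetical, not part of the lmdeploy repository) can flag a mismatched interpreter before running `pre-commit run --all-files`.

```python
import sys


def is_supported_lint_python(version_info=sys.version_info):
    """True if the interpreter is Python 3.10.x, the version recommended above."""
    return tuple(version_info[:2]) == (3, 10)


if not is_supported_lint_python():
    print('Warning: pre-commit is expected to run on Python 3.10, found '
          f'{sys.version_info.major}.{sys.version_info.minor}', file=sys.stderr)
```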

lvhan028 added the enhancement (New feature or request) label Jun 9, 2025
Review thread on lmdeploy/pytorch/disagg/backend/mooncake.py:
"""Initialize p2p connection for this specific link."""
# TODO: Support more types of metadata_server
# e.g. "etcd://192.168.0.137:2379"
metadata_server = 'P2PHANDSHAKE'

Collaborator:

What is metadata_server used for?

Reply:

There are two modes: (1) 'P2PHANDSHAKE' (a magic string), in which no metadata server maintains connection information; this mode is intended for small-scale PD disaggregation; and (2) a centralized metadata server (etcd, redis, or http_server) for larger-scale PD disaggregation.
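The two modes in this reply could be selected by a small helper like the one below. It is a sketch: `choose_metadata_server` and its arguments are hypothetical; 'P2PHANDSHAKE' and the `etcd://host:port` form come from the code excerpt above, while the `redis://` and `http://` forms are assumed by analogy and should be checked against the Mooncake documentation.

```python
from typing import Optional

P2P_HANDSHAKE = 'P2PHANDSHAKE'  # magic string: no metadata server
CENTRALIZED_SCHEMES = ('etcd', 'redis', 'http')  # redis/http URI forms are assumptions


def choose_metadata_server(scheme: Optional[str] = None, address: Optional[str] = None) -> str:
    """Return a metadata_server string for the Mooncake transfer engine.

    No scheme: peer-to-peer handshake mode (small-scale PD disaggregation).
    Otherwise: a centralized-server URI such as 'etcd://192.168.0.137:2379'.
    """
    if scheme is None:
        return P2P_HANDSHAKE
    if scheme not in CENTRALIZED_SCHEMES:
        raise ValueError(f'unsupported metadata server scheme: {scheme!r}')
    if not address:
        raise ValueError('a centralized metadata server needs an address')
    return f'{scheme}://{address}'
```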

try:
    from mooncake.engine import TransferEngine
except ImportError as e:
    raise ImportError('Please install mooncake by following the instructions at '

Collaborator:

When --migration-backend Mooncake is passed, it would be better to raise an ImportError immediately during API server launch if Mooncake is not installed.
Can we move the import into the constructor of MooncakeBackend?
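One way to follow this suggestion is to perform the import inside the backend's constructor, so that selecting the Mooncake backend without the package installed fails as soon as the server builds its migration backend. The `require_module` helper and this simplified `MooncakeBackend` are hypothetical; only `mooncake.engine.TransferEngine` and the `mooncake-transfer-engine` package name come from this PR.

```python
import importlib
from types import ModuleType


def require_module(name: str, install_hint: str) -> ModuleType:
    """Import `name` or raise an ImportError that tells the user how to fix it."""
    try:
        return importlib.import_module(name)
    except ImportError as e:
        raise ImportError(f'{name} is required but not installed; {install_hint}') from e


class MooncakeBackend:
    """Simplified stand-in: only the fail-fast import logic is shown."""

    def __init__(self):
        # Importing here surfaces a missing dependency when the backend is
        # constructed at server launch, not on the first migration.
        engine_mod = require_module(
            'mooncake.engine',
            'install it with `pip install mooncake-transfer-engine`')
        self._transfer_engine_cls = engine_mod.TransferEngine
```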

Risc-lt commented Jun 14, 2025 (Contributor, Author)

The problems above have been resolved. NVLink support will be covered in the next PR. cc @stmatengss @lvhan028 @JimyMa

lvhan028 merged commit 6cc314a into InternLM:main Jun 20, 2025
5 checks passed
grimoire pushed a commit to grimoire/lmdeploy that referenced this pull request Jun 26, 2025
* feat: init mooncake migration backend

* fix: mooncake endpoint and connect

* fix: add cli option for Mooncake

* feat: modify centralized mode into p2p and add endpoint building

* fix: transfer failure and rm rdma info and

* fix: run pre-commit to lint

* fix: remove unnecessary print log

* chore: lint code

* feat: add async migration mode

* chore: modify log printing and class override

* chore: rm info in annotations

* fix: migration address key
oliveagle pushed a commit to oliveagle/lmdeploy that referenced this pull request May 22, 2026

4 participants