LMDeploy Distserve #3304
Merged
636fb5b to 94eee2b
RunningLeon
reviewed
Apr 7, 2025
…pus to ray.init for run in dlc
RunningLeon
reviewed
Apr 14, 2025
@@ -1,5 +1,5 @@
# Copyright (c) OpenMMLab. All rights reserved.

from lmdeploy.disagg.messages import EngineRole, MigrationBackend, MigrationTransportProtocol
Collaborator
We can move this import to after line 307 to avoid unnecessary import time.
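The suggestion amounts to a deferred (function-local) import. A minimal sketch of the pattern, using the small standard-library module `colorsys` purely as a stand-in for the heavier `lmdeploy.disagg.messages` import; the function `to_rgb` is hypothetical and unrelated to LMDeploy:

```python
import sys


def to_rgb(h: float, s: float, v: float):
    # Function-local import: it runs on the first call only, and later
    # calls are served from the sys.modules cache. Code paths that never
    # call this function never pay the import cost.
    import colorsys  # stand-in for a heavier, rarely needed module

    return colorsys.hsv_to_rgb(h, s, v)


rgb = to_rgb(0.0, 1.0, 1.0)
loaded_after = "colorsys" in sys.modules
print(rgb, loaded_after)
```

The trade-off is that import errors surface at call time rather than at module load, which is usually acceptable for an optional feature.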
RunningLeon
reviewed
Apr 14, 2025
1. [PD Connection more efficiently][High Priority] In the DSV3 DP + EP setting, we need to concurrently construct prefill_dp_size (for example, 32) * decode_dp_size (for example, 144) links. We added a function `pd_consolidation_multi_thread` to do this. However, we still need to check whether the construction operation is thread safe.
2. [Combine with proxy] Maybe we should save conn_config to avoid repeatedly reconnecting the PD link.
3. [PD Control Plane][High Priority] For DP + EP, we need to reconstruct DisaggEngineConfig to record more information (e.g. dp_idx, tp_idx ...).
4. [Combine with router][Important] How to perform PD load balancing in disaggregated LLM serving.
5. [PD Data Plane] Adapt to open-source KVCache managers such as Mooncake, infiniStore, or NiXL, and to more transport media.
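Item 1 raises the question of thread safety when all prefill × decode links are built concurrently. One conservative pattern is sketched below. It assumes a hypothetical `connect_link` function standing in for the real link construction (not shown in this PR excerpt) and guards the shared registry with a lock; `build_links` is illustrative and is not the PR's `pd_consolidation_multi_thread`.

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor


def connect_link(prefill_rank: int, decode_rank: int) -> str:
    """Hypothetical stand-in for constructing one prefill-to-decode link."""
    return f"p{prefill_rank}->d{decode_rank}"


def build_links(prefill_dp_size: int, decode_dp_size: int, max_workers: int = 16) -> dict:
    """Build every (prefill, decode) link concurrently.

    The shared dict is only mutated while holding the lock, so the
    registry stays consistent while connect_link calls run in parallel.
    """
    links = {}
    lock = threading.Lock()

    def worker(pair):
        p, d = pair
        link = connect_link(p, d)  # assumed (not verified) to be safe to call concurrently
        with lock:
            links[(p, d)] = link

    pairs = itertools.product(range(prefill_dp_size), range(decode_dp_size))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for all workers and re-raises any worker exception.
        list(pool.map(worker, pairs))
    return links


links = build_links(prefill_dp_size=4, decode_dp_size=6)
print(len(links))  # 4 * 6 = 24
```

With the sizes quoted in item 1 (32 and 144), the same call would build 4608 links; whether the underlying transport tolerates that degree of concurrency is exactly the open question.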
grimoire
reviewed
May 7, 2025
self._loop_main = None

# for migration loop management
self.migration_event = asyncio.Event()
Collaborator
The engine is started lazily, since we might not have an event loop when the engine is created.
I don't know if it is safe to initialize asyncio.Event here.
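For context: on Python versions before 3.10, `asyncio.Event()` captured an event loop at construction time, so creating it in a constructor that runs before the loop exists could bind it to the wrong loop. A minimal sketch of lazy creation follows; the class `Engine` and its methods are hypothetical and only illustrate the pattern, not the code that was merged.

```python
import asyncio
from typing import Optional


class Engine:
    """Toy engine; only the lazy-event pattern is of interest here."""

    def __init__(self) -> None:
        # No asyncio objects are created here: the engine may be
        # constructed before any event loop exists.
        self._migration_event: Optional[asyncio.Event] = None

    @property
    def migration_event(self) -> asyncio.Event:
        # Created on first access, which is expected to happen inside
        # a coroutine running on the loop that will use the event.
        if self._migration_event is None:
            self._migration_event = asyncio.Event()
        return self._migration_event

    async def migration_loop(self) -> str:
        await self.migration_event.wait()
        return "migrated"


async def main(engine: Engine) -> str:
    task = asyncio.create_task(engine.migration_loop())
    await asyncio.sleep(0)        # let the loop task start waiting
    engine.migration_event.set()  # wake it up
    return await task


engine = Engine()                 # constructed with no running loop
result = asyncio.run(main(engine))
print(result)
```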
grimoire
reviewed
May 7, 2025
grimoire
reviewed
May 7, 2025
  cache_block_ids = resp.data.get('cache_block_ids', None)
  if resp.type == ResponseType.SUCCESS:
-     token_ids = resp.data['token_ids'].tolist()
+     token_ids = resp.data['token_ids']
Collaborator
EngineInstance would now output an ndarray instead of list[int]. Is that acceptable, @lvhan028?
grimoire
reviewed
May 7, 2025
grimoire
approved these changes
May 8, 2025
lvhan028
approved these changes
May 8, 2025
oliveagle
pushed a commit
to oliveagle/lmdeploy
that referenced
this pull request
May 22, 2026
* sync main * typo correct * 1. typo 2. add migration event * 1. move slime to 'https://github.com/JimyMa/DLSlime.git' and init readme. * Update disagg README * mute slime when disable distserve * remove build_migration.sh * revert debug code * 1. identify interface. 2. add multi backend registry * add dlslime max transfer batch * add an infinistore interface * add load/store * conditional register of Multi Migration Backend * merge router to proxy * remove redandunt print * 1. remove redandunt print 2. revert safe_run * dsv3 kvtransfer support (bypass v cache) * dsv3 debug, 1. change log info to log debug of log resp. 2. add num_cpus to ray.init for run in dlc * DSV3 Debug, known issue: 1. [PD Connection more efficiently][High Priority] In DSV3 DP + EP condition, we need to concurrently construct prefill_dp_size (for exampe 32) * decode_dp_size(for example 144) links. We add a function `pd_consolidation_multi_thread` to do this. However, we need to check if the construction operation is thread safe. 2. [Combine with proxy] Maybe we should save conn_config to avoid repeatly reconnection of PD Link. 3. [PD Control Plane][High Priority] For DP + EP, we need to reconstruct DisaggEngineConfig to record more information (e.g. dp_idx, tp_idx ...) 4. [Combine with router][Important] How to perform PD Load Balance in disaggregated LLM Serving. 5. [PD Data Plane] adapt to Open Source KVCache manager like Mooncake, infiniStore or NiXL and more transport media. * revert match to if,else * [bugfix] rename typo * [refactor] refactor pd_conn * 1. format code. 2. add engine_role for passing ut test * 1. format code 2. parse dp, ep, and dp rank to DisaggEngineConfig * 1. add pd conn timeout, 2. add default EngineRole to Hybrid, 3. fix disagg strategy proxy typo * 1. refactor PDConnection Pool * refactor debug * fix migration loop bug * add proxy arguments about distserve * bugfix * debug interface * remove unnesessary EngineRole Check. 
* add v1/chat/completions support * remove redundent print * async free cache * async free cache * 1. add some comments. * 1. bugfix * [proxy] add connection_warmup api * 1. bugfix (warmup_connection_typo and wrong args) 2. preserve cache bugfix * [disagg] update readme, 1. fault tolerance and 2. replace router to proxy. * bugfix * fix decode back pressure bug * 1. add migration_request to chat/completions for correctly cache free * 2. free cache bugfix * 1. fix lock running bug * 1. fix dist.broadcast deadlock * [lint] 1. fix lint * rename Ethernet to RoCE * change emun.Enum.__members__[elem] to enum.Enum[elem] directly * update readme * update migration-backend * 1. update readme 2. move module to string for conditional import * 1. update readme * 1. remove migic number and handle long assignments in dlslime. 2. add uniexecutor support * fix error migration in dummy situation * 1. bugfix when token is not a decodable utf-8 (in test) * 1. overlapping migration and forward. * bump dlslime to v0.0.1.post5 * remove print * remove free in decode engine because already freed in proxy * 1. bump dlslime to 0.0.1.post7 * 1. [proxy] revert self.nodes to nodes 2. [api_server] remove redundant api * 1. [cli] remove available_nic args * format comments * [pytorch paging] remove redundant logger * [model_agent] bugfix caused by merge * [model agent] bypass model agent migrate * revert migrate to sync mode * bypass model agent migrate in uni_executor * [proxy] set default serving strategy to DistServe * 1. [disagg] update readme * info -> debug * remove unused code * lazily initialize migration event * add nvlink support * mute TCP support by now * update readme for execption * set migration token_ids output to numpy array * update readme * In PD Disaggregation Mode, fallback next token ids to CPU * 1. [disagg] update readme * move disagg to pytorch backend
What lmdeploy-distserve includes:
State of lmdeploy-distserve:
Next Step
Initialization
The PD Consolidation process establishes peer-to-peer (P2P) connections between system components in two phases. The Router acts as the central orchestrator. First, the Router sends a p2p_initialize message to both the Prefill Server and the Decode Server, so that both sides are prepared for the connection phase.
Once initialization is complete on both the Prefill Server and the Decode Server, the Router establishes the actual P2P connections. It sends a p2p_connect message to the Prefill Server, followed by another p2p_connect message to the Decode Server. Because every component is initialized before any connection is made, no server attempts to connect to a peer that is not yet ready during system startup.
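The two-phase ordering described above can be sketched as follows. This is a toy model: the `Server` class and the `pd_consolidation` function are hypothetical, and only the message names `p2p_initialize` and `p2p_connect` come from the description.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Server:
    """Hypothetical stand-in for a Prefill or Decode server endpoint."""
    name: str
    log: List[str] = field(default_factory=list)
    initialized: bool = False
    connected: bool = False

    def p2p_initialize(self) -> None:
        self.initialized = True
        self.log.append("p2p_initialize")

    def p2p_connect(self, peer: "Server") -> None:
        # Connecting only makes sense once both ends are initialized.
        if not (self.initialized and peer.initialized):
            raise RuntimeError("both servers must be initialized before connecting")
        self.connected = True
        self.log.append(f"p2p_connect:{peer.name}")


def pd_consolidation(prefill: Server, decode: Server) -> None:
    """Router-side orchestration: initialize both, then connect both."""
    # Phase 1: initialize both sides.
    prefill.p2p_initialize()
    decode.p2p_initialize()
    # Phase 2: connect, Prefill Server first, then Decode Server.
    prefill.p2p_connect(decode)
    decode.p2p_connect(prefill)


prefill, decode = Server("prefill"), Server("decode")
pd_consolidation(prefill, decode)
print(prefill.log, decode.log)
```

Calling `p2p_connect` before both sides are initialized raises an error in this sketch, which is the failure the two-phase ordering is meant to rule out.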
Control Plane
The diagram illustrates the workflow and interactions between various components involved in the system's prefill and decode processes. This process is designed to manage tasks efficiently, ensuring smooth operation and scalability.
Prefill Process:
1. The Prefill Server initiates the prefill process by sending a Prefill Message to the Prefill Engine.
2. The Prefill Engine processes the request and generates an `EngineOutput`, which includes details such as `FirstToken` and `CacheBlockIds`.
3. The Prefill Scheduler receives the output from the Prefill Engine and manages task scheduling. Tasks are placed into a Waiting Queue with a status of `Status.WAITING`.
4. Once ready, the tasks are forwarded to the Forward Executor, which processes them with a status of `Status.RUNNING`. The status is then converted to `Status.ToBeMigrated`, and the task's cache is freed once the Decode Engine has finished migrating it.
Decode Process:
1. The Decode Server sends requests to the Decode Engine, which processes the input and generates an `EngineOutput`. This output may include details like `GenToken`.
2. The Decode Scheduler manages the decode tasks and places them into a Migration Queue with a status of `Status.WaitingMigration`.
3. The Migration Executor processes these tasks, transitioning their status to `Status.Running`.
4. Completed tasks are then sent back to the Forward Executor for further processing (Prefill Engine `cache_free`).
Key Features
This structured approach enables seamless coordination between components, facilitating efficient task execution and system control within the Control Plane.
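The status transitions described in this section can be read as two small state machines. A minimal sketch, assuming an illustrative `Status` enum whose member names paraphrase the statuses in the text; `FREED` is a hypothetical terminal state added only to represent the cache being released, and none of this is LMDeploy's actual enum.

```python
from enum import Enum, auto


class Status(Enum):
    WAITING = auto()            # prefill task queued
    RUNNING = auto()            # being processed by an executor
    TO_BE_MIGRATED = auto()     # prefill done, cache awaiting migration
    WAITING_MIGRATION = auto()  # decode side: queued for migration
    FREED = auto()              # hypothetical: prefill cache released


# Allowed next state on each side, as described in the text.
PREFILL_TRANSITIONS = {
    Status.WAITING: Status.RUNNING,
    Status.RUNNING: Status.TO_BE_MIGRATED,
    Status.TO_BE_MIGRATED: Status.FREED,
}
DECODE_TRANSITIONS = {
    Status.WAITING_MIGRATION: Status.RUNNING,
}


def advance(status: Status, transitions: dict) -> Status:
    """Return the next status, or raise if none is defined."""
    try:
        return transitions[status]
    except KeyError:
        raise ValueError(f"no transition from {status.name}") from None


def walk(start: Status, transitions: dict) -> list:
    """Follow transitions from start until a terminal status is reached."""
    path = [start]
    while path[-1] in transitions:
        path.append(advance(path[-1], transitions))
    return [s.name for s in path]


prefill_path = walk(Status.WAITING, PREFILL_TRANSITIONS)
decode_path = walk(Status.WAITING_MIGRATION, DECODE_TRANSITIONS)
print(prefill_path)
print(decode_path)
```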
Data Plane
The diagram illustrates the workflow and interactions between key components responsible for managing cache operations, migration, and load balancing. This process is designed to optimize data handling and ensure efficient resource utilization.
Prefill CacheEngine:
The Prefill CacheEngine handles caching operations for prefill tasks. It interacts with `MigrationBackend.Store` to store cached data, which can be migrated or loaded as needed.
Decode CacheEngine:
The Decode CacheEngine manages caching operations for decode tasks. It interacts with `MigrationBackend.Load` to retrieve cached data when required.
Optional Store Component:
An optional Store component is included, which can be used for additional storage needs. This component may interact with `MigrationBackend.Store` to manage persistent storage or backup mechanisms.
Migration Operations:
Both the Prefill CacheEngine and the Decode CacheEngine use `MigrationBackend.Migrate` to move cached data between storage locations or systems as necessary, maintaining data consistency and availability.
Key Features
This structured approach enables seamless coordination between components, facilitating efficient data handling and system control within the Data Plane.
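The three backend operations named in this section (Store, Load, Migrate) suggest an interface along the following lines. This is a minimal sketch under stated assumptions: `MigrationBackendSketch` and `InMemoryBackend` are hypothetical names, the method signatures are invented for illustration, and a Python dict stands in for real KV cache blocks and transport.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class MigrationBackendSketch(ABC):
    """Hypothetical interface; not LMDeploy's actual MigrationBackend."""

    @abstractmethod
    def store(self, block_id: int, data: bytes) -> None:
        """Persist one cache block."""

    @abstractmethod
    def load(self, block_id: int) -> bytes:
        """Retrieve one cache block."""

    @abstractmethod
    def migrate(self, block_ids: List[int], target: "MigrationBackendSketch") -> None:
        """Copy the given blocks into another backend."""


class InMemoryBackend(MigrationBackendSketch):
    """Toy backend that keeps blocks in a dict."""

    def __init__(self) -> None:
        self._blocks: Dict[int, bytes] = {}

    def store(self, block_id: int, data: bytes) -> None:
        self._blocks[block_id] = data

    def load(self, block_id: int) -> bytes:
        return self._blocks[block_id]

    def migrate(self, block_ids: List[int], target: MigrationBackendSketch) -> None:
        for block_id in block_ids:
            target.store(block_id, self.load(block_id))


prefill_cache = InMemoryBackend()
decode_cache = InMemoryBackend()
prefill_cache.store(7, b"kv-block-7")
prefill_cache.migrate([7], decode_cache)
print(decode_cache.load(7))
```

Keeping the three operations behind one interface is what would let different transports or external KV cache stores be swapped in without touching the cache engines.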
How to build
pip install dlslime==0.0.1.post2
pip install -v -e .
How to Run
Step 1. Start Prefill Engine
Step 2. Start Decode Engine
Step 3. Start Router