[Feature] mm support prefix cache #4134
Conversation
Thanks for your contribution!
Pull Request Overview
This PR introduces comprehensive multimodal (MM) support for prefix caching, encoder caching, and processor caching, along with adjustments to V1 scheduling logic.
- Multimodal prefix cache support with hash-based block identification for image/video content
- Encoder cache management to store processed multimodal features and reduce redundant computations
- Processor cache system for storing preprocessed multimodal data with ZMQ-based IPC communication
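As a rough illustration of the hash-based block identification, here is a minimal sketch; the function names and the exact chaining scheme are assumptions for exposition, not the actual fastdeploy/multimodal/hasher.py API:

```python
import hashlib
from typing import List, Union


def mm_hash(content: Union[bytes, str]) -> str:
    """Stable content hash for a single image/video item (hypothetical helper)."""
    data = content.encode("utf-8") if isinstance(content, str) else content
    return hashlib.sha256(data).hexdigest()


def block_hashes(token_ids: List[int], mm_hashes: List[str], block_size: int = 64) -> List[str]:
    """Chain each block hash with its parent so a block can be reused only
    when the whole prefix (text tokens plus multimodal content) matches."""
    hashes: List[str] = []
    parent = ""
    mm_tag = "|".join(mm_hashes)  # fold multimodal identity into the chain
    for start in range(0, len(token_ids), block_size):
        block = token_ids[start : start + block_size]
        digest = hashlib.sha256(f"{parent}:{mm_tag}:{block}".encode()).hexdigest()
        hashes.append(digest)
        parent = digest
    return hashes
```

Under a scheme like this, two requests share a cached block only if every earlier block and every referenced image/video hash match, which is what makes prefix reuse safe for multimodal inputs.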
Reviewed Changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tests/v1/cache_manager/test_prefix_cache.py | Test cases for multimodal prefix caching functionality |
| tests/v1/cache_manager/test_encoder_cache.py | Test cases for encoder cache management |
| fastdeploy/worker/worker_process.py | Added max_encoder_cache argument for worker configuration |
| fastdeploy/worker/gpu_model_runner.py | Encoder cache implementation and multimodal input processing |
| fastdeploy/scheduler/local_scheduler.py | V1 scheduler logic adjustments for better request handling |
| fastdeploy/scheduler/global_scheduler.py | V1 scheduler logic adjustments for global scheduling |
| fastdeploy/multimodal/hasher.py | Multimodal content hashing utility for cache identification |
| fastdeploy/input/preprocess.py | Added processor cache enablement parameter |
| fastdeploy/input/ernie4_5_vl_processor/process.py | Processor cache integration with ZMQ communication |
| fastdeploy/input/ernie4_5_vl_processor/ernie4_5_vl_processor.py | Processor cache enablement in ERNIE processor |
| fastdeploy/entrypoints/openai/protocol.py | Added mm_hashes field to API protocol |
| fastdeploy/entrypoints/openai/api_server.py | Added max_processor_cache configuration |
| fastdeploy/entrypoints/engine_client.py | Processor cache configuration in engine client |
| fastdeploy/entrypoints/chat_utils.py | Updated chat message parsing for UUID-based multimodal content |
| fastdeploy/engine/sched/resource_manager_v1.py | Resource manager integration with multimodal caches |
| fastdeploy/engine/request.py | Added ImagePosition dataclass for multimodal positioning |
| fastdeploy/engine/engine.py | Added encoder cache configuration to worker startup |
| fastdeploy/engine/common_engine.py | Simplified available blocks calculation |
| fastdeploy/engine/args_utils.py | Added encoder and processor cache configuration arguments |
| fastdeploy/config.py | Configuration updates for encoder and processor cache settings |
| fastdeploy/cache_manager/prefix_cache_manager.py | Multimodal prefix cache implementation with hash-based block matching |
| fastdeploy/cache_manager/multimodal_cache_manager.py | Base classes for multimodal cache management |
| fastdeploy/cache_manager/cache_metrics.py | Updated log file name from prefix_cache_manager.log to cache_manager.log |
| fastdeploy/cache_manager/cache_data.py | Updated log file name from prefix_cache_manager.log to cache_manager.log |
| docs/zh/usage/log.md | Updated log file name documentation |
| docs/usage/log.md | Updated log file name documentation |
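On the ZMQ-based IPC used by the processor cache (fastdeploy/input/ernie4_5_vl_processor/process.py above), here is a minimal sketch of a cache server answering get/put requests over an IPC socket; the endpoint, message schema, and unbounded dict are assumptions, not the PR's actual protocol:

```python
import zmq


def serve_processor_cache(endpoint: str = "ipc:///tmp/processor_cache.ipc") -> None:
    """Hypothetical cache server: clients send ("get"|"put", mm_hash, value)
    tuples and receive the cached preprocessed data (or None on a miss)."""
    cache: dict = {}  # unbounded for brevity; a real cache would enforce a size limit
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REP)
    sock.bind(endpoint)
    while True:
        op, key, value = sock.recv_pyobj()
        if op == "get":
            sock.send_pyobj(cache.get(key))
        else:  # "put"
            cache[key] = value
            sock.send_pyobj(True)
```

Keeping the cache in a single server process lets multiple API worker processes share preprocessed multimodal data without re-running the processor.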
```python
max_encoder_cache: int = -1
"""
Maximum number of tokens in the encoder cache.
"""
```
Why was the unit here changed to a token count instead of a concrete size (e.g. bytes)?
The concrete size of the encoder cache equals image_token * hidden_size * dtype size, so it is directly proportional to the token count; this representation is the same as vLLM's.
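A quick back-of-the-envelope check of that relationship, with illustrative values for hidden size and dtype (not the PR's defaults):

```python
# Encoder cache memory scales linearly with the token budget:
#   bytes = num_tokens * hidden_size * dtype_size
num_tokens = 16384   # e.g. max_encoder_cache = 16384
hidden_size = 4096   # illustrative encoder hidden size
dtype_size = 2       # bfloat16 = 2 bytes per element

cache_bytes = num_tokens * hidden_size * dtype_size
print(f"{cache_bytes / 2**20:.0f} MiB")  # 16384 * 4096 * 2 = 128 MiB
```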
```python
            break
        return can_schedule

    def _update_mm_hashes(self, request):
```
In current testing, roughly how long does hashing an image take? We should check later whether it needs optimization.
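To answer that empirically, a micro-benchmark along these lines measures the per-image hashing cost; the payload size and the choice of SHA-256 over raw bytes are assumptions:

```python
import hashlib
import os
import time

# Time SHA-256 over a synthetic ~3 MB "image" payload; the real cost depends
# on image size and on whether raw bytes or decoded pixels are hashed.
payload = os.urandom(3 * 1024 * 1024)

runs = 100
start = time.perf_counter()
for _ in range(runs):
    hashlib.sha256(payload).hexdigest()
elapsed_ms = (time.perf_counter() - start) / runs * 1e3
print(f"~{elapsed_ms:.2f} ms per 3 MB image")
```

On typical CPUs SHA-256 sustains from a few hundred MB/s to a few GB/s, so hashing a few-MB image should land in the low-millisecond range.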
Commit messages from the PR timeline:

- Update expert_service.py; update common_engine.py; update expert_service.py
- …li tool (PaddlePaddle#4558): add collect-env; del files
- …hreshold for cudagraph mode switching (PaddlePaddle#4578): add new branch for sot; reorder; fix batch bug
- [XPU] Moe uses a new operator; update response
- Init; update code; fix code style & disable thinking; adapt for common_engine.update_mm_requests_chunk_size; use 3d rope; use flash_attn_unpadded; opt siglip; update to be compatible with the latest codebase; fix typo; optimize OCR performance; assorted bug fixes; normalize name; modify xpu rope; revert logger; support default_v1; optimize performance (Co-authored-by: root <[email protected]>; zhangyue66 <[email protected]>)
- Add reasoning_tokens into usage info: initial commit; add and fix unit tests; move stream usage to processor; modify processor; modify test_logprobs.py; modify stream reasoning-tokens accumulation
- …Paddle#4531: perf: optimize task queue communication from engine to worker; get_tasks to numpy; get_tasks remove to_numpy; fix request & replace ENV; remove test_e2w_perf.py; fix code style (Co-authored-by: Jiang-Jia-Jun <[email protected]>)
prefix_cache_manager.log renamed as cache_manager.log