Increase max GPU utilization for 70b models #517
Conversation
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py
Force-pushed from 5a2bb87 to a59cf19
Force-pushed from bc50329 to 1e17ab4
lgtm
@dataclass
class VLLMEngineArgs:
Hm I know this is by no means the main offender, but implementation specifics like vLLM aren't supposed to go into the use case layer. Granted, that'd require another layer, which I suspect @yunfeng-scale would find perfunctory 😁
I guess I could just call it LLMEngineArgs. Right now we only support batch inference w/ vLLM, so we could do a proper abstraction once we decide we need to support a different engine?
Yeah I think this is ok for now.
oh 😅 you had a good point, the current code structure does not completely fit clean architecture. In that sense we might want to move all this framework-specific code to another layer
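A minimal sketch of the layering discussed above, assuming a generic LLMEngineArgs base in the use-case layer with vLLM specifics pushed into a subclass in an infrastructure layer. All names and fields here are illustrative, not the PR's actual code:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LLMEngineArgs:
    # Engine-agnostic arguments that the use-case layer can depend on.
    model: str
    gpu_memory_utilization: Optional[float] = None


@dataclass
class VLLMEngineArgs(LLMEngineArgs):
    # vLLM-specific knobs live here, outside the use-case layer.
    max_num_seqs: Optional[int] = None
```

This keeps the use-case layer free of framework names while still letting batch inference construct the vLLM-specific variant.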
…ization-for-70b-models
Pull Request Summary
Raise max GPU memory utilization to 0.95 for 70b models in an attempt to address OOM issues.
https://linear.app/scale-epd/issue/MLI-2309/use-095-gpu-memory-utilization-for-70b-models
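A hedged sketch of the kind of change described above: bump GPU memory utilization to 0.95 for 70b-class models while leaving other models at vLLM's default of 0.9. The function name and the name-based size check are illustrative assumptions, not the PR's actual implementation:

```python
def gpu_memory_utilization_for(model_name: str) -> float:
    # Hypothetical helper: 70b-class models get 0.95 to reduce OOMs;
    # everything else keeps vLLM's default of 0.9.
    return 0.95 if "70b" in model_name.lower() else 0.9
```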
Test Plan and Usage Guide
Published a test Docker image for batch_inference. Tested with an API request against a local gateway: job ft-cp21h54gfe6g02mlqikg