Add docs for Model.create, update default values and fix per_worker concurrency#332
yunfeng-scale merged 11 commits into main from yunfeng-retry-cds
Conversation
clients/python/llmengine/model.py (Outdated)

     checkpoint_path (`Optional[str]`):
-        Path to the checkpoint for the LLM. For now we only support loading a tar file from AWS S3.
+        Path to the checkpoint for the LLM. Can be either a folder (preferred since there's no untar) or a tar file.
> Can be either a folder

I think we want to be precise and consistent around our guidance relating to remote files/directories. Would suggest cross-checking with the Files API and making sure that the explanation around this makes sense. e.g. we may want to clarify that this is meant to be some remote path that's accessible by the LLM Engine deployment, etc.

> Can be either a folder (preferred since there's no untar)

Might be good to explain why skipping the untar is desirable - is it cold start time?
Checking the code, I don't think we only support creating endpoints from files created with the Files API.

Yes, skipping the untar is good for cold start time.
clients/python/llmengine/model.py (Outdated)

-    throughput requirements. 2. Determine a value for the maximum number of
-    concurrent requests in the workload. Divide this number by ``max_workers``. Doing
-    this ensures that the number of workers will "climb" to ``max_workers``.
+    Number of Uvicorn workers per pod. Recommendation is set to 2.
Might want to hide details of Uvicorn.
Also I don't think this is correct? IIRC per_worker is basically a scaling sensitivity parameter. cc @seanshi-scale
Hmm, this actually decides both the number of workers and the HPA target concurrency: https://github.com/scaleapi/llm-engine/blob/main/model-engine/model_engine_server/infra/gateways/k8s_resource_parser.py#L65

I'll make some updates.
I didn't understand why we want to set some ratio between the HPA target and per_worker, so I'm removing the ratio.
     import re
     from typing import Union

+    MAX_CONCURRENCY_TO_TARGET_CONCURRENCY_RATIO = 2.0
Wouldn't this technically change our autoscaling sensitivity? cc @seanshi-scale
Which might be fine, just being aware of production behavior changes.
Yes, this would, but I think people are not aware of this ratio at all (e.g. a user specifies a per-worker concurrency target of 10, and the HPA actually uses 5).
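To make the discussion above concrete, here is a minimal sketch of how a per-worker concurrency target could be divided by the ratio to produce the HPA target. This is illustrative only, not the actual `k8s_resource_parser.py` implementation; the function name `hpa_target_concurrency` is invented, while the constant name and the 2.0 value come from the diff hunk above.

```python
# Illustrative sketch (not the real k8s_resource_parser code): the
# user-facing per-worker concurrency target is divided by a fixed
# ratio to get the value the HPA actually scales on.
MAX_CONCURRENCY_TO_TARGET_CONCURRENCY_RATIO = 2.0


def hpa_target_concurrency(per_worker: int) -> float:
    """Translate a per-worker concurrency target into an HPA target.

    Hypothetical helper: dividing by the ratio means the HPA scales
    up before each worker reaches its nominal max concurrency.
    """
    return per_worker / MAX_CONCURRENCY_TO_TARGET_CONCURRENCY_RATIO


# A user who sets per_worker=10 ends up with an HPA target of 5,
# which is the surprise discussed in the thread above.
print(hpa_target_concurrency(10))  # 5.0
```

This illustrates why removing the ratio makes behavior more predictable: with the ratio gone, the value the user specifies would be the value the HPA uses directly.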
yixu34 left a comment:
One more high-level comment: after
yixu34 left a comment:
The only things I had left were around the examples:
- framework version tag
- labels