Conversation

yixu34 left a comment
Do we have a use-case-level test that verifies this behavior? If there is a prior test that exercises infer_hardware_from_model_name, we can probably just route to that one.
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py
    LLMInferenceFramework.TENSORRT_LLM: [],
}

# We need a dict where if we need to override we can
What's the reason we're getting rid of the max_model_len/max_num_batched_tokens args to vLLM?
I forgot why we were doing this in the first place, but I'm pretty certain that in recent versions of vLLM this is no longer needed. I also checked most of these models; config.json has the same max_position_embeddings as the values here.
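A minimal sketch of the point above: recent vLLM derives the context length from the model's own config.json when no explicit max_model_len is passed, so the override dict is redundant. The dict literal below is a hypothetical stand-in for a parsed config.json, not any real model's full config:

```python
# Stand-in for a parsed config.json (hypothetical values, e.g. Llama 3 8B's
# 8192-token context window).
config = {
    "max_position_embeddings": 8192,
}

# Without an explicit override, the effective max model length can simply be
# taken from the model config itself.
max_model_len = config["max_position_embeddings"]
print(max_model_len)  # 8192
```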
yixu34 left a comment
- Cleanup prints
- Unrelated changes?
    f"Memory calculation result: {min_memory_gb=} for {model_name}, min_kv_cache_size: {min_kv_cache_size}, model_weights_size: {model_weights_size}"
)

if min_memory_gb <= 24:
Probably not really an issue, but how well do these map to Azure instance types?
With my limited experience on Azure, they do provide the same set of GPUs. @squeakymouse wdyt?
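For concreteness, here is a hypothetical sketch of how the memory estimate could fan out into GPU choices. Only the 24 GB cutoff appears in the diff above; the other thresholds and the GPU type names are invented for illustration and are not the PR's actual mapping:

```python
import math

# Hypothetical tiering: only the "<= 24 GB" branch is from the diff; the
# 80 GB tier and the type names ("a10", "a100") are illustrative guesses.
def choose_gpu(min_memory_gb: float) -> tuple[str, int]:
    if min_memory_gb <= 24:   # fits a single 24 GB card (e.g. an A10)
        return ("a10", 1)
    if min_memory_gb <= 80:   # e.g. a single 80 GB A100
        return ("a100", 1)
    # Otherwise shard across multiple 80 GB cards.
    return ("a100", math.ceil(min_memory_gb / 80))

print(choose_gpu(16))   # ('a10', 1)
print(choose_gpu(200))  # ('a100', 3)
```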
) -> CreateDockerImageBatchJobResourceRequests:
    config = llm_artifact_gateway.get_model_config(checkpoint_path)

    dtype_size = 2
guess we're gonna handle quantization later?
Quantization can already be handled here, but I chose not to update dtype_size since, in my experience, quantization usually makes things slower, not faster (at least for bitsandbytes and AWQ), so to achieve the same speed we'd still need the same number of GPUs.
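A hedged sketch of the model_weights_size term from the logged memory calculation above. dtype_size = 2 assumes fp16/bf16 weights, and per the discussion, quantization is deliberately not reflected here; the exact formula in the PR may differ:

```python
DTYPE_SIZE = 2  # bytes per parameter (fp16/bf16)

def model_weights_size_gb(num_params: float) -> float:
    # Approximate on-GPU weight footprint in GB. Ignores small extras like
    # the embedding layer, as noted in the PR summary.
    return num_params * DTYPE_SIZE / 1e9

# A ~7B-parameter model needs roughly 14 GB just for fp16 weights.
print(model_weights_size_gb(7e9))  # 14.0
```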
    dtype_size = 2

    min_kv_cache_size = (
IIUC we're implicitly setting this to "batch size = 1 and filling up the context window". This feels reasonable, but would it make sense to add a bit of room for a larger batch size? (I guess for something with a shorter context window, e.g. Llama 2, it makes more sense to add some room; maybe less so for Mixtral.)
For all the existing models I tested, batch size = 1 is a good enough default to arrive at a reasonable number of GPUs. Model builders would have thought about the tradeoffs between model size and GPU size.
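The min_kv_cache_size term being discussed can be sketched as below: one sequence (batch size 1) filling the full context window, with fp16 K and V per layer. The config keys mirror Hugging Face config.json conventions, but the PR's exact expression may differ:

```python
def min_kv_cache_size_gb(config: dict, dtype_size: int = 2) -> float:
    num_layers = config["num_hidden_layers"]
    # GQA models store K/V only for the key/value heads.
    num_kv_heads = config.get("num_key_value_heads",
                              config["num_attention_heads"])
    head_dim = config["hidden_size"] // config["num_attention_heads"]
    context_len = config["max_position_embeddings"]
    # 2x for K and V; batch size fixed at 1.
    return (2 * dtype_size * num_layers * num_kv_heads
            * head_dim * context_len) / 1e9

# A Llama-2-7B-like config: 32 layers, 32 heads, hidden 4096, 4k context.
cfg = {"num_hidden_layers": 32, "num_attention_heads": 32,
       "hidden_size": 4096, "max_position_embeddings": 4096}
print(round(min_kv_cache_size_gb(cfg), 2))  # 2.15
```

This illustrates the reviewer's point: at a 4k context the KV cache is small next to the weights, while very long context windows make this term dominate.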
        f"Num shard {num_shards} must be the same as number of GPUs {gpus} for DeepSpeed."
    )
-if num_shards > gpus:
+if num_shards != gpus:
should we just deprecate the num_shards field at this point?
Yes, we should, but I'd prefer not to do it in this PR.
    and request.storage is None
):
    raise ObjectHasInvalidValueException(
        "All hardware spec fields (gpus, gpu_type, cpus, memory, storage) must be provided if any hardware spec field is missing."
nit: "... if any hardware spec field is provided"?
#515 (comment)
I guess it works both ways, since I'm only allowing two states here: either all fields are provided, or none of them are.
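A runnable sketch of the all-or-none rule being discussed. The Request dataclass and the standalone exception class are stand-ins for the real types in the snippet above; validate_hardware_spec is a hypothetical helper name:

```python
from dataclasses import dataclass
from typing import Optional

class ObjectHasInvalidValueException(Exception):
    """Stand-in for the domain exception used in the PR."""

@dataclass
class Request:
    gpus: Optional[int] = None
    gpu_type: Optional[str] = None
    cpus: Optional[int] = None
    memory: Optional[str] = None
    storage: Optional[str] = None

def validate_hardware_spec(request: Request) -> None:
    fields = [request.gpus, request.gpu_type, request.cpus,
              request.memory, request.storage]
    provided = [f is not None for f in fields]
    # Only two valid states: all fields provided, or none of them.
    if any(provided) and not all(provided):
        raise ObjectHasInvalidValueException(
            "All hardware spec fields (gpus, gpu_type, cpus, memory, "
            "storage) must be provided if any hardware spec field is provided."
        )

validate_hardware_spec(Request())  # OK: nothing provided, hardware is inferred
try:
    validate_hardware_spec(Request(gpus=1))  # partial spec is rejected
except ObjectHasInvalidValueException:
    print("rejected partial spec")
```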
seanshi-scale left a comment
had a few questions/nits, but lgtm!
Pull Request Summary

Infer hardware specs from model name, so these fields are now optional.
Formula used: minimum GPU memory = model weights size + minimum KV cache size (the min_memory_gb in the logged calculation above).
- We hard-code the param count for MoE models right now.
- We omit some other small weights, like the embedding layer and the MLP layer after the transformer blocks.
- This estimate is mostly correct, with some issues on long-context-window models (investigation TBD).

Test Plan and Usage Guide

Added unit tests and created Llama 3 8B and CodeLlama endpoints.