Conversation

yixu34 left a comment
Do we have a use-case-level test that verifies this behavior? If there is a prior test that exercises infer_hardware_from_model_name, we can probably just route to that one.
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py
    LLMInferenceFramework.TENSORRT_LLM: [],
}

# We need a dict where if we need to override we can
What's the reason we're getting rid of the max_model_len/max_num_batched_tokens args to vLLM?
I forgot why we were doing this in the first place, but I'm pretty certain that in recent versions of vLLM this is no longer needed. I also checked most of these models; config.json has the same max_position_embeddings as the values here.
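A minimal sketch of the point above: recent vLLM derives the context length from the model's own config.json when no explicit max_model_len is passed, so the override dict is redundant. The dict literal below is a hypothetical stand-in for a parsed config.json, not any real model's full config:

```python
# Stand-in for a parsed config.json (hypothetical values, e.g. Llama 3 8B's
# 8192-token context window).
config = {
    "max_position_embeddings": 8192,
}

# Without an explicit override, the effective max model length can simply be
# taken from the model config itself.
max_model_len = config["max_position_embeddings"]
print(max_model_len)  # 8192
```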
yixu34 left a comment
- Cleanup prints
- Unrelated changes?
    f"Memory calculation result: {min_memory_gb=} for {model_name}, min_kv_cache_size: {min_kv_cache_size}, model_weights_size: {model_weights_size}"
)

if min_memory_gb <= 24:
Probably not really an issue, but how well do these map to Azure instance types?
With my limited experience on Azure, they do provide the same set of GPUs. @squeakymouse wdyt?
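For concreteness, here is a hypothetical sketch of how the memory estimate could fan out into GPU choices. Only the 24 GB cutoff appears in the diff above; the other thresholds and the GPU type names are invented for illustration and are not the PR's actual mapping:

```python
import math

# Hypothetical tiering: only the "<= 24 GB" branch is from the diff; the
# 80 GB tier and the type names ("a10", "a100") are illustrative guesses.
def choose_gpu(min_memory_gb: float) -> tuple[str, int]:
    if min_memory_gb <= 24:   # fits a single 24 GB card (e.g. an A10)
        return ("a10", 1)
    if min_memory_gb <= 80:   # e.g. a single 80 GB A100
        return ("a100", 1)
    # Otherwise shard across multiple 80 GB cards.
    return ("a100", math.ceil(min_memory_gb / 80))

print(choose_gpu(16))   # ('a10', 1)
print(choose_gpu(200))  # ('a100', 3)
```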
) -> CreateDockerImageBatchJobResourceRequests:
    config = llm_artifact_gateway.get_model_config(checkpoint_path)

    dtype_size = 2
guess we're gonna handle quantization later?
Quantization can already be handled here, but I chose not to update dtype_size since, in my experience, quantization usually makes things slower, not faster (at least for bitsandbytes and AWQ), so to achieve the same speed we'd still need the same number of GPUs.
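A hedged sketch of the model_weights_size term from the logged memory calculation above. dtype_size = 2 assumes fp16/bf16 weights, and per the discussion, quantization is deliberately not reflected here; the exact formula in the PR may differ:

```python
DTYPE_SIZE = 2  # bytes per parameter (fp16/bf16)

def model_weights_size_gb(num_params: float) -> float:
    # Approximate on-GPU weight footprint in GB. Ignores small extras like
    # the embedding layer, as noted in the PR summary.
    return num_params * DTYPE_SIZE / 1e9

# A ~7B-parameter model needs roughly 14 GB just for fp16 weights.
print(model_weights_size_gb(7e9))  # 14.0
```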
    dtype_size = 2

    min_kv_cache_size = (
IIUC we're implicitly setting this to "batch size = 1 and filling up the context window". This feels reasonable, but would it make sense to add a bit of room for a larger batch size? (I guess for something with a shorter context window, e.g. Llama 2, it makes more sense to add some room; maybe less so for Mixtral.)
For all the existing models I tested, batch size = 1 is a good enough default to arrive at a reasonable number of GPUs. Model builders would have thought about the tradeoffs between model size and GPU size.
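The min_kv_cache_size term being discussed can be sketched as below: one sequence (batch size 1) filling the full context window, with fp16 K and V per layer. The config keys mirror Hugging Face config.json conventions, but the PR's exact expression may differ:

```python
def min_kv_cache_size_gb(config: dict, dtype_size: int = 2) -> float:
    num_layers = config["num_hidden_layers"]
    # GQA models store K/V only for the key/value heads.
    num_kv_heads = config.get("num_key_value_heads",
                              config["num_attention_heads"])
    head_dim = config["hidden_size"] // config["num_attention_heads"]
    context_len = config["max_position_embeddings"]
    # 2x for K and V; batch size fixed at 1.
    return (2 * dtype_size * num_layers * num_kv_heads
            * head_dim * context_len) / 1e9

# A Llama-2-7B-like config: 32 layers, 32 heads, hidden 4096, 4k context.
cfg = {"num_hidden_layers": 32, "num_attention_heads": 32,
       "hidden_size": 4096, "max_position_embeddings": 4096}
print(round(min_kv_cache_size_gb(cfg), 2))  # 2.15
```

This illustrates the reviewer's point: at a 4k context the KV cache is small next to the weights, while very long context windows make this term dominate.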
        f"Num shard {num_shards} must be the same as number of GPUs {gpus} for DeepSpeed."
    )
-if num_shards > gpus:
+if num_shards != gpus:
should we just deprecate the num_shards field at this point?
Yes, we should, but I'd prefer not to do it in this PR.
    and request.storage is None
):
    raise ObjectHasInvalidValueException(
        "All hardware spec fields (gpus, gpu_type, cpus, memory, storage) must be provided if any hardware spec field is missing."
nit: "... if any hardware spec field is provided"?
#515 (comment)
I guess it works both ways, since I'm only allowing two states here: either all fields are provided, or none of them are.
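A runnable sketch of the all-or-none rule being discussed. The Request dataclass and the standalone exception class are stand-ins for the real types in the snippet above; validate_hardware_spec is a hypothetical helper name:

```python
from dataclasses import dataclass
from typing import Optional

class ObjectHasInvalidValueException(Exception):
    """Stand-in for the domain exception used in the PR."""

@dataclass
class Request:
    gpus: Optional[int] = None
    gpu_type: Optional[str] = None
    cpus: Optional[int] = None
    memory: Optional[str] = None
    storage: Optional[str] = None

def validate_hardware_spec(request: Request) -> None:
    fields = [request.gpus, request.gpu_type, request.cpus,
              request.memory, request.storage]
    provided = [f is not None for f in fields]
    # Only two valid states: all fields provided, or none of them.
    if any(provided) and not all(provided):
        raise ObjectHasInvalidValueException(
            "All hardware spec fields (gpus, gpu_type, cpus, memory, "
            "storage) must be provided if any hardware spec field is provided."
        )

validate_hardware_spec(Request())  # OK: nothing provided, hardware is inferred
try:
    validate_hardware_spec(Request(gpus=1))  # partial spec is rejected
except ObjectHasInvalidValueException:
    print("rejected partial spec")
```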
seanshi-scale left a comment
had a few questions/nits, but lgtm!
Pull Request Summary

Infer hardware specs from model name, so these fields are now optional.
Formula used: minimum GPU memory = model weights size + minimum KV cache size (the min_memory_gb in the logged calculation above).
- We hard-code the param count for MoE models right now.
- We omit some other small weights, like the embedding layer and the MLP layer after the transformer blocks.
- This estimate is mostly correct, with some issues on long-context-window models (investigation TBD).

Test Plan and Usage Guide

Added unit tests and created Llama 3 8B and CodeLlama endpoints.