Conversation
# Hack to count prompt tokens
tokenizer_cache: Dict[str, AutoTokenizer] = {}
nit: would we want to make this an LRU cache or something? not sure how big the tokenizers can get
~3MB per model. good call, will add an LRU cache
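For reference, a minimal sketch of the LRU idea, assuming transformers' AutoTokenizer; the helper name and maxsize are hypothetical, not the actual repo code:

```python
from functools import lru_cache

from transformers import AutoTokenizer


@lru_cache(maxsize=32)  # hypothetical bound; each tokenizer is only a few MB
def get_tokenizer(model_name: str) -> AutoTokenizer:
    # Loaded once per model, then served from the cache on later calls.
    return AutoTokenizer.from_pretrained(model_name)
```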
integration_tests/test_endpoints.py
delete_model_endpoint(create_endpoint_request["name"], user)
@pytest.mark.skip(reason="test doesn't currently work, needs to figure out s3 fallback")
for the integration tests before deploy, I'd like to check s3 as well. Since this comment, I've changed the skip to a skipif that only skips in the CircleCI env
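Something along these lines (a rough sketch; the CIRCLECI env var check and test name are assumptions):

```python
import os

import pytest


@pytest.mark.skipif(
    os.environ.get("CIRCLECI") == "true",  # assumed way of detecting the CircleCI env
    reason="s3 fallback isn't available in the CircleCI environment",
)
def test_completions_with_s3_fallback():  # hypothetical test name
    ...
```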
model-engine/model_engine_server/domain/gateways/llm_artifact_gateway.py
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py
def mock_boto3_session(fake_files: List[str]):
    mock_session = mock.Mock()
moto could be an option here, though this mocking can be fine too since you're already at the last layer, the S3 artifact gateway (we want to avoid mocking across multiple layers).
ah, moto seems cool, useful to know for the future. I wanted to create a custom side effect, and this mocking lends itself more easily to that
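Roughly the shape of the custom-side-effect approach (a sketch only; the client method used and the fake_files format are assumptions, not the actual helper):

```python
from typing import List
from unittest import mock


def mock_boto3_session(fake_files: List[str]) -> mock.Mock:
    mock_session = mock.Mock()
    mock_client = mock_session.client.return_value

    def fake_download_file(Bucket, Key, Filename, *args, **kwargs):
        # Only "download" keys we listed as fake files; real boto3 would raise
        # a ClientError here, but a plain exception keeps the sketch simple.
        if Key not in fake_files:
            raise FileNotFoundError(Key)
        with open(Filename, "w") as f:
            f.write("fake contents")

    mock_client.download_file.side_effect = fake_download_file
    return mock_session
```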
args["parameters"]["do_sample"] = False
if request.return_token_log_probs:
    args["parameters"]["return_details"] = True
num_prompt_tokens = count_tokens(
i'm okay with doing tokenization this way for less-used frameworks, but for the more important models can we move tokenization into the framework itself?
What about moving this fall-back tokenization into the forwarder? That would then have less overhead in the gateway, which also supports high QPS routes like get/post tasks.
yeah IMO the forwarder feels like the more "natural" place to put token counting; this way we'd only have to download one tokenizer in the forwarder, and we offload the computation to something that scales up more in proportion with load
this does mean that the forwarder is gonna have to know to carry out this token-counting logic exactly when it's forwarding to an LLM though, which will mean there are different "modes" for the forwarder (e.g. not-LLM, where it just passes requests through, and LLM, where it maybe does some specific processing and then passes requests through)
upstreaming all changes to the framework will be the goal, but this is a temporary stopgap
> this does mean that the forwarder is gonna have to know to carry out this token-counting logic exactly when it's forwarding to an LLM though, which will mean there are different "modes" for the forwarder (e.g. not-LLM, where it just passes requests through, and LLM, where it maybe does some specific processing and then passes requests through)
thought about this for emitting token metrics in the gateway vs. the forwarder as well. I think it's worth a larger discussion after this PR
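To make the "modes" idea concrete, a purely illustrative sketch (names and structure are hypothetical, not the actual forwarder):

```python
from dataclasses import dataclass
from typing import Any, Dict


def count_tokens(prompt: str) -> int:
    # Placeholder for real tokenizer-based counting.
    return len(prompt.split())


@dataclass
class Forwarder:
    is_llm_endpoint: bool  # the "mode": plain pass-through vs. LLM-specific handling

    def preprocess(self, request: Dict[str, Any]) -> Dict[str, Any]:
        # In LLM mode, do LLM-specific work (e.g. count prompt tokens) before
        # the request is forwarded downstream; otherwise pass it through as-is.
        if self.is_llm_endpoint:
            request["num_prompt_tokens"] = count_tokens(request.get("prompt", ""))
        return request
```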
yunfeng-scale left a comment:
what's the current ephemeral disk size for model engine pods? should we add more?
model-engine/model_engine_server/infra/repositories/live_tokenizer_repository.py
we've set it to 128Mi, this should be sufficient atm
unclear to me whether this is enough, since I think model engine previously ran out of disk when using around 100MB of space?
yixu34 left a comment:
Looks good for a V0 of token counting. If needed later on, we can consider in-framework tokenization and/or doing tokenization in the forwarder.
We should carefully monitor error rates and token latency/throughput during this rollout.
model-engine/model_engine_server/infra/repositories/live_tokenizer_repository.py
| "model_engine_server.infra.gateways.s3_llm_artifact_gateway.os.makedirs", | ||
| lambda *args, **kwargs: None, # noqa | ||
| ) | ||
| def test_s3_llm_artifact_gateway_download_folder(llm_artifact_gateway, fake_files): |
Thank you for bumping up our test coverage! 🤜🏻 🤛🏻
Pull Request Summary
For all completions, return the number of tokens in the prompt as num_prompt_tokens. We get the prompt token count through a waterfall of fallbacks:
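The exact fallback chain isn't spelled out above, but as a hypothetical illustration of the waterfall idea (names and ordering are assumptions):

```python
from typing import Optional

from transformers import AutoTokenizer


def count_prompt_tokens(
    prompt: str,
    framework_reported_count: Optional[int],
    tokenizer: AutoTokenizer,
) -> int:
    # 1. Prefer a count reported by the inference framework itself, if any.
    if framework_reported_count is not None:
        return framework_reported_count
    # 2. Otherwise fall back to tokenizing the prompt with the model's tokenizer.
    return len(tokenizer.encode(prompt))
```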
Test Plan and Usage Guide
Ran the skipped integration test locally. Need to figure out how to enable it for CircleCI.
Compared token counts to https://huggingface.co/spaces/Xenova/the-tokenizer-playground