Conversation
# Hack to count prompt tokens
tokenizer_cache: Dict[str, AutoTokenizer] = {}
nit: would we want to make this an LRU cache or something? not sure how big the tokenizers can get
~3MB per model. good call, will add an LRU cache
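For reference, a minimal sketch of the LRU idea, assuming transformers' AutoTokenizer; the helper name and maxsize are hypothetical, not the actual repo code:

```python
from functools import lru_cache

from transformers import AutoTokenizer


@lru_cache(maxsize=32)  # hypothetical bound; each tokenizer is only a few MB
def get_tokenizer(model_name: str) -> AutoTokenizer:
    # Loaded once per model, then served from the cache on later calls.
    return AutoTokenizer.from_pretrained(model_name)
```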
integration_tests/test_endpoints.py
delete_model_endpoint(create_endpoint_request["name"], user)
@pytest.mark.skip(reason="test doesn't currently work, needs to figure out s3 fallback")
for the integration tests before deploy, I'd like to check s3 as well. Since this comment, I've changed the skip to a skipif that only skips in the CircleCI env
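Something along these lines (a rough sketch; the CIRCLECI env var check and test name are assumptions):

```python
import os

import pytest


@pytest.mark.skipif(
    os.environ.get("CIRCLECI") == "true",  # assumed way of detecting the CircleCI env
    reason="s3 fallback isn't available in the CircleCI environment",
)
def test_completions_with_s3_fallback():  # hypothetical test name
    ...
```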
model-engine/model_engine_server/domain/gateways/llm_artifact_gateway.py
model-engine/model_engine_server/domain/use_cases/llm_model_endpoint_use_cases.py
def mock_boto3_session(fake_files: List[str]):
    mock_session = mock.Mock()
moto could be an option here, though this mocking can be fine too since you're already at the last layer, the S3 artifact gateway (we want to avoid mocking across multiple layers).
ah, moto seems cool, useful to know for the future. I wanted to create a custom side effect, and this mocking lends itself more easily to that
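Roughly the shape of the custom-side-effect approach (a sketch only; the client method used and the fake_files format are assumptions, not the actual helper):

```python
from typing import List
from unittest import mock


def mock_boto3_session(fake_files: List[str]) -> mock.Mock:
    mock_session = mock.Mock()
    mock_client = mock_session.client.return_value

    def fake_download_file(Bucket, Key, Filename, *args, **kwargs):
        # Only "download" keys we listed as fake files; real boto3 would raise
        # a ClientError here, but a plain exception keeps the sketch simple.
        if Key not in fake_files:
            raise FileNotFoundError(Key)
        with open(Filename, "w") as f:
            f.write("fake contents")

    mock_client.download_file.side_effect = fake_download_file
    return mock_session
```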
args["parameters"]["do_sample"] = False
if request.return_token_log_probs:
    args["parameters"]["return_details"] = True
num_prompt_tokens = count_tokens(
i'm okay with doing tokenization this way for less-used frameworks, but for the more important models can we move tokenization into the framework itself?
What about moving this fall-back tokenization into the forwarder? That would then have less overhead in the gateway, which also supports high QPS routes like get/post tasks.
yeah IMO the forwarder feels like the more "natural" place to put token counting; this way we'd only have to download one tokenizer in the forwarder, and we offload the computation to something that scales up more in proportion with load
this does mean that the forwarder is gonna have to know to carry out this token-counting logic exactly when it's forwarding to an LLM though, which will mean there are different "modes" for the forwarder (e.g. not-LLM, where it just passes requests through, and LLM, where it maybe does some specific processing and then passes requests through)
upstreaming all changes to the framework will be the goal, but this is a temporary stopgap
> this does mean that the forwarder is gonna have to know to carry out this token-counting logic exactly when it's forwarding to an LLM though, which will mean there are different "modes" for the forwarder (e.g. not-LLM, where it just passes requests through, and LLM, where it maybe does some specific processing and then passes requests through)
thought about this for emitting token metrics in the gateway vs. the forwarder as well. I think it's worth a larger discussion after this PR
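To make the "modes" idea concrete, a purely illustrative sketch (names and structure are hypothetical, not the actual forwarder):

```python
from dataclasses import dataclass
from typing import Any, Dict


def count_tokens(prompt: str) -> int:
    # Placeholder for real tokenizer-based counting.
    return len(prompt.split())


@dataclass
class Forwarder:
    is_llm_endpoint: bool  # the "mode": plain pass-through vs. LLM-specific handling

    def preprocess(self, request: Dict[str, Any]) -> Dict[str, Any]:
        # In LLM mode, do LLM-specific work (e.g. count prompt tokens) before
        # the request is forwarded downstream; otherwise pass it through as-is.
        if self.is_llm_endpoint:
            request["num_prompt_tokens"] = count_tokens(request.get("prompt", ""))
        return request
```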
yunfeng-scale left a comment:
what's the current ephemeral disk size for model engine pods? should we add more?
model-engine/model_engine_server/infra/repositories/live_tokenizer_repository.py
we've set it to 128Mi, this should be sufficient atm
unclear to me whether this is enough, since I think model engine previously ran out of disk when using around 100MB of space?
yixu34 left a comment:
Looks good for a V0 of token counting. If needed later on, we can consider in-framework tokenization and/or doing tokenization in the forwarder.
We should carefully monitor error rates and token latency/throughput during this rollout.
model-engine/model_engine_server/infra/repositories/live_tokenizer_repository.py
| "model_engine_server.infra.gateways.s3_llm_artifact_gateway.os.makedirs", | ||
| lambda *args, **kwargs: None, # noqa | ||
| ) | ||
| def test_s3_llm_artifact_gateway_download_folder(llm_artifact_gateway, fake_files): |
Thank you for bumping up our test coverage! 🤜🏻 🤛🏻
Pull Request Summary
For all completions, return the number of tokens in the prompt as num_prompt_tokens. We get the prompt token count through a waterfall of fallbacks:
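The exact fallback chain isn't spelled out above, but as a hypothetical illustration of the waterfall idea (names and ordering are assumptions):

```python
from typing import Optional

from transformers import AutoTokenizer


def count_prompt_tokens(
    prompt: str,
    framework_reported_count: Optional[int],
    tokenizer: AutoTokenizer,
) -> int:
    # 1. Prefer a count reported by the inference framework itself, if any.
    if framework_reported_count is not None:
        return framework_reported_count
    # 2. Otherwise fall back to tokenizing the prompt with the model's tokenizer.
    return len(tokenizer.encode(prompt))
```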
Test Plan and Usage Guide
Ran the skipped integration test locally. Need to figure out how to enable it for CircleCI.
Compared token counts to https://huggingface.co/spaces/Xenova/the-tokenizer-playground