Improve TensorRT-LLM Functionality #487
Conversation
…rrtllm backend repo, wip
    output = self.tokenizer.decode(
        tokens[:seq_len],
        skip_special_tokens=self.skip_special_tokens)
    # Adapted from https://github.com/triton-inference-server/tensorrtllm_backend/pull/423
Differs from NVIDIA here
    item_flat_ids += ids
    item_offsets.append(len(ids))

    # Add a case where ids[0] decodes to empty string, then add another set of ids here
Differs from NVIDIA here
This is a partial patch to make some of the stop sequence behavior more functional. When I was trying stop sequences out, I noticed cases where the stop sequence was being ignored. The root cause is that TRT tokenizes the stop sequence and looks for that exact sequence of tokens, but what the model returns isn't always quite that sequence: TRT's tokenization of the stop sequence can contain an extra token that decodes to the empty string. So I patched it so that we also look for the original sequence minus the empty token.
e.g. if you pass in stop-sequence text that tokenizes to [1,2,3], the model can output the sequence [..., 2,3], where [2,3] also decodes to the same text.
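The mismatch described above can be sketched with a toy matcher. This is a minimal illustration, not the actual backend code; the function name, the token ids, and the `empty_token_id` parameter are all hypothetical:

```python
# Toy sketch of the token-level stop-sequence mismatch (hypothetical names).
def find_stop(output_ids, stop_ids, empty_token_id):
    """Return True if output_ids ends with stop_ids, or with stop_ids
    minus a leading token that decodes to the empty string."""
    def ends_with(seq, suffix):
        return len(seq) >= len(suffix) and seq[-len(suffix):] == suffix

    if ends_with(output_ids, stop_ids):
        return True
    # Patch: tokenizing the stop text in isolation may prepend a token
    # that decodes to ""; also try matching without that token.
    if stop_ids and stop_ids[0] == empty_token_id:
        return ends_with(output_ids, stop_ids[1:])
    return False

# Stop text tokenizes to [1, 2, 3] where id 1 decodes to "";
# the model emits [..., 2, 3], which decodes to the same stop text.
assert [9, 2, 3][-3:] != [1, 2, 3]                          # exact match misses
assert find_stop([9, 2, 3], [1, 2, 3], empty_token_id=1)    # patched match hits
```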
Even this doesn't seem right? The stop sequence could be part of [2,3], e.g. tok2 = abc, tok3 = def, stop sequence = cdef. I think the best option is to compare against the string in postprocessing?
Yup, this isn't a complete fix for stop sequences, unfortunately; I think we'd need to spend more time to see whether it's possible with the current framework and how to do it.
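A string-based postprocessing check, along the lines suggested above, could look like the sketch below. This is not part of the PR; the function name is hypothetical, and a real implementation would also need to handle streaming and stop text split across decode calls:

```python
def truncate_at_stop(decoded_text, stop_sequences):
    """Compare stop sequences against the decoded string rather than
    token ids, so matches that straddle token boundaries are caught."""
    cut = len(decoded_text)
    for stop in stop_sequences:
        idx = decoded_text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return decoded_text[:cut]

# tok2 = "abc", tok3 = "def": the stop text "cdef" straddles the token
# boundary, so a token-id comparison never matches, but the string does.
assert truncate_at_stop("abcdef", ["cdef"]) == "ab"
```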
    fake_model_endpoint_service.sync_model_endpoint_inference_gateway.response = SyncEndpointPredictV1Response(
        status=TaskStatus.SUCCESS,
        result={
            "result": '{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":" Machine learning is a branch"}'
may need to figure out why the log probs are not returned properly
    for beam_idx, tokens in enumerate(beam_tokens):
        seq_len = sequence_lengths[batch_idx][beam_idx]
        output = self.tokenizer.decode(
            tokens[:seq_len], skip_special_tokens=self.skip_special_tokens
Why do we restrict to [:seq_len]? What's in tokens outside of seq_len?
    # Adapted from https://github.com/triton-inference-server/tensorrtllm_backend/pull/423
    # This is somewhat of a hack: add a space before the output if the first token starts with a space
    # This may add a space in front of the first token though when we don't want it.
    token_id_string = self.tokenizer.convert_ids_to_tokens(
This is expensive, though? We're effectively decoding twice. If we're only checking the string for the first token, can we just do self.tokenizer.convert_ids_to_tokens(tokens[0], skip_special_tokens=self.skip_special_tokens)?
changed this to just look at the first token
    for batch_idx, beam_tokens in enumerate(tokens_batch):
        for beam_idx, tokens in enumerate(beam_tokens):
            seq_len = sequence_lengths[batch_idx][beam_idx]
            output = self.tokenizer.decode(
should we check for stop token here?
I don't remember seeing stop tokens in the output when testing at least.
Pull Request Summary
Changes to get tensorrtllm to work with Mixtral
Note: the logprobs returned still aren't correct; haven't investigated.
Test Plan and Usage Guide
Deployed a weights-only quantized Mixtral model, which works