jxy commented Jul 21, 2023

We accept an array of strings and numbers representing tokens, in addition to the current string-valued prompt or content.

This allows direct token input, so that any special tokens can be processed and used on the frontend while constructing the JSON data, before it is sent to the server; the server then does not need to know about or parse special tokens from textual input.

With this, we can use the EOS and BOS tokens used in llama-2-chat models.

jxy commented Jul 21, 2023

As an example, use this for llama-2-chat

$ curl --url http://localhost:8080/completion \
--header "Content-Type: application/json" \
--data '{"prompt":["[INST] <<SYS>>\nRespond\n<</SYS>>\n\nHi [/INST]  Hello!",2,1," [INST] Bye! [/INST]"]}'

jxy mentioned this pull request Jul 21, 2023
SlyEcho previously approved these changes Jul 25, 2023
@ichernev

I'm not sure exactly why, but the latest master plus this patch produces random output, and the server reports the prompt as empty. If the patch is not applied, it works normally. It also doesn't get better with an array prompt.

Tested on 70B-chat-q5_K_M from TheBloke

I also tested this on smaller models yesterday, and I'm pretty sure it worked... 🤯

SlyEcho dismissed their stale review July 25, 2023 12:12

Needs to be tested

SlyEcho commented Jul 25, 2023

If it worked yesterday then maybe it's because of the latest changes I requested.

jxy commented Jul 25, 2023

fixed a typo, please check again.

@ichernev

> fixed a typo, please check again.

Yep, it works now :)

SlyEcho commented Jul 25, 2023

I think the tokenization endpoint should not behave the way it does right now. It was useful before for getting the raw tokens, but now it inserts a BOS and a space as well, changing the outcome.

@ichernev

> I think the tokenization endpoint should not behave the way it does right now. It was useful before for getting the raw tokens, but now it inserts a BOS and a space as well, changing the outcome.

It makes sense, if you pass an array (with custom tokens), not to mangle the input in any way. For a pure-string prompt it makes a bit more sense to do some prep.

Now the first BOS is added automatically, but inter-conversation ones are added manually... If anything, it makes sense for the server to accept a list of strings (a conversation) and add the special tokens and other formatting automatically (that would be a high-level API for conversations).
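
As a rough sketch of that idea, done purely on the client side for now (the helper below is hypothetical and not part of the server; 2 and 1 are the llama-2 EOS/BOS ids used elsewhere in this thread):

def build_prompt(turns):
    """Join completed chat turns into a mixed prompt array for /completion,
    inserting EOS (2) and BOS (1) between turns as llama-2-chat expects."""
    prompt = []
    for i, turn in enumerate(turns):
        if i > 0:
            prompt += [2, 1]  # close the previous turn, open the next one
        prompt.append(turn)
    return prompt

# e.g. build_prompt(["[INST] Hi [/INST]  Hello!", " [INST] Bye! [/INST]"])
#      -> ["[INST] Hi [/INST]  Hello!", 2, 1, " [INST] Bye! [/INST]"]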

SlyEcho commented Jul 25, 2023

> It makes sense, if you pass an array (with custom tokens), not to mangle the input in any way. For a pure-string prompt it makes a bit more sense to do some prep.

That's my idea as well. It should actually check whether there is already a space in front, because the space is only added as a quality measure.

But the /tokenize endpoint should just convert the text to tokens as-is. That way the result can be used in the prompt array in subsequent calls.

jxy commented Jul 25, 2023

Check out llama's tokenizer (sentencepiece):

>>> from llama.tokenizer import Tokenizer
>>> tokenizer = Tokenizer('tokenizer.model')
>>> tokenizer.encode('Hello',bos=True,eos=False)
[1, 15043]
>>> tokenizer.encode(' Hello',bos=True,eos=False)
[1, 29871, 15043]
>>> tokenizer.encode('  Hello',bos=True,eos=False)
[1, 259, 15043]

And what we have now:

$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":"Hello"}'
{"tokens":[1,15043]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":" Hello"}'
{"tokens":[1,29871,15043]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":"  Hello"}'
{"tokens":[1,259,15043]}

And with an array:

$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":["Hello"]}'
{"tokens":[1,15043]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":[" Hello"]}'
{"tokens":[1,29871,15043]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":["  Hello"]}'
{"tokens":[1,259,15043]}

So:

  1. We always have to add the space to be compatible with llama's tokenizer.
  2. Whether we want to add BOS/EOS for the tokenize endpoint is another question.

SlyEcho commented Jul 26, 2023

> We always have to add the space to be compatible with llama's tokenizer.

I don't know if we should emulate sentencepiece rather than provide an interface to llama.cpp's tokenizer.

Some other models may behave differently as well. I tested OpenLLaMA, and there "Hello" and " Hello" give two different token numbers; however, the model seems to understand both of them 🤷

> Whether we want to add BOS/EOS for the tokenize endpoint is another question.

Since they are fixed constants, the caller of the API can add them when needed. Knowing what they are requires some information about the model, which we could expose from some endpoint, but they have been pretty much the same in all models anyway.

SlyEcho commented Jul 26, 2023

works

jxy commented Jul 26, 2023

Given the existing confusion about the tokenizer in issues and PRs (#1501, #1931, #2023, #2310, #2315), it seems best to make the tokenize endpoint compatible with LLaMA's Tokenizer or transformers.

Current behavior:

$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":"Hello"}'
{"tokens":[15043]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":" Hello"}'
{"tokens":[29871,15043]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":"  Hello"}'
{"tokens":[259,15043]}

which conforms to llama.tokenizer:

>>> from llama.tokenizer import Tokenizer
>>> tokenizer = Tokenizer('tokenizer.model')
>>> tokenizer.encode('Hello',bos=False,eos=False)
[15043]
>>> tokenizer.encode(' Hello',bos=False,eos=False)
[29871, 15043]
>>> tokenizer.encode('  Hello',bos=False,eos=False)
[259, 15043]

The completion endpoint, on the other hand, prefixes a BOS in either of two cases:

  1. the prompt is a string, or
  2. the prompt is an array and the first element of the prompt array is a string.

If llama.cpp's tokenizer changes in the future, we can change the server behavior accordingly.
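
So a caller that wants full control can get raw tokens from /tokenize, add BOS/EOS itself, and send everything through the prompt array. A minimal sketch (the post helper here is hypothetical; it assumes a local server on port 8080, and 1/2 are the BOS/EOS ids shown above):

import json
import urllib.request

def post(path, payload):
    # Hypothetical convenience wrapper for this sketch, not part of the server API.
    req = urllib.request.Request(
        "http://localhost:8080" + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

raw = post("/tokenize", {"content": " Hello"})["tokens"]  # raw tokens, no BOS added
prompt = [1] + raw + [2]                                  # caller adds BOS/EOS explicitly
result = post("/completion", {"prompt": prompt})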

SlyEcho commented Jul 26, 2023

The /tokenize endpoint should give raw output, like it does now, so it works the way llama.cpp works internally.

Prepending a space or BOS or whatever in /completion is probably not ideal, but since this PR adds the possibility of entering any tokens the user wants, it solves all kinds of issues, like special token encoding, etc.

jxy commented Jul 26, 2023

Just tested it with the Open Assistant fine-tune, openassistant-llama2-13b-orca-8k-3319.ggmlv3.q4_K_S.bin. This works:

$ curl http://localhost:8080/completion --header "Content-Type: application/json" \
--data '{"prompt":[1,32003,"Answer me.",2,32005,"What is the question to the answer to the ultimate question of life, the universe, and everything?",2,32001]}'

jxy referenced this pull request Jul 28, 2023
* add: server chat mode with llama2

* fix: remove the unnecessary last \n
jxy commented Aug 4, 2023

We'd better have a way to use token numbers in stop too. How do you feel about nested arrays in JSON? Perhaps something like "stop": ["stop #1", [32000]]; basically, we could use arrays instead of plain strings.

SlyEcho commented Aug 4, 2023

Yeah, I think it could work, but the stop checker works at the text level right now, so it would need to be enhanced.

jxy commented Aug 5, 2023

With the OpenOrcaxOpenChat-Preview2-13B model, it appears that the tokenizer always inserts a space after special tokens.

From the model card:

# Single-turn V1 Llama 2
tokenize("User: Hello<|end_of_turn|>Assistant:")
# Result: [1, 4911, 29901, 15043, 32000, 4007, 22137, 29901]

From the server:

$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":["User: Hello",32000,"Assistant:"]}'
{"tokens":[4911,29901,15043,32000,7900,22137,29901]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":["User: Hello",32000," Assistant:"]}'
{"tokens":[4911,29901,15043,32000,4007,22137,29901]}

@nahuel89p

So I was puzzled as to why the custom prompt wasn't preventing openorcaxopenchat-preview2-13b.ggmlv3 from generating a completely made-up conversation after the first question. Good to know the reason and that it's being addressed, thanks!

jhen0409 merged commit b8ad1b6 into ggml-org:master Aug 23, 2023
jxy deleted the prompt-array branch April 10, 2024 02:46