jxy commented Jul 21, 2023

We accept an array of strings and numbers representing tokens, in addition to the current string-valued prompt or content.

This allows direct token input, so that any special tokens can be processed and used on the frontend while constructing the JSON data, before it is sent to the server; the server then does not need to know about or parse special tokens from textual input.

With this, we can use the EOS and BOS tokens used in llama-2-chat models.

jxy commented Jul 21, 2023

As an example, use this for llama-2-chat

$ curl --url http://localhost:8080/completion \
--header "Content-Type: application/json" \
--data '{"prompt":["[INST] <<SYS>>\nRespond\n<</SYS>>\n\nHi [/INST]  Hello!",2,1," [INST] Bye! [/INST]"]}'

jxy mentioned this pull request Jul 21, 2023
SlyEcho previously approved these changes Jul 25, 2023
@ichernev

I'm not sure exactly why, but the latest master plus this patch produces random output, and the server reports the prompt as empty. If the patch is not applied, it works normally. It also doesn't get better with an array prompt.

Tested on 70B-chat-q5_K_M from TheBloke

I also tested this on smaller models yesterday, and I'm pretty sure it worked... 🤯

SlyEcho dismissed their stale review July 25, 2023 12:12

Needs to be tested

SlyEcho commented Jul 25, 2023

If it worked yesterday then maybe it's because of the latest changes I requested.

jxy commented Jul 25, 2023

fixed a typo, please check again.

@ichernev

> fixed a typo, please check again.

Yep, it works now :)

SlyEcho commented Jul 25, 2023

I think the tokenization endpoint should not behave the way it does right now. It was useful before for getting the raw tokens, but now it inserts a BOS and a space as well, changing the outcome.

@ichernev

> I think the tokenization endpoint should not behave the way it does right now. It was useful before for getting the raw tokens, but now it inserts a BOS and a space as well, changing the outcome.

It makes sense, if you pass an array (with custom tokens), not to mangle the input in any way. For a pure-string prompt it makes a bit more sense to do some prep.

Now the first BOS is added automatically, but inter-conversation ones are added manually... If anything, it makes sense for the server to accept a list of strings (a conversation) and add the special tokens and other formatting automatically (that would be a high-level API for conversations).
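
As a rough sketch of that idea, done purely on the client side for now (the helper below is hypothetical and not part of the server; 2 and 1 are the llama-2 EOS/BOS ids used elsewhere in this thread):

def build_prompt(turns):
    """Join completed chat turns into a mixed prompt array for /completion,
    inserting EOS (2) and BOS (1) between turns as llama-2-chat expects."""
    prompt = []
    for i, turn in enumerate(turns):
        if i > 0:
            prompt += [2, 1]  # close the previous turn, open the next one
        prompt.append(turn)
    return prompt

# e.g. build_prompt(["[INST] Hi [/INST]  Hello!", " [INST] Bye! [/INST]"])
#      -> ["[INST] Hi [/INST]  Hello!", 2, 1, " [INST] Bye! [/INST]"]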

SlyEcho commented Jul 25, 2023

> It makes sense, if you pass an array (with custom tokens), not to mangle the input in any way. For a pure-string prompt it makes a bit more sense to do some prep.

That's my idea as well. It should actually check whether there is already a space in front, because the space is only added as a quality measure.

But the /tokenize endpoint should just convert the text to tokens as-is. That way the result can be used in the prompt array in subsequent calls.

jxy commented Jul 25, 2023

Check out llama's tokenizer (sentencepiece):

>>> from llama.tokenizer import Tokenizer
>>> tokenizer = Tokenizer('tokenizer.model')
>>> tokenizer.encode('Hello',bos=True,eos=False)
[1, 15043]
>>> tokenizer.encode(' Hello',bos=True,eos=False)
[1, 29871, 15043]
>>> tokenizer.encode('  Hello',bos=True,eos=False)
[1, 259, 15043]

And what we have now:

$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":"Hello"}'
{"tokens":[1,15043]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":" Hello"}'
{"tokens":[1,29871,15043]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":"  Hello"}'
{"tokens":[1,259,15043]}

And with an array:

$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":["Hello"]}'
{"tokens":[1,15043]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":[" Hello"]}'
{"tokens":[1,29871,15043]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":["  Hello"]}'
{"tokens":[1,259,15043]}

So:

  1. We always have to add the space to be compatible with llama's tokenizer.
  2. Whether we want to add BOS/EOS for the tokenize endpoint is another question.

SlyEcho commented Jul 26, 2023

> We always have to add the space to be compatible with llama's tokenizer.

I don't know if we should emulate sentencepiece rather than provide an interface to llama.cpp's tokenizer.

Some other models may behave differently as well. I tested OpenLLaMA, and there "Hello" and " Hello" give two different token numbers; however, the model seems to understand both of them 🤷

> Whether we want to add BOS/EOS for the tokenize endpoint is another question.

Since they are fixed constants, the caller of the API can add them when needed. Knowing what they are requires some information about the model, which we could expose from some endpoint, but they have been pretty much the same in all models anyway.

SlyEcho commented Jul 26, 2023

works

jxy commented Jul 26, 2023

Given the existing confusion about the tokenizer in issues and PRs (#1501, #1931, #2023, #2310, #2315), it seems best to make the tokenize endpoint compatible with LLaMA's Tokenizer or transformers.

Current behavior:

$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":"Hello"}'
{"tokens":[15043]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":" Hello"}'
{"tokens":[29871,15043]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":"  Hello"}'
{"tokens":[259,15043]}

which conforms to llama.tokenizer:

>>> from llama.tokenizer import Tokenizer
>>> tokenizer = Tokenizer('tokenizer.model')
>>> tokenizer.encode('Hello',bos=False,eos=False)
[15043]
>>> tokenizer.encode(' Hello',bos=False,eos=False)
[29871, 15043]
>>> tokenizer.encode('  Hello',bos=False,eos=False)
[259, 15043]

The completion endpoint, on the other hand, prefixes a BOS in either of two cases:

  1. the prompt is a string, or
  2. the prompt is an array and the first element of the prompt array is a string.

If llama.cpp's tokenizer changes in the future, we can change the server behavior accordingly.
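
So a caller that wants full control can get raw tokens from /tokenize, add BOS/EOS itself, and send everything through the prompt array. A minimal sketch (the post helper here is hypothetical; it assumes a local server on port 8080, and 1/2 are the BOS/EOS ids shown above):

import json
import urllib.request

def post(path, payload):
    # Hypothetical convenience wrapper for this sketch, not part of the server API.
    req = urllib.request.Request(
        "http://localhost:8080" + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

raw = post("/tokenize", {"content": " Hello"})["tokens"]  # raw tokens, no BOS added
prompt = [1] + raw + [2]                                  # caller adds BOS/EOS explicitly
result = post("/completion", {"prompt": prompt})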

SlyEcho commented Jul 26, 2023

The /tokenize endpoint should give raw output, like it does now, so it works the way llama.cpp works internally.

Prepending a space or BOS or whatever in /completion is probably not ideal, but since this PR adds the possibility of entering any tokens the user wants, it solves all kinds of issues, like special token encoding, etc.

jxy commented Jul 26, 2023

Just tested it with the Open Assistant fine-tune, openassistant-llama2-13b-orca-8k-3319.ggmlv3.q4_K_S.bin. This works:

$ curl http://localhost:8080/completion --header "Content-Type: application/json" \
--data '{"prompt":[1,32003,"Answer me.",2,32005,"What is the question to the answer to the ultimate question of life, the universe, and everything?",2,32001]}'

jxy referenced this pull request Jul 28, 2023
* add: server chat mode with llama2

* fix: remove the unnecessary last \n
jxy commented Aug 4, 2023

We'd better have a way to use token numbers in stop too. How do you feel about nested arrays in JSON? Perhaps something like "stop": ["stop #1", [32000]]; basically, we could use arrays instead of plain strings.

SlyEcho commented Aug 4, 2023

Yeah, I think it could work, but the stop checker works at the text level right now, so it would need to be enhanced.

jxy commented Aug 5, 2023

With the OpenOrcaxOpenChat-Preview2-13B model, it appears that the tokenizer always inserts a space after special tokens.

From the model card:

# Single-turn V1 Llama 2
tokenize("User: Hello<|end_of_turn|>Assistant:")
# Result: [1, 4911, 29901, 15043, 32000, 4007, 22137, 29901]

From the server:

$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":["User: Hello",32000,"Assistant:"]}'
{"tokens":[4911,29901,15043,32000,7900,22137,29901]}
$ curl http://localhost:8080/tokenize --header "Content-Type: application/json" --data '{"content":["User: Hello",32000," Assistant:"]}'
{"tokens":[4911,29901,15043,32000,4007,22137,29901]}

@nahuel89p

So I was puzzled as to why the custom prompt wasn't preventing openorcaxopenchat-preview2-13b.ggmlv3 from generating a completely made-up conversation after the first question. Good to know the reason and that it's being addressed, thanks!

jhen0409 merged commit b8ad1b6 into ggml-org:master Aug 23, 2023
jxy deleted the prompt-array branch April 10, 2024 02:46