server: allow json array in prompt or content for direct token input #2306
Conversation
We accept an array of strings and numbers representing tokens, in addition to the current string-valued prompt or content. This allows direct token input: any special tokens can be handled at the frontend while the JSON data is constructed, before it is sent to the server, so the server does not need to know about or parse special tokens from textual input. With this, we can use the EOS and BOS tokens required by llama-2-chat models.
As an example, use this for llama-2-chat:

$ curl --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt":["[INST] <<SYS>>\nRespond\n<</SYS>>\n\nHi [/INST] Hello!",2,1," [INST] Bye! [/INST]"]}'
I'm not sure exactly why, but latest master + this patch produces random output, and the server reports the prompt to be empty. If the patch is not applied, it works normally. It also doesn't get better with an array prompt. Tested on 70B-chat-q5_K_M from TheBloke. I also tested this on smaller models yesterday, and I'm pretty sure it worked... 🤯
If it worked yesterday, then maybe it's because of the latest changes I requested.
Fixed a typo, please check again.
Yep, it works now :)
I think the tokenization endpoint should not behave as it does right now. It was useful before to get the raw tokens, but now it inserts a
It makes sense, if you pass an array (with custom tokens), not to mangle the input in any way. For a pure-string prompt it makes a bit more sense to do some prep. Right now the first BOS is added automatically, but inter-conversation ones are added manually... If anything, it makes sense for the server to accept a list of strings (a conversation) and add the special tokens and other formatting automatically; that would be a high-level API for conversations.
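As an illustration of that point (a sketch, not something this PR adds): with the array prompt, a client can already stitch a multi-turn llama-2-chat exchange together itself, inserting the EOS and BOS token ids (2 and 1, as in the example above) between turns. The turn texts here are invented for illustration:

$ curl --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt":["[INST] <<SYS>>\nRespond\n<</SYS>>\n\nHi [/INST] Hello!",2,1," [INST] How are you? [/INST] Fine, thanks.",2,1," [INST] Bye! [/INST]"]}'

Each string element is tokenized as-is and each number is passed through as a token id, so the special tokens never have to round-trip through text.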
That's my idea as well. It should actually be checking if there is a space in front as well, because it is only added as a quality measure. But the
Check out llama's tokenizer (sentencepiece), and what we have now with the array prompt. So
I don't know if we should emulate sentencepiece, rather than providing an interface to llama.cpp's tokenizer. Some other models may have a different behaviour as well. I tested OpenLLaMA and there
Since they are fixed constants, the caller of the API can add them when they need them. Knowing what they are requires some information about the model, which we could expose from some endpoint, but they have been pretty much the same in all models anyway.
works |
Given the existing confusion about the tokenizer in issues and PRs (#1501, #1931, #2023, #2310, #2315), it seems best to make the tokenize endpoint compatible with LLaMA's tokenizer or transformers. The current behavior conforms to llama.tokenizer. The completion endpoint, on the other hand, prefixes a BOS under two conditions:
If llama.cpp's tokenizer changes in the future, we can change the server behavior accordingly.
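For reference, the tokenize endpoint can be queried directly to see what the current behavior is. A minimal sketch, assuming the endpoint takes a content field and answers with an array of token ids (field names may differ between server versions):

$ curl --url http://localhost:8080/tokenize \
  --header "Content-Type: application/json" \
  --data '{"content": "Hello world"}'

Comparing the returned ids for inputs with and without a leading space (or a manually added BOS) makes the prefixing behavior discussed above easy to check.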
The Prepending space or |
Just tested it with the Open Assistant fine-tune,
* add: server chat mode with llama2
* fix: remove the unnecessary last \n
We'd better have a way to use token numbers in the stop sequences as well.
Yeah, I think it could work, but the stop checker works at the text level right now, so it would need to be enhanced.
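Purely to illustrate the idea (no such option exists in the server today), token-id stops could hypothetically reuse the same mixed-array convention as the prompt, e.g. stopping on the llama-2 EOS id 2:

$ curl --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt":["[INST] Hi [/INST]"],"stop":[2]}'   # hypothetical: "stop" only accepts strings today

The existing stop checker compares generated text against stop strings, so supporting this would mean matching on generated token ids instead.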
With OpenOrcaxOpenChat-Preview2-13B, it appears that the tokenizer always inserts a space after special tokens (comparing the prompt format from the model card with what the server reports).
So I was puzzled as to why the custom prompt wasn't preventing