⚠️⚠️[T5Tokenize] Fix T5 family tokenizers⚠️⚠️#24565
ArthurZucker merged 12 commits into huggingface:main
The documentation is not available anymore as the PR was closed or merged.
This can also be made non-breaking with a flag. Up for debate, since it is a bug fix.
sgugger left a comment
Thanks for the fix! Let's roll with it since it's a bug fix and if people complain about the breaking change we will see if we add a flag to enable the buggy behavior.
Co-authored-by: Sylvain Gugger <[email protected]>
Edit: just to make sure, I did more testing and unfortunately there is one bug:

```python
>>> tokenizer.tokenize("Hello <extra_id_0>")
['▁', '▁Hello', '<extra_id_0>']
```

instead of

```python
>>> tokenizer.tokenize("Hello <extra_id_0>")
['▁Hello', '<extra_id_0>']
```

This is because we have to prepend a
I'm getting this legacy-behaviour warning when simply loading a T5 tokenizer - it appears even before using the tokenizer. Is there an updated way to load the tokenizer? The warning appears when running the following line of code:

```python
from transformers import AutoTokenizer
```

The error is:
Yep, just set
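Presumably this refers to the `legacy` keyword argument that this PR introduces: passing `legacy=False` opts into the fixed behaviour and silences the warning. A sketch, assuming that flag (the checkpoint name `t5-small` is just an example):

```python
from transformers import AutoTokenizer

# legacy=False selects the fixed tokenization behaviour from this PR
# ("t5-small" is an example checkpoint; it is downloaded on first use)
tokenizer = AutoTokenizer.from_pretrained("t5-small", legacy=False)
```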

What does this PR do?
Fixes the `T5Tokenizer` (not the fast one yet). (At the same time, addresses part of #11531.)

When converting UMT5, I created a reproduction snippet for any t5x model from the original repo. I realized that a very small variation in the input completely changes the output for non-finetuned models. The issue lies in the way we process `<extra_id_xx>`. Example:
The reason is that t5x wraps around sentencepiece and adds the extra ids to the vocab, but they are not saved that way. We don't add them to the vocab, so when we tokenize, we split on special tokens, and thus the sentencepiece model only sees:

The original model never sees a `.` (or a lot of other characters) alone, and thus we add an extra space... This is a bug fix with regards to training; it is breaking in the sense that it should remove the space.
TODO:
- This, for example, would break `tokenizer.encode(". Hello")`, as it removes the prefix space that is normally added.
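A toy illustration of this TODO, using a hypothetical mini-vocab and a greedy matcher rather than the real code: because `▁.` is absent from the vocab, the dummy prefix surfaces as a standalone `▁` piece, and stripping that piece (as the fix does) also removes the prefix space for inputs like `". Hello"`:

```python
VOCAB = {"▁Hello", ".", "▁"}  # hypothetical mini-vocab: no "▁." entry

def toy_sp(text):
    # greedy longest-match stand-in for sentencepiece, with the
    # dummy prefix prepended and spaces rewritten to "▁"
    text = "▁" + text.replace(" ", "▁")
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                pieces.append(text[i:j])
                i = j
                break
        else:
            pieces.append(text[i])  # unknown single character
            i += 1
    return pieces

legacy = toy_sp(". Hello")
print(legacy)  # ['▁', '.', '▁Hello'] - the extra standalone "▁"

# the fix strips that leading "▁"; for ". Hello" this also drops the
# prefix space that used to be added, hence the breaking change
fixed = legacy[1:] if legacy and legacy[0] == "▁" else legacy
print(fixed)   # ['.', '▁Hello']
```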