⚠️⚠️[T5Tokenize] Fix T5 family tokenizers⚠️⚠️#24565
ArthurZucker merged 12 commits into huggingface:main
The documentation is not available anymore as the PR was closed or merged.
This can also be made non-breaking with a flag. Up for debate, since it is a bug fix.
sgugger left a comment
Thanks for the fix! Let's roll with it since it's a bug fix and if people complain about the breaking change we will see if we add a flag to enable the buggy behavior.
Co-authored-by: Sylvain Gugger <[email protected]>
Edit: just to make sure, I did more testing and unfortunately there is one bug:

```python
>>> tokenizer.tokenize("Hello <extra_id_0>")
['▁', '▁Hello', '<extra_id_0>']
```

instead of

```python
>>> tokenizer.tokenize("Hello <extra_id_0>")
['▁Hello', '<extra_id_0>']
```

This is because we have to prepend a
I'm getting this legacy-behaviour warning when simply loading a T5 tokenizer - it appears even before using the tokenizer. Is there an updated way to load the tokenizer? The warning appears when running the following line of code:

```python
from transformers import AutoTokenizer
```

The error is:
Yep, just set
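Presumably this refers to the `legacy` keyword argument that this PR introduces: passing `legacy=False` opts into the fixed behaviour and silences the warning. A sketch, assuming that flag (the checkpoint name `t5-small` is just an example):

```python
from transformers import AutoTokenizer

# legacy=False selects the fixed tokenization behaviour from this PR
# ("t5-small" is an example checkpoint; it is downloaded on first use)
tokenizer = AutoTokenizer.from_pretrained("t5-small", legacy=False)
```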

What does this PR do?
Fixes the `T5Tokenizer` (not the fast one yet). (At the same time, addresses part of #11531.)

When converting UMT5, I created a reproduction snippet for any t5x model from the original repo. I realized that a very small variation in the input completely changes the output for non-finetuned models. The issue lies in the way we process `<extra_id_xx>`. Example:
The reason is that t5x wraps around sentencepiece and adds the extra ids to the vocab, but they are not saved that way. We don't add them to the vocab, so when we tokenize, we split on special tokens, and thus the sentencepiece model only sees:

The original model never sees a `.` (or a lot of other characters) alone, and thus we add an extra space... This is a bug fix with regards to training; it is breaking in the sense that it should remove the space.
TODO:
- This, for example, would break `tokenizer.encode(". Hello")`, as it removes the prefix space that is normally added.
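A toy illustration of this TODO, using a hypothetical mini-vocab and a greedy matcher rather than the real code: because `▁.` is absent from the vocab, the dummy prefix surfaces as a standalone `▁` piece, and stripping that piece (as the fix does) also removes the prefix space for inputs like `". Hello"`:

```python
VOCAB = {"▁Hello", ".", "▁"}  # hypothetical mini-vocab: no "▁." entry

def toy_sp(text):
    # greedy longest-match stand-in for sentencepiece, with the
    # dummy prefix prepended and spaces rewritten to "▁"
    text = "▁" + text.replace(" ", "▁")
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                pieces.append(text[i:j])
                i = j
                break
        else:
            pieces.append(text[i])  # unknown single character
            i += 1
    return pieces

legacy = toy_sp(". Hello")
print(legacy)  # ['▁', '.', '▁Hello'] - the extra standalone "▁"

# the fix strips that leading "▁"; for ". Hello" this also drops the
# prefix space that used to be added, hence the breaking change
fixed = legacy[1:] if legacy and legacy[0] == "▁" else legacy
print(fixed)   # ['.', '▁Hello']
```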