
Conversation

@ArthurZucker (Collaborator) commented Nov 20, 2025

What does this PR do?

1. If mistral-common is not installed, we can always convert to tokenizer.json.
2. Don't convert if a tokenizer.json is already present.
3. Fix the tokenizer if it's affected by the regex issue:
```python
In [3]: tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-3.1-24B-Instruct-2503")
The tokenizer you are loading from 'mistralai/Mistral-Small-3.1-24B-Instruct-2503' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e.  This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.

In [4]: tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-3.1-24B-Instruct-2503", fix_mistral_regex=True)
# no warning
In [2]: tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-3.1-24B-Instruct-2503", fix_mistral_regex=False)
# no warning either
In [3]: tok.fix_mistral_regex
Out[3]: False

In [4]: tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-3.1-24B-Instruct-2503")
The tokenizer you are loading from 'mistralai/Mistral-Small-3.1-24B-Instruct-2503' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e.  This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.

In [5]: tok.fix_mistral_regex
Out[5]: False
```

Supersedes #41592 and #41718 for now; fixes #41553, fixes #42283.
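
To summarise the intended behaviour, here is a rough sketch of the load-time flow the three points above describe. It is not the actual transformers code path; every helper in it is a hypothetical stand-in (stubbed so the sketch runs), and only the overall control flow is meant to mirror the PR.

```python
# Hypothetical sketch of the behaviour described above, NOT the real implementation.
import logging
import os

logger = logging.getLogger(__name__)


def is_mistral_common_available() -> bool:
    try:
        import mistral_common  # noqa: F401
        return True
    except ImportError:
        return False


# Hypothetical stand-ins for the real conversion / patching helpers.
def convert_tekken_to_tokenizer_json(local_dir: str) -> None: ...
def load_fast_tokenizer(local_dir: str): ...
def has_broken_regex(tokenizer) -> bool: return False
def patch_pretokenizer_regex(tokenizer) -> None: ...


def load_mistral_tokenizer(local_dir: str, fix_mistral_regex: bool | None = None):
    has_tokenizer_json = os.path.isfile(os.path.join(local_dir, "tokenizer.json"))
    has_tekken_json = os.path.isfile(os.path.join(local_dir, "tekken.json"))

    # (2) Never convert when a tokenizer.json is already present.
    if not has_tokenizer_json and has_tekken_json and not is_mistral_common_available():
        # (1) Without mistral-common, convert tekken.json -> tokenizer.json instead of failing.
        convert_tekken_to_tokenizer_json(local_dir)

    tokenizer = load_fast_tokenizer(local_dir)

    # (3) Repair, or warn about, the known broken pre-tokenization regex.
    if has_broken_regex(tokenizer):
        if fix_mistral_regex:
            patch_pretokenizer_regex(tokenizer)
        elif fix_mistral_regex is None:
            logger.warning("Incorrect regex detected; pass `fix_mistral_regex=True` to fix it.")
    return tokenizer
```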

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker marked this pull request as ready for review November 20, 2025 14:38
@ArthurZucker added the `for patch` label Nov 20, 2025
@Cyrilvallez (Member) left a comment

Nice! Just left a few comments!

Comment on lines 753 to 754:

```diff
-else ("LlamaTokenizer" if is_sentencepiece_available() else None),
-"LlamaTokenizerFast" if is_tokenizers_available() and not is_mistral_common_available() else None,
+else ("PreTrainedTokenizerFast" if is_tokenizers_available() else None),
+"PreTrainedTokenizerFast" if is_tokenizers_available() and not is_mistral_common_available() else None,
```

Usually the pattern is (slow, fast); here it's (fast, fast). Not sure if intended.
Maybe it should instead be:

```python
(None, "MistralCommonTokenizer" if is_mistral_common_available() else ("PreTrainedTokenizerFast" if is_tokenizers_available() else None))
```

so that we never have a slow one anyway?
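
For context, a small sketch of the (slow, fast) convention being discussed. The entries below are illustrative, not the merged mapping, and it assumes the availability helpers are importable from `transformers.utils`.

```python
# Illustrative only: a conventional (slow, fast) entry next to the suggested
# (None, fast) shape for the Mistral family.
from transformers.utils import (
    is_mistral_common_available,
    is_sentencepiece_available,
    is_tokenizers_available,
)

# Conventional pattern: one slow class, one fast class.
llama_entry = (
    "LlamaTokenizer" if is_sentencepiece_available() else None,
    "LlamaTokenizerFast" if is_tokenizers_available() else None,
)

# Suggested shape: no slow class at all; the fast slot prefers
# MistralCommonTokenizer when mistral-common is installed.
mistral_entry = (
    None,
    "MistralCommonTokenizer"
    if is_mistral_common_available()
    else ("PreTrainedTokenizerFast" if is_tokenizers_available() else None),
)

print(llama_entry)
print(mistral_entry)
```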

```python
def is_base_mistral(model_id: str) -> bool:
    model = model_info(model_id)
    if model.tags is not None:
        if re.search("base_model:.*mistralai", "".join(model.tags)):
```
Contributor

Hmm, so that's only for the mistralai org, no? Should we directly check `model_type in ["mistral", ...]` so that it also works for other orgs?

@ArthurZucker (Collaborator, Author)

We can't do that until we download the config / the config is there.
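
To make that trade-off concrete, here is a sketch of the two checks being weighed: the tag-based one needs only Hub metadata, while a model_type check needs config.json to be downloaded first. The set of model types is illustrative, and everything except `model_info` / `AutoConfig` is a sketch rather than the code in this PR.

```python
# Sketch of the two approaches discussed above; illustrative, not the PR's code.
import re

from huggingface_hub import model_info
from transformers import AutoConfig

# Illustrative set; the real check lists the Mistral-family model types.
MISTRAL_MODEL_TYPES = {"mistral", "mistral3", "ministral", "pixtral", "voxtral"}


def is_base_mistral_by_tags(model_id: str) -> bool:
    # Needs only Hub metadata, so it works before any file is downloaded.
    info = model_info(model_id)
    tags = info.tags or []
    return bool(re.search(r"base_model:.*mistralai", " ".join(tags)))


def is_mistral_by_model_type(model_id: str) -> bool:
    # Needs config.json first, which is why it can't run this early in loading.
    config = AutoConfig.from_pretrained(model_id)
    return getattr(config, "model_type", None) in MISTRAL_MODEL_TYPES
```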

@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

@ArthurZucker merged commit 6940b44 into main Nov 24, 2025
22 of 24 checks passed
@ArthurZucker deleted the fix-mixtral-common-thing branch November 24, 2025 12:16
Cyrilvallez pushed a commit that referenced this pull request Nov 24, 2025
* auto convert tekken.json

* fix conversion

* simplify

* nit

* model info based on the fly fix

* up

* last nit

* fixup

* call it fix mistral regex

* fix behaviour for local or only tok is saved

* style

* rm comment at wrong palce

* fix escaping

* style

* fix backend tokenizer attr to _tokenizer

* update

* up

* update

* fix the last red tests
Comment on lines +2470 to +2478:

```python
if transformers_version and version.parse(transformers_version) <= version.parse("4.57.2"):
    if _is_local and _config.model_type not in [
        "mistral",
        "mistral3",
        "voxstral",
        "ministral",
        "pixtral",
    ]:
        return tokenizer
```
@CISC (Contributor) commented Nov 25, 2025

The use of the non-existent attribute `_config.model_type` is causing massive loading failures everywhere (including CIs); please consider making a hotfix ASAP. :)


Change it to `_config.get("model_type")`.
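
A minimal sketch of that suggested hotfix, assuming `_config` is the plain dict read from config.json at that point (which is why attribute access raises); the wrapper function and its name are only for illustration.

```python
# Illustrative sketch of the suggested fix, not the actual patch.
MISTRAL_MODEL_TYPES = ["mistral", "mistral3", "voxstral", "ministral", "pixtral"]


def should_skip_regex_fix(_config: dict, _is_local: bool) -> bool:
    # dict.get returns None instead of raising when "model_type" is missing,
    # so local non-Mistral configs no longer crash at this check.
    return _is_local and _config.get("model_type") not in MISTRAL_MODEL_TYPES
```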

@ArthurZucker (Collaborator, Author)

yeah sorry

@ArthurZucker (Collaborator, Author)

I have no idea why the CI was full green


Labels

for patch: Tag issues / labels that should be included in the next patch


Development

Successfully merging this pull request may close these issues:

- AutoTokenizer / Process does not support Magistral model
- Bad error message for AutoTokenizer loading Voxtral
