
Conversation

@ArthurZucker (Collaborator) commented Nov 20, 2025

What does this PR do?

1. If mistral-common is not installed, we can always convert to tokenizer.json.
2. Don't convert if a tokenizer.json is already present.
3. Fix the tokenizer if it's affected by the regex issue:
```python
In [3]: tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-3.1-24B-Instruct-2503")
The tokenizer you are loading from 'mistralai/Mistral-Small-3.1-24B-Instruct-2503' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e.  This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.

In [4]: tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-3.1-24B-Instruct-2503", fix_mistral_regex=True)
# no warning
In [2]: tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-3.1-24B-Instruct-2503", fix_mistral_regex=False)
# no warning either
In [3]: tok.fix_mistral_regex
Out[3]: False

In [4]: tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-3.1-24B-Instruct-2503")
The tokenizer you are loading from 'mistralai/Mistral-Small-3.1-24B-Instruct-2503' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e.  This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.

In [5]: tok.fix_mistral_regex
Out[5]: False
```

Supersedes #41592 and #41718 for now; fixes #41553, fixes #42283.
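
To summarise the intended behaviour, here is a rough sketch of the load-time flow the three points above describe. It is not the actual transformers code path; every helper in it is a hypothetical stand-in (stubbed so the sketch runs), and only the overall control flow is meant to mirror the PR.

```python
# Hypothetical sketch of the behaviour described above, NOT the real implementation.
import logging
import os

logger = logging.getLogger(__name__)


def is_mistral_common_available() -> bool:
    try:
        import mistral_common  # noqa: F401
        return True
    except ImportError:
        return False


# Hypothetical stand-ins for the real conversion / patching helpers.
def convert_tekken_to_tokenizer_json(local_dir: str) -> None: ...
def load_fast_tokenizer(local_dir: str): ...
def has_broken_regex(tokenizer) -> bool: return False
def patch_pretokenizer_regex(tokenizer) -> None: ...


def load_mistral_tokenizer(local_dir: str, fix_mistral_regex: bool | None = None):
    has_tokenizer_json = os.path.isfile(os.path.join(local_dir, "tokenizer.json"))
    has_tekken_json = os.path.isfile(os.path.join(local_dir, "tekken.json"))

    # (2) Never convert when a tokenizer.json is already present.
    if not has_tokenizer_json and has_tekken_json and not is_mistral_common_available():
        # (1) Without mistral-common, convert tekken.json -> tokenizer.json instead of failing.
        convert_tekken_to_tokenizer_json(local_dir)

    tokenizer = load_fast_tokenizer(local_dir)

    # (3) Repair, or warn about, the known broken pre-tokenization regex.
    if has_broken_regex(tokenizer):
        if fix_mistral_regex:
            patch_pretokenizer_regex(tokenizer)
        elif fix_mistral_regex is None:
            logger.warning("Incorrect regex detected; pass `fix_mistral_regex=True` to fix it.")
    return tokenizer
```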

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker marked this pull request as ready for review November 20, 2025 14:38
@ArthurZucker added the `for patch` label Nov 20, 2025
@Cyrilvallez (Member) left a comment

Nice! Just left a few comments!

Comment on lines 753 to 754:

```diff
-else ("LlamaTokenizer" if is_sentencepiece_available() else None),
-"LlamaTokenizerFast" if is_tokenizers_available() and not is_mistral_common_available() else None,
+else ("PreTrainedTokenizerFast" if is_tokenizers_available() else None),
+"PreTrainedTokenizerFast" if is_tokenizers_available() and not is_mistral_common_available() else None,
```

Usually the pattern is (slow, fast); here it's (fast, fast). Not sure if intended.
Maybe it should instead be:

```python
(None, "MistralCommonTokenizer" if is_mistral_common_available() else ("PreTrainedTokenizerFast" if is_tokenizers_available() else None))
```

so that we never have a slow one anyway?
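
For context, a small sketch of the (slow, fast) convention being discussed. The entries below are illustrative, not the merged mapping, and it assumes the availability helpers are importable from `transformers.utils`.

```python
# Illustrative only: a conventional (slow, fast) entry next to the suggested
# (None, fast) shape for the Mistral family.
from transformers.utils import (
    is_mistral_common_available,
    is_sentencepiece_available,
    is_tokenizers_available,
)

# Conventional pattern: one slow class, one fast class.
llama_entry = (
    "LlamaTokenizer" if is_sentencepiece_available() else None,
    "LlamaTokenizerFast" if is_tokenizers_available() else None,
)

# Suggested shape: no slow class at all; the fast slot prefers
# MistralCommonTokenizer when mistral-common is installed.
mistral_entry = (
    None,
    "MistralCommonTokenizer"
    if is_mistral_common_available()
    else ("PreTrainedTokenizerFast" if is_tokenizers_available() else None),
)

print(llama_entry)
print(mistral_entry)
```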

```python
def is_base_mistral(model_id: str) -> bool:
    model = model_info(model_id)
    if model.tags is not None:
        if re.search("base_model:.*mistralai", "".join(model.tags)):
```
Contributor

Hmm, so that's only for the mistralai org, no? Should we directly check `model_type in ["mistral", ...]` so that it also works for other orgs?

@ArthurZucker (Collaborator, Author)

We can't do that until we download the config / the config is there.
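
To make that trade-off concrete, here is a sketch of the two checks being weighed: the tag-based one needs only Hub metadata, while a model_type check needs config.json to be downloaded first. The set of model types is illustrative, and everything except `model_info` / `AutoConfig` is a sketch rather than the code in this PR.

```python
# Sketch of the two approaches discussed above; illustrative, not the PR's code.
import re

from huggingface_hub import model_info
from transformers import AutoConfig

# Illustrative set; the real check lists the Mistral-family model types.
MISTRAL_MODEL_TYPES = {"mistral", "mistral3", "ministral", "pixtral", "voxtral"}


def is_base_mistral_by_tags(model_id: str) -> bool:
    # Needs only Hub metadata, so it works before any file is downloaded.
    info = model_info(model_id)
    tags = info.tags or []
    return bool(re.search(r"base_model:.*mistralai", " ".join(tags)))


def is_mistral_by_model_type(model_id: str) -> bool:
    # Needs config.json first, which is why it can't run this early in loading.
    config = AutoConfig.from_pretrained(model_id)
    return getattr(config, "model_type", None) in MISTRAL_MODEL_TYPES
```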

@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto

@ArthurZucker merged commit 6940b44 into main Nov 24, 2025
22 of 24 checks passed
@ArthurZucker deleted the fix-mixtral-common-thing branch November 24, 2025 12:16
Cyrilvallez pushed a commit that referenced this pull request Nov 24, 2025
* auto convert tekken.json

* fix conversion

* simplify

* nit

* model info based on the fly fix

* up

* last nit

* fixup

* call it fix mistral regex

* fix behaviour for local or only tok is saved

* style

* rm comment at wrong palce

* fix escaping

* style

* fix backend tokenizer attr to _tokenizer

* update

* up

* update

* fix the last red tests
Comment on lines +2470 to +2478:

```python
if transformers_version and version.parse(transformers_version) <= version.parse("4.57.2"):
    if _is_local and _config.model_type not in [
        "mistral",
        "mistral3",
        "voxstral",
        "ministral",
        "pixtral",
    ]:
        return tokenizer
```
@CISC (Contributor) commented Nov 25, 2025

The use of the non-existent attribute `_config.model_type` is causing massive loading failures everywhere (including CIs); please consider making a hotfix ASAP. :)


Change it to `_config.get("model_type")`.
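
A minimal sketch of that suggested hotfix, assuming `_config` is the plain dict read from config.json at that point (which is why attribute access raises); the wrapper function and its name are only for illustration.

```python
# Illustrative sketch of the suggested fix, not the actual patch.
MISTRAL_MODEL_TYPES = ["mistral", "mistral3", "voxstral", "ministral", "pixtral"]


def should_skip_regex_fix(_config: dict, _is_local: bool) -> bool:
    # dict.get returns None instead of raising when "model_type" is missing,
    # so local non-Mistral configs no longer crash at this check.
    return _is_local and _config.get("model_type") not in MISTRAL_MODEL_TYPES
```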

@ArthurZucker (Collaborator, Author)

yeah sorry

@ArthurZucker (Collaborator, Author)

I have no idea why the CI was full green


Labels

for patch: Tag issues / labels that should be included in the next patch


Development

Successfully merging this pull request may close these issues:

- AutoTokenizer / Process does not support Magistral model
- Bad error message for AutoTokenizer loading Voxtral
