Auto convert tekken.json #42299
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Cyrilvallez left a comment
Nice! Just left a few comments!
| else ("LlamaTokenizer" if is_sentencepiece_available() else None), | ||
| "LlamaTokenizerFast" if is_tokenizers_available() and not is_mistral_common_available() else None, | ||
| else ("PreTrainedTokenizerFast" if is_tokenizers_available() else None), | ||
| "PreTrainedTokenizerFast" if is_tokenizers_available() and not is_mistral_common_available() else None, |
Usually the pattern is (slow, fast); here it's (fast, fast), not sure if that's intended.
Maybe it should instead be:
`(None, "MistralCommonTokenizer" if is_mistral_common_available() else ("PreTrainedTokenizerFast" if is_tokenizers_available() else None))`
so that we never have a slow one anyway?
```python
def is_base_mistral(model_id: str) -> bool:
    model = model_info(model_id)
    if model.tags is not None:
        if re.search("base_model:.*mistralai", "".join(model.tags)):
```
Hmm, so that's only for the mistral org, no? Should we directly check `model_type in ["mistral", ...]` so that it also works for other orgs?
We can't do that until we download the config / config is there
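As a rough sketch of the alternative being discussed (checking the architecture rather than the org), something like the following only becomes possible once the config has been fetched; the helper name and the exact list of model types are assumptions, not code from this PR:

```python
# Hypothetical helper: decide based on config.model_type, which requires
# downloading/loading the config first.
from transformers import AutoConfig

MISTRAL_MODEL_TYPES = {"mistral", "mistral3", "ministral", "pixtral", "voxstral"}


def uses_mistral_architecture(model_id: str) -> bool:
    config = AutoConfig.from_pretrained(model_id)
    return getattr(config, "model_type", None) in MISTRAL_MODEL_TYPES
```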
[For maintainers] Suggested jobs to run (before merge): run-slow: auto
* auto convert tekken.json
* fix conversion
* simplify
* nit
* model info based on the fly fix
* up
* last nit
* fixup
* call it fix mistral regex
* fix behaviour for local or only tok is saved
* style
* rm comment at wrong palce
* fix escaping
* style
* fix backend tokenizer attr to _tokenizer
* update
* up
* update
* fix the last red tests
```python
if transformers_version and version.parse(transformers_version) <= version.parse("4.57.2"):
    if _is_local and _config.model_type not in [
        "mistral",
        "mistral3",
        "voxstral",
        "ministral",
        "pixtral",
    ]:
        return tokenizer
```
The use of the non-existent attribute `_config.model_type` is causing massive loading failures everywhere (including in CI), please consider making a hotfix ASAP. :)
Change it to `_config.get("model_type")`
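Since `_config` at this point is a plain dict rather than a config object, attribute access raises. A minimal sketch of the proposed fix follows; the helper name and wrapping are illustrative, not the actual code in the PR:

```python
# Illustrative only: dict.get returns None instead of raising when the key is missing.
MISTRAL_MODEL_TYPES = ["mistral", "mistral3", "voxstral", "ministral", "pixtral"]


def should_return_tokenizer_as_is(_config: dict, _is_local: bool) -> bool:
    # Mirrors the guard above, but works when _config is a dict loaded from config.json.
    return _is_local and _config.get("model_type") not in MISTRAL_MODEL_TYPES
```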
yeah sorry
I have no idea why the CI was full green
What does this PR do?
If `mistral-common` is not installed, we can always convert to `tokenizer.json`.
Supersedes #41592 and #41718 for now; fixes #41553, fixes #42283.
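As a hedged usage sketch of the intended behaviour (the repo id below is just an example and not taken from the PR): loading a checkpoint that only ships a `tekken.json`, in an environment without `mistral-common`, should now transparently yield a fast tokenizer.

```python
from transformers import AutoTokenizer

# Example repo id; any Mistral-family checkpoint distributing only tekken.json applies.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Small-3.1-24B-Instruct-2503")
print(type(tok).__name__)  # expected to be a fast (tokenizers-backed) tokenizer class
```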