Conversation

@itazap (Collaborator) commented on Nov 27, 2025

tokenization_python.py was always a copy of tokenization_utils.py
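
As context for the consolidation, here is a minimal sketch of how a duplicated module can be collapsed into one canonical file while keeping old imports working. It assumes `tokenization_python.py` is kept as the canonical copy; the shim module and the warning below are illustrative only, not the code merged in this PR:

```python
# tokenization_utils.py (hypothetical compatibility shim, not this PR's actual code)
# Assumes tokenization_python.py holds the real implementation, so existing imports
# such as `from transformers.tokenization_utils import PreTrainedTokenizer` keep working.
import warnings

from .tokenization_python import *  # noqa: F401,F403  (re-export the canonical module)

warnings.warn(
    "`tokenization_utils` is now an alias of `tokenization_python`; "
    "please import from `tokenization_python` directly.",
    FutureWarning,
)
```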

@itazap changed the base branch from main to one_tokenizer on November 27, 2025 at 17:06
@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, barthez, gpt_neox_japanese

@ArthurZucker merged commit 5ce65b8 into one_tokenizer on Nov 27, 2025
19 of 24 checks passed
@ArthurZucker deleted the one_tok_utils branch on November 27, 2025 at 17:35
ArthurZucker added a commit that referenced this pull request Nov 27, 2025
* fixes missed

* gemma test fix

* refactor

* rm legacy from llama

* added renaming

* add _model

* update legacy

* update legacy

* fix docstring

* always load blank, then set _tokenizer if we have it

* new toks

* update all berttokenizer based models

* apply feedback - delete bert duplicates

* more models --> fast only

* more convert_slow models

* fix common test refs

* updating fast only tokenizers

* openai and pegasus

* enable sentencepiecebackend

* more models

* code gen

* t5

* code gen tests

* speecht5

* mbart

* mbart50

* more models

* more models

* layoutlmv2

* update tests

* update tests

* update tests

* pretrainedtokenizer

* whisper

* whisper

* layoutxlm and storing backends

* refactor sentencepiecebackend and additional_special_tokens

* renaming tokenization_utils --> tokenization_python

* update tests

* bert test

* blenderbot

* clip

* codegen

* code_llama

* cohere

* deberta, deberta v2, funnel

* gpt2

* batch update tests

* pegasus qwen2 roberta

* more models

* layout tests

* some renaming

* fix references to utils_fast

* fix refs

* fix refs

* fix refs

* fix refs

* fix refs

* fix refs

* fix refs

* fix some tests

* regression

* fix refs

* fix refs

* missed the most crucial file in my last commit

* fix refs

* fix refs

* fix refs

* batch encode fix

* fix some tests

* BC for batch_decode bc too many refs

* more tests

* fix more tests

* fix for processors

* fixing more models

* deleted mbart50 by accident

* seamless m4t

* albert fix

* whisper

* layout3

* attempt to fix cached tokenizers on CI

* trying another fix on CI

* again try to work around CI

* bertweet

* tapas

* mbart50

* luke

* mluke

* markuplm

* markuplm

* fix some more auto tests

* some random model failures

* mistralcommontester

* more fixes

* ref fix

* siglip

* marian

* plbart

* update utils toks

* seamless m4t

* roc bert

* update byt5 test

* xlm

* esm

* roformer

* code llama

* biogpt

* m2m100

* dpr and flaubert

* xlm and speech to text

* tok backend pass object

* tokenizer object pass

* wav2vec2

* wav2vec2

* cpmant

* update utils tokenizers

* cpmant

* bartpho

* test apply chat template assistant mask

* apply chat template video

* apply chat template assistant mask

* test torch

* update from slow in base and fix donut processor errors

* auto to point to tokenizers backend, fix kosmos2

* some non model fixes for old slow models that no longer have their own tokenizer file as they are the same as bert

* missed file from last commit

* idefics2

* fixup

* fixup

* pretrained tokenizer fast test update

* stash

* bad merge

* cherry pick more stuff that did not merge well

* fix gptsw3

* nit warn for now

* update error raising

* just ran fixup

* bring back bert legacy

* fix

* nit

* fix 56 errors on blenderbotsmall?

* 18 for blenderbotsmall

* tok auto

* missed clip

* fix tests

* something missed

* token healing

* tok common tests update - nonmodel

* try to fix non-model test in test_tokenization_utils

* fix hub tests

* try to fix hub tests

* custom vocab related fixes

* bert jap

* BERT JAP

* rename bert legacy to bert legacy

* Wav2vec2

* fix in tok python to update total vocab size - fixes speech t5

* blender bot small

* forgot test file

* test failures

* marian

* gpt2 tiktoken

* big bird / marian

* udop

* forgot couple changes

* test_serve fix

* missing import

* a couple processors fixes

* style partly

* fix to fetch tests ci

* Revert branch back to commit f5bc69e state

* revert branch to styling

* update mistral after merge

* fixes for non model tests

* some processor test fixes

* more processor test fixes

* more processor fixes

* hub tests

* python tok utils

* fix hub test

* make style for now

* remove problematic fix copies

* python utils/check_copies.py --fix_and_overwrite

* more styling

* fixup

* silence docstring

* fix import?

* fix imports

* add the local test as well

* throw spm error

* llamas

* fix a couple tests

* broke ci

* broke ci

* broke ci

* broke ci

* add logs to debug gemma on ci

* gemma and llama

* gemma

* revert last commit

* gemma debug

* gemma debug

* gemma

* safely import spiece backend

* tok tests

* check none

* setup and qual

* ruff

* del dev files

* tok auto

* fill docstrings

* update auto

* blenderbot small nit

* add migration guide

* move mixtral patch to `TokenizersBackend`, move `TokenizerExtractor`

* rename MistralCommonTokenizer to MistralCommonBackend

* nit

* fix failures

* fixup

* remove one old test

* mark the slow one as slow

* very small fixes

* update auto mapping for missing ones

* fixup lorsd

* fixup doc and stuff

* should be the final fix

* processing update

* update

* FIX or brute AI fix the llava test

* style

* slow?

* fix is offline mode?

* fix mt5

* One tok utils (#42462)

* consolidate python and utils tokenization files, they are copies

* ruff and ref

* Format

* fix cohere

* ?

* up

* am I dumb?

* grumble

---------

Co-authored-by: Arthur <[email protected]>
3outeille pushed a commit that referenced this pull request Nov 28, 2025
ArthurZucker added a commit that referenced this pull request Dec 1, 2025
* remove zero_like + scatter

* fix mixtral moe

* fix other moe models as well

* fix ci

* fix modular mixtral

* fix qwen2_moe + qwen3_next

* fix device mismatch for qwen3_vl_moe to pass tests

* fix modular mixtral

* fix other models

* rm slow tokenizers (#40936)

* [loading/saving] Reverse all loading operations when saving (#42396)

* first shot

* default to reversing

* oupso

* oupsi 2

* oupsi 3

* fix renamed kwargs

* fix timm_wrapper

* remove fix_state_dict methods

* can do it all the time, with __init__ as well

* doc

* oupsi

* fix

* create helper

* fix annoying annotation issue

* small fix

* small fixes

* alright commit all that already

* oupsi

* the fix

* update quantizers

* this works

* the hardcoded regex got me hard....

* style

* the final one

* cleanup a bit

* better

* style

* oupsi readded it

* do it inside the ops instead - no need for full names anymore

* reverse quantizers and simplify signatures

* small thingy

* add no_grad decorator

* utils to rename keys

* oupssii again

* add test

* simplify nicely

* Fix T5 tests: use generation_config for generation parameters (#42419)

* pass the generation parameters to generate()

* fix use_task_specific_params to separate model.config and model.generation_config params

* fix style

* some fixes

* remove redundant check

* update expectation for llama_7b_bf16 on rocm

* Update tests/models/llama/test_modeling_llama.py

Co-authored-by: Rémi Ouazan <[email protected]>

---------

Co-authored-by: Rémi Ouazan <[email protected]>

* linting

* more fix to pass the CI tests

* fix lfm2 moe

* fix docstring

* fix docstring

* fix qwen like model

* fix flex olmo

* revert lfm2 moe config

* make fixup

* fix docstring

* fix conversion mapping

* fix inference of gpt-oss

* add some fixes to gpt-oss (but still not good)

* fix modular

* we need errors I think

* fix config issue

* this was fixed

---------

Co-authored-by: Ita Zaporozhets <[email protected]>
Co-authored-by: Arthur <[email protected]>
Co-authored-by: Cyril Vallez <[email protected]>
Co-authored-by: BADAOUI Abdennacer <[email protected]>
Co-authored-by: Rémi Ouazan <[email protected]>
sarathc-cerebras pushed a commit to sarathc-cerebras/transformers that referenced this pull request Dec 7, 2025
sarathc-cerebras pushed a commit to sarathc-cerebras/transformers that referenced this pull request Dec 7, 2025