Conversation

@itazap (Collaborator) commented on Nov 27, 2025

tokenization_python.py was always a copy of tokenization_utils.py
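
As context for the consolidation, here is a minimal sketch of how a duplicated module can be collapsed into one canonical file while keeping old imports working. It assumes `tokenization_python.py` is kept as the canonical copy; the shim module and the warning below are illustrative only, not the code merged in this PR:

```python
# tokenization_utils.py (hypothetical compatibility shim, not this PR's actual code)
# Assumes tokenization_python.py holds the real implementation, so existing imports
# such as `from transformers.tokenization_utils import PreTrainedTokenizer` keep working.
import warnings

from .tokenization_python import *  # noqa: F401,F403  (re-export the canonical module)

warnings.warn(
    "`tokenization_utils` is now an alias of `tokenization_python`; "
    "please import from `tokenization_python` directly.",
    FutureWarning,
)
```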

@itazap changed the base branch from main to one_tokenizer on November 27, 2025 at 17:06
@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, barthez, gpt_neox_japanese

@ArthurZucker merged commit 5ce65b8 into one_tokenizer on Nov 27, 2025
19 of 24 checks passed
@ArthurZucker deleted the one_tok_utils branch on November 27, 2025 at 17:35
ArthurZucker added a commit that referenced this pull request Nov 27, 2025
* fixes missed

* gemma test fix

* refactor

* rm legacy from llama

* added renaming

* add _model

* update legacy

* update legacy

* fix docstring

* always load blank, then set _tokenizer if we have it

* new toks

* update all berttokenizer based models

* apply feedback - delete bert duplicates

* more models --> fast only

* more convert_slow models

* fix common test refs

* updating fast only tokenizers

* openai and pegasus

* enable sentencepiecebackend

* more models

* code gen

* t5

* code gen tests

* speecht5

* mbart

* mbart50

* more models

* more models

* layoutlmv2

* update tests

* update tests

* update tests

* pretrainedtokenizer

* whisper

* whisper

* layoutxlm and storing backends

* refactor sentencepiecebackend and additional_special_tokens

* renaming tokenization_utils --> tokenization_python

* update tests

* bert test

* blenderbot

* clip

* codegen

* code_llama

* cohere

* deberta, deberta v2, funnel

* gpt2

* batch update tests

* pegasus qwen2 roberta

* more models

* layout tests

* some renaming

* fix references to utils_fast

* fix refs

* fix refs

* fix refs

* fix refs

* fix refs

* fix refs

* fix refs

* fix some tests

* regression

* fix refs

* fix refs

* missed the most crucial file in my last commit

* fix refs

* fix refs

* fix refs

* batch encode fix

* fix some tests

* BC for batch_decode bc too many refs

* more tests

* fix more tests

* fix for processors

* fixing more models

* deleted mbart50 by accident

* seamless m4t

* albert fix

* whisper

* layout3

* attempt to fix cached tokenizers on CI

* trying another fix on CI

* again try to work around CI

* bertweet

* tapas

* mbart50

* luke

* mluke

* markuplm

* markuplm

* fix some more auto tests

* some random model failures

* mistralcommontester

* more fixes

* ref fix

* siglip

* marian

* plbart

* update utils toks

* seamless m4t

* roc bert

* update byt5 test

* xlm

* esm

* roformer

* code llama

* biogpt

* m2m100

* dpr and flaubert

* xlm and speech to text

* tok backend pass object

* tokenizer object pass

* wav2vec2

* wav2vec2

* cpmant

* update utils tokenizers

* cpmant

* bartpho

* test apply chat template assistant mask

* apply chat template video

* apply chat template assistant mask

* test torch

* update from slow in base and fix donut processor errors

* auto to point to tokenizers backend, fix kosmos2

* some non model fixes for old slow models that no longer have their own tokenizer file as they are the same as bert

* missed file from last commit

* idefics2

* fixup

* fixup

* pretrained tokenizer fast test update

* stash

* bad merge

* cherry pick more stuff that did not merge well

* fix gptsw3

* nit warn for now

* update error raising

* just ran fixup

* bring back bert legacy

* fix

* nit

* fix 56 errors on blenderbotsmall?

* 18 for blenderbotsmall

* tok auto

* missed clip

* fix tests

* something missed

* token healing

* tok common tests update - nonmodel

* try to fix non-model test in test_tokenization_utils

* fix hub tests

* try to fix hub tests

* custom vocab related fixes

* bert jap

* BERT JAP

* rename bert legacy to bert legacy

* Wav2vec2

* fix in tok python to update total vocab size - fixes speech t5

* blender bot small

* forgot test file

* test failures

* marian

* gpt2 tiktoken

* big bird / marian

* udop

* forgot couple changes

* test_serve fix

* missing import

* a couple processors fixes

* style partly

* fix to fetch tests ci

* Revert branch back to commit f5bc69e state

* revert branch to styling

* update mistral after merge

* fixes for non model tests

* some processor test fixes

* more processor test fixes

* more processor fixes

* hub tests

* python tok utils

* fix hub test

* make style for now

* remove problematic fix copies

* python utils/check_copies.py --fix_and_overwrite

* more styling

* fixup

* silence docstring

* fix import?

* fix imports

* add the local test as well

* throw spm error

* llamas

* fix a couple tests

* broke ci

* broke ci

* broke ci

* broke ci

* add logs to debug gemma on ci

* gemma and llama

* gemma

* revert last commit

* gemma debug

* gemma debug

* gemma

* safely import spiece backend

* tok tests

* check none

* setup and qual

* ruff

* del dev files

* tok auto

* fill docstrings

* update auto

* blenderbot small nit

* add migration guide

* move mixtral patch to `TokenizersBackend`, move `TokenizerExtractor`

* rename MistralCommonTokenizer to MistralCommonBackend

* nit

* fix failures

* fixup

* remove one old test

* mark the slow one as slow

* very small fixes

* update auto mapping for missing ones

* fixup lorsd

* fixup doc and stuff

* should be the final fix

* processing update

* update

* FIX or brute AI fix the llava test

* style

* slow?

* fix is offline mode?

* fix mt5

* One tok utils (#42462)

* consolidate python and utils tokenization files, they are copies

* ruff and ref

* Format

* fix cohere

* ?

* up

* am I dumb?

* grumble

---------

Co-authored-by: Arthur <[email protected]>
3outeille pushed a commit that referenced this pull request Nov 28, 2025
ArthurZucker added a commit that referenced this pull request Dec 1, 2025
* remove zero_like + scatter

* fix mixtral moe

* fix other moe models as well

* fix ci

* fix modular mixtral

* fix qwen2_moe + qwen3_next

* fix device mismatch for qwen3_vl_moe to pass tests

* fix modular mixtral

* fix other models

* rm slow tokenizers (#40936)

* [loading/saving] Reverse all loading operations when saving (#42396)

* first shot

* default to reversing

* oupso

* oupsi 2

* oupsi 3

* fix renamed kwargs

* fix timm_wrapper

* remove fix_state_dict methods

* can do it all the time, with __init__ as well

* doc

* oupsi

* fix

* create helper

* fix annoying annotation issue

* small fix

* small fixes

* alright commit all that already

* oupsi

* the fix

* update quantizers

* this works

* the hardcoded regex got me hard....

* style

* the final one

* cleanup a bit

* better

* style

* oupsi readded it

* do it inside the ops instead - no need for full names anymore

* reverse quantizers and simplify signatures

* small thingy

* add no_grad decorator

* utils to rename keys

* oupssii again

* add test

* simplify nicely

* Fix T5 tests: use generation_config for generation parameters (#42419)

* pass the generation parameters to generate()

* fix use_task_specific_params to separate model.config and model.generation_config params

* fix style

* some fixes

* remove redundant check

* update expectation for llama_7b_bf16 on rocm

* Update tests/models/llama/test_modeling_llama.py

Co-authored-by: Rémi Ouazan <[email protected]>

---------

Co-authored-by: Rémi Ouazan <[email protected]>

* linting

* more fix to pass the CI tests

* fix lfm2 moe

* fix docstring

* fix docstring

* fix qwen like model

* fix flex olmo

* revert lfm2 moe config

* make fixup

* fix docstring

* fix conversion mapping

* fix inference of gpt-oss

* add some fixes to gpt-oss (but still not good)

* fix modular

* we need errors I think

* fix config issue

* this was fixed

---------

Co-authored-by: Ita Zaporozhets <[email protected]>
Co-authored-by: Arthur <[email protected]>
Co-authored-by: Cyril Vallez <[email protected]>
Co-authored-by: BADAOUI Abdennacer <[email protected]>
Co-authored-by: Rémi Ouazan <[email protected]>
sarathc-cerebras pushed a commit to sarathc-cerebras/transformers that referenced this pull request Dec 7, 2025
sarathc-cerebras pushed a commit to sarathc-cerebras/transformers that referenced this pull request Dec 7, 2025