rm slow tokenizers #40936
Conversation
af77c18 to dc0611f
ArthurZucker
left a comment
Nice!
@require_tokenizers
def test_added_token_are_matched_longest_first(self):
    if not self.test_slow_tokenizer:
        self.skipTest(reason="This test is only for slow tokenizers")

    tokenizers = self.get_tokenizers(fast=False)
should be moved to sentencepiece as well
yes it's in test_sentencepiece_backend_mixin.py
| words = ["Wonderful", "no", "inspiration", "example", "with", "subtoken"] | ||
| text = " ".join(words) | ||
| batch_size = 3 | ||
|
|
||
| encoding = tokenizer_r.encode_plus(text, add_special_tokens=False) | ||
|
|
||
| batch_encoding = tokenizer_r([text] * batch_size, add_special_tokens=False) | ||
| num_tokens = len(encoding["input_ids"]) | ||
|
|
||
| last_word_index = len(words) - 1 | ||
| last_token_index = num_tokens - 1 | ||
| last_batch_index = batch_size - 1 | ||
| last_char_index = len(text) - 1 | ||
|
|
||
| # words, tokens | ||
| self.assertEqual(len(encoding.words(0)), num_tokens) | ||
| self.assertEqual(max(encoding.words(0)), last_word_index) | ||
| self.assertEqual(min(encoding.words(0)), 0) | ||
| self.assertEqual(len(batch_encoding.words(last_batch_index)), num_tokens) | ||
| self.assertEqual(max(batch_encoding.words(last_batch_index)), last_word_index) | ||
| self.assertEqual(min(batch_encoding.words(last_batch_index)), 0) | ||
| self.assertEqual(len(encoding.tokens(0)), num_tokens) | ||
|
|
||
| # Assert token_to_word | ||
| self.assertEqual(encoding.token_to_word(0), 0) | ||
| self.assertEqual(encoding.token_to_word(0, 0), 0) | ||
| self.assertEqual(encoding.token_to_word(last_token_index), last_word_index) | ||
| self.assertEqual(encoding.token_to_word(0, last_token_index), last_word_index) | ||
| self.assertEqual(batch_encoding.token_to_word(1, 0), 0) | ||
| self.assertEqual(batch_encoding.token_to_word(0, last_token_index), last_word_index) | ||
| self.assertEqual(batch_encoding.token_to_word(last_batch_index, last_token_index), last_word_index) | ||
|
|
||
| # Assert word_to_tokens | ||
| self.assertEqual(encoding.word_to_tokens(0).start, 0) | ||
| self.assertEqual(encoding.word_to_tokens(0, 0).start, 0) | ||
| self.assertEqual(encoding.word_to_tokens(last_word_index).end, last_token_index + 1) | ||
| self.assertEqual(encoding.word_to_tokens(0, last_word_index).end, last_token_index + 1) | ||
| self.assertEqual(batch_encoding.word_to_tokens(1, 0).start, 0) | ||
| self.assertEqual(batch_encoding.word_to_tokens(0, last_word_index).end, last_token_index + 1) | ||
| self.assertEqual( | ||
| batch_encoding.word_to_tokens(last_batch_index, last_word_index).end, last_token_index + 1 | ||
| ) | ||
|
|
||
| # Assert token_to_chars | ||
| self.assertEqual(encoding.token_to_chars(0).start, 0) | ||
| self.assertEqual(encoding.token_to_chars(0, 0).start, 0) | ||
| self.assertEqual(encoding.token_to_chars(last_token_index).end, last_char_index + 1) | ||
| self.assertEqual(encoding.token_to_chars(0, last_token_index).end, last_char_index + 1) | ||
| self.assertEqual(batch_encoding.token_to_chars(1, 0).start, 0) | ||
| self.assertEqual(batch_encoding.token_to_chars(0, last_token_index).end, last_char_index + 1) | ||
| self.assertEqual( | ||
| batch_encoding.token_to_chars(last_batch_index, last_token_index).end, last_char_index + 1 | ||
| ) |
Indeed, Rust takes care of these by itself; the other part can be tested in the sentencepiece file.
Seems like we only tested tokenizer_r (i.e. the Rust one) here, and spiece / slow never supported token_to_chars, word_to_tokens, etc.
tests/test_tokenization_common.py
Outdated
self.skipTest(
    reason="This test is now in TokenizersBackendTesterMixin - it tests tokenizers-backend API, not transformers code"
)
Indeed, that's for the tokenizers overlay.
# Check the changes
for token in special_tokens_list:
and this can go to trash as well
ArthurZucker
left a comment
A good start:
from tokenizers import Regex, Tokenizer, decoders, normalizers, pre_tokenizers, processors
from tokenizers.models import BPE


def __init__(self, vocab, merges):
    self.tokenizer = Tokenizer(
        BPE(
            vocab=vocab,
            merges=merges,
            dropout=None,
            unk_token=None,
            continuing_subword_prefix="",
            end_of_word_suffix="",
            fuse_unk=False,
            byte_fallback=False,
        )
    )
    self.tokenizer.normalizer = normalizers.NFC()
    self.tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
        [
            pre_tokenizers.Split(
                Regex(
                    r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""
                ),
                behavior="isolated",
                invert=False,
            ),
            pre_tokenizers.ByteLevel(
                add_prefix_space=getattr(self.original_tokenizer, "add_prefix_space", False),
                use_regex=False,
            ),
        ]
    )
    self.tokenizer.decoder = decoders.ByteLevel()
    self.tokenizer.post_processor = processors.ByteLevel(trim_offsets=False)
Ideally I think we can even just do this, without defining the functions separately.
The only upside would have been that we can use modular for less copy pasting, but it's so small that I want to have this explicit, without extra abstraction!
logger.info(
    "Falling back to PreTrainedSentencePieceTokenizer since tokenizer.model file was found "
    "but no config or tokenizer class could be determined."
)
IDK if we want to fall back here! I think if tokenizer.json is not found -> we convert tokenizer.model to tokenizer.json, unless the user enforces sentencepiece.
Enforce by passing something like tokenizer_backend="sentencepiece", for example?
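A sketch of what that could look like (tokenizer_backend is a hypothetical kwarg, only proposed in this thread; the repo id is a placeholder):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-org/my-model", tokenizer_backend="sentencepiece")  # hypothetical kwarg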
def _tokenizer(self) -> Tokenizer:
    return Tokenizer(Unigram(self._vocab_scores, unk_id=self._unk_id(), byte_fallback=True))
Yep, that's good, though I think we might want to abstract:
def _model(self) -> Model:
    return Unigram(...)
    return output

def _decoder(self, replacement=None, add_prefix_space=None):
    return decoders.Sequence([decoders.Replace("▁", " "), decoders.ByteFallback(), decoders.Fuse()])
And then finally a function that shows how we build the final tokenizer. I think we want __init__ to do self.tokenizer = Tokenizer(model=self._model(), decoder=self._decoder(), etc.).
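A minimal sketch of that composition, assuming each backend exposes _model(), _normalizer() and _decoder() hooks as discussed (names are illustrative, not the final transformers API):

from tokenizers import Tokenizer, decoders, normalizers
from tokenizers.models import Unigram


class UnigramBackendSketch:
    def __init__(self, vocab_scores, unk_id=0):
        self._vocab_scores = vocab_scores  # list of (token, score) pairs
        self._unk_id = unk_id
        # compose the final tokenizers.Tokenizer from the per-component hooks
        self.tokenizer = Tokenizer(self._model())
        self.tokenizer.normalizer = self._normalizer()
        self.tokenizer.decoder = self._decoder()

    def _model(self):
        return Unigram(self._vocab_scores, unk_id=self._unk_id, byte_fallback=True)

    def _normalizer(self):
        return normalizers.NFC()

    def _decoder(self):
        return decoders.Sequence([decoders.Replace("▁", " "), decoders.ByteFallback(), decoders.Fuse()])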
| """Tokenizer configuration for this tokenizer.""" | ||
| return Tokenizer(BPE(vocab=self._vocab, merges=self._merges, fuse_unk=True, byte_fallback=True, dropout=None)) | ||
|
|
||
| def _vocab(self): |
should be Initial vocab or something
No legacy in general! (We most probably want to hide non-good defaults.) So the super class will support changing this, but the real Llama tokenizer does not use legacy.
def _normalizer(self):
    """Normalizer configuration for this tokenizer."""
    return normalizers.NFC()
nice
ArthurZucker
left a comment
Second part of the review, very nice work on unbloating already
very nice
self._special_tokens_map["additional_special_tokens"] = []  # BC default to empty list

# Directly set hidden values to allow init with tokens not yet in vocab
for key in list(kwargs.keys()):
We can keep this as a TODO, but with the new logic that was added we already have the self.xxx_token and self.xxx_token_id, so IDK if additional_special_tokens is even useful. Let's leave it for later anyway.
if not isinstance(value, (list, tuple)) or not all(isinstance(t, (str, AddedToken)) for t in value):
    raise ValueError(f"Tokens {value} for key {key} should all be str or AddedToken instances")
new_tokens = [
    (AddedToken(t, rstrip=False, lstrip=False, normalized=False, special=True) if isinstance(t, str) else t)
    for t in value
    if replace_additional_special_tokens or str(t) not in self.additional_special_tokens
]
if replace_additional_special_tokens and new_tokens:
I would kind of want to get rid of this and put it only in spm, because tokenizers just supports tokenizer.special_tokens which gives all special tokens -> duplicated info with the additional special tokens
    return all_toks

seen = set()
all_toks = []
for value in self.special_tokens_map.values():
same here, would leave as abstract and rely on tokenizers's special_tokens attr if we can!
@classmethod
def convert_added_tokens(cls, obj: Union[AddedToken, Any], save=False, add_type_field=True):
    if isinstance(obj, dict) and "__type" in obj and obj["__type"] == "AddedToken":
        obj.pop("__type")
        return AddedToken(**obj)
    if isinstance(obj, AddedToken) and save:
        obj = obj.__getstate__()
        if add_type_field:
            obj["__type"] = "AddedToken"
        else:
            # Don't save "special" for previous tokenizers
            obj.pop("special")
    return obj
I don't remember why we use this one? Only for SPM, no?
) -> BatchEncoding:
    # Input validation (from _call_one)
    def _is_valid_text_input(t):
I think (but I might be wrong here) that tokenizers does the typechecking itself as well
    self.assertEqual(tokens, EXPECTED_TOKENS)

def test_integration_expected_token_ids(self):
    for tok in self.tokenizers:
        self.assertEqual(tok.encode(input_string), expected_token_ids)
this is just missing a decode test
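Something like this sketch would cover it (expected_decoded_text is an assumed fixture, mirroring the encode test above):

def test_integration_expected_decoded_text(self):
    for tok in self.tokenizers:
        self.assertEqual(tok.decode(expected_token_ids), expected_decoded_text)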
overall LGTM!
    str(unk_token): 3,
}

self._merges = merges if merges is not None else generate_merges(self._vocab)
You actually should never generate merges out of the bos/pad/eos/unk tokens! So the merge generation should happen before.
is it all special tokens or just these 4? in convert_slow_tokenizer it currently indexes the vocab[3:]
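A sketch of the idea (generate_merges is the helper referenced in the diff above; exactly which tokens get excluded is the open question in this thread):

def build_merges(vocab, merges, special_tokens):
    # drop special tokens (e.g. unk/bos/eos/pad) before generating merges,
    # so they can never end up inside a merge pair
    base_vocab = {tok: idx for tok, idx in vocab.items() if tok not in special_tokens}
    return merges if merges is not None else generate_merges(base_vocab)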
self.add_tokens(list(self.all_special_tokens), special_tokens=True)
self.update_post_processor()
Both can probably be called from the TokenizerBackend class, wdyt? As in: we are adding the post-processor thing to all of them, and by default special tokens already need to be added?
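For instance, a sketch of hoisting both calls into the shared base class (class and hook names assumed, reusing the two calls from the diff above):

class TokenizersBackend:
    def _post_init(self):
        # every backend registers its special tokens and post-processor the same way
        self.add_tokens(list(self.all_special_tokens), special_tokens=True)
        self.update_post_processor()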
| sub_texts = "".join(sub_texts) | ||
|
|
||
| return sub_texts.replace(SPIECE_UNDERLINE, " ") | ||
| self._post_init() |
you can also just call
self.add_tokens(list(self.all_special_tokens), special_tokens=True)
but adding tokens has historically been done in the super call!
* consolidate python and utils tokenization files, they are copies * ruff and ref * Format
…nto one_tokenizer
* fixes missed * gemma test fix * refactor * rm legacy from llama * added renaming * add _model * update legacy * update legacy * fix docstring * always load blank, then set _tokenizer if we have it * new toks * update all berttokenizer based models * apply feedback - delete bert duplicates * more models --> fast only * more convert_slow models * fix common test refs * updating fast only tokenizers * openai and pegasus * enable sentencepiecebackend * more models * code gen * t5 * code gen tests * speecht5 * mbart * mbart50 * more models * more models * layouglmv2 * update tests * update tests * update tests * pretrainedtokenizer * whisper * whisper * layoutxlm and storing backends * refactor sentencepiecebackend and additional_special_tokens * renaming tokenization_utils --> tokenization_python * udpate tests * bert test * blenderbot * clip * codegen * code_llama * cohere * deberata, deberat v2, funnel * gpt2 * batch update tests * pegasus qwen2 roberta * more models * layout tests * some renaming * fix references to utils_fast * fix refs * fix refs * fix refs * fix refs * fix refs * fix refs * fix refs * fix some tests * regression * fix refs * fix refs * missed the most crucial file in my last commit * fix refs * fix refs * fix refs * batch encode fix * fix some tests * BC for batch_decode bc too many refs * more tests * fix more tests * fix for processors * fixing more models * deleted mbart50 by accident * seamless m4t * albert fix * whisper * layout3 * attempt to fix cached tokenizers on CI * trying another fix on CI * again try to work around CI * bertweet * tapas * mbart50 * luke * mluke * markuplm * markuplm * fix some more auto tests * some random model failures * mistralcommontestser * more fixes * ref fix * siglip * marian * plbart * update utils toks * seamless m4t * roc bert * udpate byt5 test * xlm * esm * roformer * code llama * biogpt * m2m100 * dpr and flaubert * xlm and speech to text * tok backend pass object * tokenizer object pass * wav2vec2 * wav2vec2 * cpmant * update utils tokenizers * cpmant * bartpho * test apply chat template assistant mask * apply chat template video * apply chat template assistant mask * test torch * update from slow in base and fix donut processor errors * auto to point to tokenizers backend, fix kosmos2 * some non model fixes for old slow models that no longer have their own tokenizer file as they are the same as bert * missed file from last commit * idefics2 * fixup * fixup * pretrained tokenizer fast test update * stash * bad merged * cherry pick more stuff that did not merge well * fix gptsw3 * nit warn for now * update error raising * just ran fixup * bring back bert legacy * fix * nit * fix 56 errors on blenderbotsmall? 
* 18 for blenderbotsmall * tok auto * missed clip * fix tests * something missed * token healing * tok common tests update - nonmodel * try to fix non-model test in test_tokenization_utils * fix hub tests * try to fix hub tests * custom vocab related fixed * bert jap * BERT JAP * rename bert legacy to bert legacy * Wav2vec2 * fix in tok python to update total vocab size - fixes speech t5 * blender bot small * forgot test file * test failures * marian * gpt2 tiktoken * big bird / marian * udop * forgot couple changes * test_serve fix * missing import * a couple processors fixes * style partly * fix to fetch tests ci * Revert branch back to commit f5bc69e state * revert branch to styling * update mistral after merge * fixes for non model tests * some processor test fixes * more processor test fixes * more processor fixes * hub tests * python tok utils * fix hub test * make style for now * remove problemattic fic copies * python utils/check_copies.py --fix_and_overwrite * more styling * fixup * silence docstirng * fix import? * fix imports * add the local test as well * throw spm error * llamas * fix a couple tests * broke ci * broke ci * broke ci * broke ci * add logs to debug gemma on ci * gemma and llama * gemma * revert las commit * gemma debug * gemma debug * gemma * safely import spiece backend * tok tests * check none * setup and qual * ruff * del dev files * tok auto * fill docstrings * update auto * blenderbot small nit * add migration guide * move mixtral patch to `TokenizersBackend`, move `TokenizerExtractor` * rename MistralCommonTokenizer to MistralCommonB ackend * nit * fix failures * fixup * remoove one old test * mark the slow one as slow * very small fixes * update auto mapping for missing ones * fixup lorsd * fixup doc and stuff * should be the final fixe * processing update * update * FIX or brute AI fix the llava test * style * slow? * fix is offline mode? * fix mt5 * One tok utils (#42462) * consolidate python and utils tokenization files, they are copies * ruff and ref * Format * fix cohere * ? * up * am I dumbb? * grumble --------- Co-authored-by: Arthur <[email protected]>
* remove zero_like + scatter * fix mixtral moe * fix other moe models as well * fix ci * fix modular mixtral * fix qwen2_moe + qwen3_next * fix device mismatch for qwen3_vl_moe to pass tests * fix modular mixtral * fix other models * rm slow tokenizers (#40936) * fixes missed * gemma test fix * refactor * rm legacy from llama * added renaming * add _model * update legacy * update legacy * fix docstring * always load blank, then set _tokenizer if we have it * new toks * update all berttokenizer based models * apply feedback - delete bert duplicates * more models --> fast only * more convert_slow models * fix common test refs * updating fast only tokenizers * openai and pegasus * enable sentencepiecebackend * more models * code gen * t5 * code gen tests * speecht5 * mbart * mbart50 * more models * more models * layouglmv2 * update tests * update tests * update tests * pretrainedtokenizer * whisper * whisper * layoutxlm and storing backends * refactor sentencepiecebackend and additional_special_tokens * renaming tokenization_utils --> tokenization_python * udpate tests * bert test * blenderbot * clip * codegen * code_llama * cohere * deberata, deberat v2, funnel * gpt2 * batch update tests * pegasus qwen2 roberta * more models * layout tests * some renaming * fix references to utils_fast * fix refs * fix refs * fix refs * fix refs * fix refs * fix refs * fix refs * fix some tests * regression * fix refs * fix refs * missed the most crucial file in my last commit * fix refs * fix refs * fix refs * batch encode fix * fix some tests * BC for batch_decode bc too many refs * more tests * fix more tests * fix for processors * fixing more models * deleted mbart50 by accident * seamless m4t * albert fix * whisper * layout3 * attempt to fix cached tokenizers on CI * trying another fix on CI * again try to work around CI * bertweet * tapas * mbart50 * luke * mluke * markuplm * markuplm * fix some more auto tests * some random model failures * mistralcommontestser * more fixes * ref fix * siglip * marian * plbart * update utils toks * seamless m4t * roc bert * udpate byt5 test * xlm * esm * roformer * code llama * biogpt * m2m100 * dpr and flaubert * xlm and speech to text * tok backend pass object * tokenizer object pass * wav2vec2 * wav2vec2 * cpmant * update utils tokenizers * cpmant * bartpho * test apply chat template assistant mask * apply chat template video * apply chat template assistant mask * test torch * update from slow in base and fix donut processor errors * auto to point to tokenizers backend, fix kosmos2 * some non model fixes for old slow models that no longer have their own tokenizer file as they are the same as bert * missed file from last commit * idefics2 * fixup * fixup * pretrained tokenizer fast test update * stash * bad merged * cherry pick more stuff that did not merge well * fix gptsw3 * nit warn for now * update error raising * just ran fixup * bring back bert legacy * fix * nit * fix 56 errors on blenderbotsmall? 
* 18 for blenderbotsmall * tok auto * missed clip * fix tests * something missed * token healing * tok common tests update - nonmodel * try to fix non-model test in test_tokenization_utils * fix hub tests * try to fix hub tests * custom vocab related fixed * bert jap * BERT JAP * rename bert legacy to bert legacy * Wav2vec2 * fix in tok python to update total vocab size - fixes speech t5 * blender bot small * forgot test file * test failures * marian * gpt2 tiktoken * big bird / marian * udop * forgot couple changes * test_serve fix * missing import * a couple processors fixes * style partly * fix to fetch tests ci * Revert branch back to commit f5bc69e state * revert branch to styling * update mistral after merge * fixes for non model tests * some processor test fixes * more processor test fixes * more processor fixes * hub tests * python tok utils * fix hub test * make style for now * remove problemattic fic copies * python utils/check_copies.py --fix_and_overwrite * more styling * fixup * silence docstirng * fix import? * fix imports * add the local test as well * throw spm error * llamas * fix a couple tests * broke ci * broke ci * broke ci * broke ci * add logs to debug gemma on ci * gemma and llama * gemma * revert las commit * gemma debug * gemma debug * gemma * safely import spiece backend * tok tests * check none * setup and qual * ruff * del dev files * tok auto * fill docstrings * update auto * blenderbot small nit * add migration guide * move mixtral patch to `TokenizersBackend`, move `TokenizerExtractor` * rename MistralCommonTokenizer to MistralCommonB ackend * nit * fix failures * fixup * remoove one old test * mark the slow one as slow * very small fixes * update auto mapping for missing ones * fixup lorsd * fixup doc and stuff * should be the final fixe * processing update * update * FIX or brute AI fix the llava test * style * slow? * fix is offline mode? * fix mt5 * One tok utils (#42462) * consolidate python and utils tokenization files, they are copies * ruff and ref * Format * fix cohere * ? * up * am I dumbb? * grumble --------- Co-authored-by: Arthur <[email protected]> * [loading/saving] Reverse all loading operations when saving (#42396) * first shot * default to reversing * oupso * oupsi 2 * oupsi 3 * fix renamed kwargs * fix timm_wrapper * remove fix_state_dict methods * can do it all the time, with __init__ as well * doc * oupsi * fix * create helper * fix annotation annoying isue * small fix * small fixes * alright commit all that already * oupsi * the fix * update quantizers * this works * the hardcoded regex got me hard.... 
* style * the final one * cleanup a bit * better * style * oupsi readded it * do it inside the ops instead - no need for full names anymore * reverse quantizers and simplify signatures * small thingy * add no_grad decorator * utils to rename keys * oupssii again * add test * simplify nicely * Fix T5 tests: use generation_config for generation parameters (#42419) * pass the generation parameters to generate() * fix use_task_specific_params to separate model.config and model.generation_config params * fix style * some fixes * remove redundant check * update expectation for llama_7b_bf16 on rocm * Update tests/models/llama/test_modeling_llama.py Co-authored-by: Rémi Ouazan <[email protected]> --------- Co-authored-by: Rémi Ouazan <[email protected]> * linting * more fix to pass the CI tests * fix lfm2 moe * fix docstring * fix docstring * fix qwen like model * fix flex olmo * revert lfm2 moe config * make fixup * fix docstring * fix conversion mapping * fix inference of gpt-oss * add some fixes to gpt-oss (but still not good) * fix modular * we need errors I think * fix config issue * this was fixed --------- Co-authored-by: Ita Zaporozhets <[email protected]> Co-authored-by: Arthur <[email protected]> Co-authored-by: Cyril Vallez <[email protected]> Co-authored-by: BADAOUI Abdennacer <[email protected]> Co-authored-by: Rémi Ouazan <[email protected]>
Tokenization
Just as we moved towards a single backend library for model definition, we want Tokenizer to be a lot more intuitive. With v5, you can now initialize an empty LlamaTokenizer and train it directly on your new task! Defining a new tokenizer object should be as simple as this:
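A minimal sketch of what such a definition could look like, built on the 🤗 tokenizers library; TokenizersBackend and the tokenizer_object keyword are assumptions here, and the exact v5 base-class API may differ:

from tokenizers import Tokenizer, decoders, normalizers
from tokenizers.models import BPE

from transformers import TokenizersBackend  # assumed v5 base class name


class Llama5Tokenizer(TokenizersBackend):
    def __init__(self, vocab=None, merges=None, **kwargs):
        # an empty-but-valid BPE definition: the real vocab/merges can be learned later by training
        vocab = vocab if vocab is not None else {"<unk>": 0, "<s>": 1, "</s>": 2}
        merges = merges if merges is not None else []
        tokenizer = Tokenizer(BPE(vocab=vocab, merges=merges, fuse_unk=True, byte_fallback=True, dropout=None))
        tokenizer.normalizer = normalizers.NFC()
        tokenizer.decoder = decoders.ByteLevel()
        super().__init__(tokenizer_object=tokenizer, **kwargs)  # kwarg name assumed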
And now if you call Llama5Tokenizer() you just get an empty, trainable tokenizer that follows the definition of the authors of Llama5 (it does not exist yet 😉). The above is the main motivation for refactoring tokenization: we want people to instantiate a tokenizer just like they would a model, empty or not, and with exactly what they defined.
Non-tokenizers
If your tokenizer is not common, or you just don't want to rely on sentencepiece nor tokenizers, you can just import the PythonBackend (previously PreTrainedTokenizer), which has all the API and logic for added tokens, encoding and decoding with them, etc.
If you want even fewer features, you can use the common PreTrainedTokenizerBase mixin, which mostly defines the transformers tokenizer API: encode, decode, vocab_size, get_vocab, convert_tokens_to_ids, convert_ids_to_tokens, from_pretrained, save_pretrained, etc.
Backend Architecture Changes
Moving away from "slow" vs "fast" tokenizers:
Previously, transformers maintained two parallel implementations for many tokenizers:
Slow tokenizers (tokenization_<model>.py) - Python-based implementations, often using SentencePiece as the backend.
Fast tokenizers (tokenization_<model>_fast.py) - Rust-based implementations using the 🤗 tokenizers library.
In v5, we consolidate to a single tokenizer file per model: tokenization_<model>.py. This file will use the most appropriate backend available: the 🤗 tokenizers library, sentencepiece, or MistralCommon's tokenization library (previously MistralCommonTokenizer).
The AutoTokenizer automatically selects the appropriate backend based on available files and dependencies. This is transparent: you continue to use AutoTokenizer.from_pretrained() as before. This allows transformers to be future-proof and modular, so future backends can easily be supported.
API Changes
1. Direct tokenizer initialization with vocab and merges:
In v5, you can now initialize tokenizers directly with vocabulary and merges, enabling training custom tokenizers from scratch:
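A sketch of what this enables (the tiny BPE vocab/merges below are purely illustrative, and LlamaTokenizer stands in for any v5 tokenizer class accepting vocab and merges):

from transformers import LlamaTokenizer

vocab = {"<unk>": 0, "h": 1, "e": 2, "l": 3, "o": 4, "he": 5, "ll": 6, "hell": 7, "hello": 8}
merges = [("h", "e"), ("l", "l"), ("he", "ll"), ("hell", "o")]
tokenizer = LlamaTokenizer(vocab=vocab, merges=merges)  # no vocab *file* involved
print(tokenizer.tokenize("hello"))  # with this toy vocab, plain BPE merges this back into one token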
But you can no longer pass a vocab file, as that is covered by the from_pretrained use-case.
2. Simplified decoding API:
The batch_decode method has been unified with decode. Both single and batch decoding now use the same method, as sketched below.
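A sketch of the unified behaviour (given any v5 tokenizer instance; the exact decoded strings depend on the tokenizer and its special tokens):

ids = tokenizer("Hello world")["input_ids"]                  # list[int]
batch_ids = tokenizer(["Hello world", "Bye"])["input_ids"]   # list[list[int]]

print(tokenizer.decode(ids))        # single sequence -> a string
print(tokenizer.decode(batch_ids))  # batch -> a list of strings; previously this required batch_decode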
This is mostly because people get list[list[int]] out of generate, but would then call decode (since they had used encode) and get an unexpected result on the nested list.
3. Unified encoding API:
The encode_plus method is deprecated → call the tokenizer directly via __call__ instead, as sketched below.
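For example (a sketch of the equivalent call):

# v4: enc = tokenizer.encode_plus("Hello world", add_special_tokens=False)
# v5: just call the tokenizer
enc = tokenizer("Hello world", add_special_tokens=False)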
4. apply_chat_template returns BatchEncoding:
Previously, apply_chat_template returned input_ids for backward compatibility. In v5, it now consistently returns a BatchEncoding dict like other tokenizer methods, as sketched below.
special_tokens_map.json- special tokens are now stored intokenizer_config.json.added_tokens.json- added tokens are now stored intokenizer.json.added_tokens_decoderis only stored when there is notokenizer.json.When loading older tokenizers, these files are still read for backward compatibility, but new saves use the consolidated format.