Environment info
- `transformers` version: current master
- Platform: MacOS
- Python version: 3.7
Information
It seems that passing pre-tokenized input to the tokenizer with `is_pretokenized=True` does not prevent the tokenizer from tokenizing the input a second time. This issue already came up in #6046, and the cause appears to be #6573. For slow tokenizers, a workaround is to set `is_pretokenized=False`.
What hasn't been reported yet is that the same behavior also occurs with fast tokenizers, and for those there is no workaround (or at least I haven't found one): setting `is_pretokenized=False` instead raises a `ValueError`.
To reproduce
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased", use_fast=True)

text = "Schartau sagte dem Tagesspiegel, dass Fischer ein Idiot ist"
pretokenized_text = ['Schar', '##tau', 'sagte', 'dem', 'Tages', '##spiegel', ',', 'dass', 'Fischer', 'ein', 'Id', '##iot', 'ist']

tokenized = tokenizer.encode(text)
# returns list of len 15 -> 13 tokens + 2 special tokens
pretokenized_tok = tokenizer.encode(pretokenized_text, is_pretokenized=True)
# returns list of len 23 -> too large
pretokenized_tok_2 = tokenizer.encode(pretokenized_text, is_pretokenized=False)
# returns list of len 15 -> 13 tokens + 2 special tokens

fast_tokenized = fast_tokenizer.encode(text)
# returns list of len 15 -> 13 tokens + 2 special tokens
fast_pretokenized_tok = fast_tokenizer.encode(pretokenized_text, is_pretokenized=True)
# returns list of len 23 -> too large
# fast_pretokenized_tok_2 = fast_tokenizer.encode(pretokenized_text, is_pretokenized=False)
# would raise: ValueError: TextInputSequence must be str

tokenized_decoded = tokenizer.decode(tokenized)
# returns '[CLS] Schartau sagte dem Tagesspiegel, dass Fischer ein Idiot ist [SEP]'
pretokenized_tok_decoded = tokenizer.decode(pretokenized_tok)
# returns '[CLS] Schar # # tau sagte dem Tages # # spiegel, dass Fischer ein Id # # iot ist [SEP]'
pretokenized_tok_2_decoded = tokenizer.decode(pretokenized_tok_2)
# returns '[CLS] Schartau sagte dem Tagesspiegel, dass Fischer ein Idiot ist [SEP]'
fast_tokenized_decoded = fast_tokenizer.decode(fast_tokenized)
# returns '[CLS] Schartau sagte dem Tagesspiegel, dass Fischer ein Idiot ist [SEP]'
fast_pretokenized_tok_decoded = fast_tokenizer.decode(fast_pretokenized_tok)
# returns '[CLS] Schar # # tau sagte dem Tages # # spiegel, dass Fischer ein Id # # iot ist [SEP]'
```
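The length inflation can be reproduced without `transformers` at all. The following toy sketch of BERT-style tokenization (basic punctuation splitting followed by greedy WordPiece, over a made-up mini vocabulary; this is illustrative code, not the library's implementation) shows why a sub-word piece like `##tau`, fed back in as if it were a word, decays into `#`, `#`, `tau`:

```python
import string

# Toy BERT-style tokenizer: basic split + greedy WordPiece over a tiny,
# made-up vocabulary. Illustrative only, but it reproduces the failure
# mode observed above.
VOCAB = {"Schar", "##tau", "#", "tau"}

def basic_tokenize(text):
    """Split on whitespace, then break punctuation chars into separate tokens."""
    out = []
    for word in text.split():
        buf = ""
        for ch in word:
            if ch in string.punctuation:
                if buf:
                    out.append(buf)
                    buf = ""
                out.append(ch)  # each punctuation char becomes its own token
            else:
                buf += ch
        if buf:
            out.append(buf)
    return out

def wordpiece(word):
    """Greedy longest-match WordPiece; '##' marks word-internal pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in VOCAB:
                match = sub
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        pieces.append(match)
        start = end
    return pieces

def tokenize(text):
    return [piece for word in basic_tokenize(text) for piece in wordpiece(word)]

# Tokenizing the raw word works as expected:
print(tokenize("Schartau"))  # ['Schar', '##tau']

# Feeding the resulting pieces back in as "pre-tokenized words" re-runs
# basic tokenization on '##tau', which first splits off the '#' chars:
print([p for w in ["Schar", "##tau"] for p in tokenize(w)])
# ['Schar', '#', '#', 'tau'] -> 4 tokens instead of 2
```

This also suggests why `is_pretokenized` is not the right flag here: it is meant for input that is split into words (skipping only the whitespace split), not into sub-word pieces. Input that is already WordPiece-tokenized can be mapped directly with `tokenizer.convert_tokens_to_ids(pretokenized_text)`, which performs no further tokenization.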