Inconsistent handling of empty string in tokenizers #6669
Description
Environment info
- transformers version: 3.0.2
- Platform: Linux-5.4.0-42-generic-x86_64-with-debian-buster-sid
- Python version: 3.7.7
- PyTorch version (GPU?): 1.6.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: no (issue with tokenizer)
- Using distributed or parallel set-up in script?: no (issue with tokenizer)
Who can help
Information
I'm encountering inconsistent handling of empty strings with BertTokenizerFast when tokenizing text pairs. Specifically, an error is raised when one string in a pair is empty AND truncation is actually performed under the longest_first strategy. The issue only manifests when truncation occurs: if one string is empty and the other is short enough that no truncation is needed (or both strings are empty), no error is raised (see the example below). I haven't checked other tokenizers to see whether they exhibit the same behavior.
Example
from transformers import BertTokenizerFast

tokz = BertTokenizerFast.from_pretrained('bert-base-uncased')

empty = ''
short = 'the ' * 509  # 509 tokens; [CLS] + [SEP] + 509 + [SEP] = 512 when paired with empty
long = 'the ' * 510   # 510 tokens; any pairing with this exceeds max_length=512

# Case 1: no truncation, no error
tokz(empty, empty, padding=True, truncation='longest_first', return_tensors='pt', max_length=512)

# Case 2: no truncation, no error
tokz(empty, short, padding=True, truncation='longest_first', return_tensors='pt', max_length=512)

# Case 3: truncation, no error
tokz(long, long, padding=True, truncation='longest_first', return_tensors='pt', max_length=512)

# Case 4: truncation, Truncation error
tokz(empty, long, padding=True, truncation='longest_first', return_tensors='pt', max_length=512)

Possible Cause
This appears to be due to logic in the tokenizers package that throws an error if any of the strings has length 0 after truncation.
I assume there are some checks occurring that prevent this code path from being hit in the other cases above, but I wasn't able to identify where.
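To illustrate the suspected failure mode, here is a rough, purely illustrative Python sketch of how a longest_first truncation loop could reproduce the four cases above. This is my own reconstruction, not the actual implementation in the tokenizers package: it only errors when truncation is required and one sequence is already empty, which matches the observed behavior.

```python
def truncate_longest_first(len_a, len_b, max_total):
    """Illustrative sketch (not the real tokenizers code): remove one
    token at a time from the longer sequence until the pair fits in
    max_total. Raises if a sequence is empty while tokens still need
    to be removed -- the suspected source of the Truncation error."""
    to_remove = len_a + len_b - max_total
    if to_remove <= 0:
        # Nothing to truncate, so the empty-sequence check is never
        # reached (cases 1 and 2 above).
        return len_a, len_b
    for _ in range(to_remove):
        if len_a == 0 or len_b == 0:
            # Case 4: truncation needed, but one sequence is empty.
            raise Exception(
                "Truncation error: Specified max length is too low "
                "to respect the various constraints"
            )
        if len_a >= len_b:
            len_a -= 1
        else:
            len_b -= 1
    # Case 3: both sequences non-empty, truncation succeeds.
    return len_a, len_b

# Case 2 analogue: pair already fits, no error.
print(truncate_longest_first(0, 509, 509))  # -> (0, 509)

# Case 4 analogue: truncation needed with an empty sequence -> error.
try:
    truncate_longest_first(0, 510, 509)
except Exception as e:
    print(e)
```

Under this sketch, the inconsistency is simply that the empty-sequence check lives inside the truncation loop, so empty inputs are accepted whenever the pair happens to fit within max_length.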
Stacktrace
Exception Traceback (most recent call last)
<ipython-input-22-dda0aff18100> in <module>
----> 1 tokz('', 'word ' * 510, padding=True, truncation='longest_first', return_tensors='pt', max_length=512)
~/anaconda3/envs/aq/lib/python3.7/site-packages/transformers/tokenization_utils_base.py in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_pretokenized, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
1667 return_length=return_length,
1668 verbose=verbose,
-> 1669 **kwargs,
1670 )
1671
~/anaconda3/envs/aq/lib/python3.7/site-packages/transformers/tokenization_utils_base.py in encode_plus(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_pretokenized, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
1735 return_length=return_length,
1736 verbose=verbose,
-> 1737 **kwargs,
1738 )
1739
~/anaconda3/envs/aq/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py in _encode_plus(self, text, text_pair, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_pretokenized, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
418 return_length=return_length,
419 verbose=verbose,
--> 420 **kwargs,
421 )
422
~/anaconda3/envs/aq/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_pretokenized, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
329 *batch_text_or_text_pairs[0],
330 add_special_tokens=add_special_tokens,
--> 331 is_pretokenized=is_pretokenized,
332 )
333 else:
~/anaconda3/envs/aq/lib/python3.7/site-packages/tokenizers/implementations/base_tokenizer.py in encode(self, sequence, pair, is_pretokenized, add_special_tokens)
210 raise ValueError("encode: `sequence` can't be `None`")
211
--> 212 return self._tokenizer.encode(sequence, pair, is_pretokenized, add_special_tokens)
213
214 def encode_batch(
Exception: Truncation error: Specified max length is too low to respect the various constraints
To reproduce
See example above
Expected behavior
The handling of empty strings (cases 1, 2, and 4) should be consistent: either empty strings are accepted in all cases, or they always result in an error.
edit: grammar