
Inconsistent handling of empty string in tokenizers #6669

@thomlake

Description

Environment info

  • transformers version: 3.0.2
  • Platform: Linux-5.4.0-42-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.7
  • PyTorch version (GPU?): 1.6.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: no (issue with tokenizer)
  • Using distributed or parallel set-up in script?: no (issue with tokenizer)

Who can help

@mfuntowicz

Information

I'm encountering inconsistent handling of empty strings with BertTokenizerFast when tokenizing text pairs. Specifically, an error is raised when one string in a text pair is empty AND truncation is performed using the longest_first strategy. The issue only manifests when truncation actually occurs: if one of the strings is empty and the other is short enough that no truncation is needed (or both strings are empty), no error is raised (see the example below). I haven't checked other tokenizers to see whether they exhibit similar behavior.

Example

from transformers import BertTokenizerFast

tokz = BertTokenizerFast.from_pretrained('bert-base-uncased')

empty = ''
short = 'the ' * 509
long = 'the ' * 510

# Case 1: no truncation, no error
tokz(empty, empty, padding=True, truncation='longest_first', return_tensors='pt', max_length=512)

# Case 2: no truncation, no error
tokz(empty, short, padding=True, truncation='longest_first', return_tensors='pt', max_length=512)

# Case 3: truncation, no error
tokz(long, long, padding=True, truncation='longest_first', return_tensors='pt', max_length=512)

# Case 4: truncation, Truncation error
tokz(empty, long, padding=True, truncation='longest_first', return_tensors='pt', max_length=512)

Possible Cause

This appears to be due to logic in the tokenizers package that raises an error if either sequence has length 0 after truncation.

https://github.com/huggingface/tokenizers/blob/331e3ffc257ec2792ad88f6ff820d335859ed775/tokenizers/src/utils/truncation.rs#L100

I assume there are some checks occurring that prevent this code path from being hit in the other cases above, but I wasn't able to identify where.
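To illustrate the suspected failure mode, here is a loose Python sketch of a longest_first-style truncation with a guard like the one linked above. This is NOT the actual Rust implementation, just a hypothetical model of the behavior: the longer sequence is trimmed token by token, and an error is raised if truncation leaves a sequence empty.

```python
def truncate_longest_first(seq_a, seq_b, max_len):
    """Rough sketch of a longest_first truncation strategy.

    Repeatedly drops a token from the (currently) longer sequence until
    the pair fits within max_len, then errors out if truncation ran and
    left either sequence empty -- analogous to the guard in truncation.rs.
    """
    a, b = list(seq_a), list(seq_b)
    n_to_remove = len(a) + len(b) - max_len
    for _ in range(max(0, n_to_remove)):
        if len(a) > len(b):
            a.pop()
        else:
            b.pop()
    if n_to_remove > 0 and (not a or not b):
        raise Exception(
            "Truncation error: Specified max length is too low "
            "to respect the various constraints"
        )
    return a, b


# Both sequences non-empty: truncation succeeds (like case 3).
a, b = truncate_longest_first(["the"] * 600, ["the"] * 600, 512)

# One sequence empty and truncation needed: the guard fires (like case 4).
# truncate_longest_first([], ["the"] * 600, 512)  # raises Exception
```

Under this model, cases 1 and 2 never enter the loop (nothing to remove), so the guard is skipped, which would explain why only case 4 errors.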

Stacktrace

Exception                                 Traceback (most recent call last)
<ipython-input-22-dda0aff18100> in <module>
----> 1 tokz('', 'word ' * 510, padding=True, truncation='longest_first', return_tensors='pt', max_length=512)

~/anaconda3/envs/aq/lib/python3.7/site-packages/transformers/tokenization_utils_base.py in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_pretokenized, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   1667                 return_length=return_length,
   1668                 verbose=verbose,
-> 1669                 **kwargs,
   1670             )
   1671

~/anaconda3/envs/aq/lib/python3.7/site-packages/transformers/tokenization_utils_base.py in encode_plus(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_pretokenized, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   1735             return_length=return_length,
   1736             verbose=verbose,
-> 1737             **kwargs,
   1738         )
   1739

~/anaconda3/envs/aq/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py in _encode_plus(self, text, text_pair, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_pretokenized, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
    418             return_length=return_length,
    419             verbose=verbose,
--> 420             **kwargs,
    421         )
    422

~/anaconda3/envs/aq/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_pretokenized, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
    329                     *batch_text_or_text_pairs[0],
    330                     add_special_tokens=add_special_tokens,
--> 331                     is_pretokenized=is_pretokenized,
    332                 )
    333             else:

~/anaconda3/envs/aq/lib/python3.7/site-packages/tokenizers/implementations/base_tokenizer.py in encode(self, sequence, pair, is_pretokenized, add_special_tokens)
    210             raise ValueError("encode: `sequence` can't be `None`")
    211
--> 212         return self._tokenizer.encode(sequence, pair, is_pretokenized, add_special_tokens)
    213
    214     def encode_batch(

Exception: Truncation error: Specified max length is too low to respect the various constraints

To reproduce

See the example above.

Expected behavior

The handling of empty strings (cases 1, 2, and 4) should be consistent: either empty strings are OK, or they always result in an error.
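In the meantime, a possible workaround is to guard the call site and fall back to single-sequence encoding when one side of the pair is empty. This is a sketch under the assumption that downstream code can tolerate a single-sequence encoding in that case; `encode_pair_safe` is a hypothetical helper, not part of the library.

```python
def encode_pair_safe(tokenize, text, text_pair, **kwargs):
    """Hypothetical workaround: avoid the pair-truncation path that
    raises when one member of the pair is empty.

    tokenize is any callable with the tokenizer's __call__ signature
    (e.g. a BertTokenizerFast instance).
    """
    if text and text_pair:
        # Both sides non-empty: encode as a normal pair.
        return tokenize(text, text_pair, **kwargs)
    # One (or both) sides empty: encode the non-empty side alone,
    # so longest_first truncation never sees an empty sequence.
    return tokenize(text or text_pair, **kwargs)
```

For example, `encode_pair_safe(tokz, empty, long, padding=True, truncation='longest_first', return_tensors='pt', max_length=512)` would encode only the long string instead of raising.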

edit: grammar
