
Inconsistent handling of empty string in tokenizers #6669

@thomlake

Description

Environment info

  • transformers version: 3.0.2
  • Platform: Linux-5.4.0-42-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.7
  • PyTorch version (GPU?): 1.6.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: no (issue with tokenizer)
  • Using distributed or parallel set-up in script?: no (issue with tokenizer)

Who can help

@mfuntowicz

Information

I'm encountering inconsistent handling of empty strings with BertTokenizerFast when tokenizing text pairs. Specifically, an error is raised when one string in a text pair is empty AND truncation is performed using the longest_first strategy. The issue only manifests when truncation actually occurs: if one of the strings is empty and the other is short enough that no truncation is needed (or both strings are empty), no error is raised (see the example below). I haven't checked other tokenizers to see whether they exhibit similar behavior.

Example

from transformers import BertTokenizerFast

tokz = BertTokenizerFast.from_pretrained('bert-base-uncased')

empty = ''
short = 'the ' * 509
long = 'the ' * 510

# Case 1: no truncation, no error
tokz(empty, empty, padding=True, truncation='longest_first', return_tensors='pt', max_length=512)

# Case 2: no truncation, no error
tokz(empty, short, padding=True, truncation='longest_first', return_tensors='pt', max_length=512)

# Case 3: truncation, no error
tokz(long, long, padding=True, truncation='longest_first', return_tensors='pt', max_length=512)

# Case 4: truncation, Truncation error
tokz(empty, long, padding=True, truncation='longest_first', return_tensors='pt', max_length=512)

Possible Cause

This appears to be due to logic in the tokenizers package that raises an error if either sequence has length 0 after truncation.

https://github.com/huggingface/tokenizers/blob/331e3ffc257ec2792ad88f6ff820d335859ed775/tokenizers/src/utils/truncation.rs#L100

I assume there are some checks occurring that prevent this code path from being hit in the other cases above, but I wasn't able to identify where.
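To illustrate the suspected failure mode, here is a loose Python sketch of a longest_first-style truncation with a guard like the one linked above. This is NOT the actual Rust implementation, just a hypothetical model of the behavior: the longer sequence is trimmed token by token, and an error is raised if truncation leaves a sequence empty.

```python
def truncate_longest_first(seq_a, seq_b, max_len):
    """Rough sketch of a longest_first truncation strategy.

    Repeatedly drops a token from the (currently) longer sequence until
    the pair fits within max_len, then errors out if truncation ran and
    left either sequence empty -- analogous to the guard in truncation.rs.
    """
    a, b = list(seq_a), list(seq_b)
    n_to_remove = len(a) + len(b) - max_len
    for _ in range(max(0, n_to_remove)):
        if len(a) > len(b):
            a.pop()
        else:
            b.pop()
    if n_to_remove > 0 and (not a or not b):
        raise Exception(
            "Truncation error: Specified max length is too low "
            "to respect the various constraints"
        )
    return a, b


# Both sequences non-empty: truncation succeeds (like case 3).
a, b = truncate_longest_first(["the"] * 600, ["the"] * 600, 512)

# One sequence empty and truncation needed: the guard fires (like case 4).
# truncate_longest_first([], ["the"] * 600, 512)  # raises Exception
```

Under this model, cases 1 and 2 never enter the loop (nothing to remove), so the guard is skipped, which would explain why only case 4 errors.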

Stacktrace

Exception                                 Traceback (most recent call last)
<ipython-input-22-dda0aff18100> in <module>
----> 1 tokz('', 'word ' * 510, padding=True, truncation='longest_first', return_tensors='pt', max_length=512)

~/anaconda3/envs/aq/lib/python3.7/site-packages/transformers/tokenization_utils_base.py in __call__(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_pretokenized, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   1667                 return_length=return_length,
   1668                 verbose=verbose,
-> 1669                 **kwargs,
   1670             )
   1671

~/anaconda3/envs/aq/lib/python3.7/site-packages/transformers/tokenization_utils_base.py in encode_plus(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_pretokenized, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   1735             return_length=return_length,
   1736             verbose=verbose,
-> 1737             **kwargs,
   1738         )
   1739

~/anaconda3/envs/aq/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py in _encode_plus(self, text, text_pair, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_pretokenized, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
    418             return_length=return_length,
    419             verbose=verbose,
--> 420             **kwargs,
    421         )
    422

~/anaconda3/envs/aq/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_pretokenized, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
    329                     *batch_text_or_text_pairs[0],
    330                     add_special_tokens=add_special_tokens,
--> 331                     is_pretokenized=is_pretokenized,
    332                 )
    333             else:

~/anaconda3/envs/aq/lib/python3.7/site-packages/tokenizers/implementations/base_tokenizer.py in encode(self, sequence, pair, is_pretokenized, add_special_tokens)
    210             raise ValueError("encode: `sequence` can't be `None`")
    211
--> 212         return self._tokenizer.encode(sequence, pair, is_pretokenized, add_special_tokens)
    213
    214     def encode_batch(

Exception: Truncation error: Specified max length is too low to respect the various constraints

To reproduce

See the example above.

Expected behavior

The handling of empty strings (cases 1, 2, and 4) should be consistent: either empty strings are OK, or they always result in an error.
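In the meantime, a possible workaround is to guard the call site and fall back to single-sequence encoding when one side of the pair is empty. This is a sketch under the assumption that downstream code can tolerate a single-sequence encoding in that case; `encode_pair_safe` is a hypothetical helper, not part of the library.

```python
def encode_pair_safe(tokenize, text, text_pair, **kwargs):
    """Hypothetical workaround: avoid the pair-truncation path that
    raises when one member of the pair is empty.

    tokenize is any callable with the tokenizer's __call__ signature
    (e.g. a BertTokenizerFast instance).
    """
    if text and text_pair:
        # Both sides non-empty: encode as a normal pair.
        return tokenize(text, text_pair, **kwargs)
    # One (or both) sides empty: encode the non-empty side alone,
    # so longest_first truncation never sees an empty sequence.
    return tokenize(text or text_pair, **kwargs)
```

For example, `encode_pair_safe(tokz, empty, long, padding=True, truncation='longest_first', return_tensors='pt', max_length=512)` would encode only the long string instead of raising.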

edit: grammar
