
Make LongestFirst truncation constant time and consistent#389

Merged
n1t0 merged 3 commits into huggingface:master from thomlake:fix-longest-first
Sep 18, 2020

Conversation

@thomlake
Contributor

thomlake commented Sep 2, 2020

This PR addresses a consistency issue with the handling of empty input when truncating pairs with the TruncationStrategy::LongestFirst, #381 and huggingface/transformers#6669. It also makes the algorithm constant time and adds tests (previously there were none).
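For context, longest-first truncation is equivalent to repeatedly dropping one token from the currently longer sequence until the pair fits, and the kept lengths can be computed in one shot instead of looping. A minimal Python sketch of one such constant-time split (illustrative only; the tie-breaking details may differ from the algorithm actually merged here):

```python
def longest_first_keep(len_a: int, len_b: int, max_len: int) -> tuple[int, int]:
    """How many tokens to keep from each sequence of a pair, in O(1)."""
    if len_a + len_b <= max_len:
        return len_a, len_b  # nothing to truncate
    half = max_len // 2
    if len_a <= half:
        return len_a, max_len - len_a  # a fits in its half; b gets the rest
    if len_b <= half:
        return max_len - len_b, len_b  # b fits in its half; a gets the rest
    # both exceed half the budget: split it, extra token to the longer one
    return (max_len - half, half) if len_a >= len_b else (half, max_len - half)
```

Note that max_len == 0 falls out naturally as (0, 0), which matters for the zero-length truncation discussion below in this thread.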

@thomlake
Contributor Author

thomlake commented Sep 3, 2020

@n1t0, I noticed you've merged several recent PRs and was hoping you might be able to review or comment on this one? I've run into this issue a few times, and it would be nice to have it resolved. Thanks.

@n1t0
Contributor

n1t0 commented Sep 3, 2020

Hi @thomlake and thank you for taking care of this!

These modifications seem right to me, but I wanted to check whether the case that led to adding TruncationError::MaxLengthTooLow is still covered, and it appears it isn't in the current state.

You can check with this example using transformers:

>>> from transformers import BertTokenizerFast
>>> tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
>>> tokenizer.batch_encode_plus([("ozeirgrezjgorjoij", "reoijgoirjg")], max_length=4, truncation="longest_first")

thread '<unnamed>' panicked at 'assertion failed: max_len > 0', /home/moian/devel/HuggingFace/tokenizers/tokenizers/src/tokenizer/encoding.rs:217:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moian/devel/HuggingFace/transformers/src/transformers/tokenization_utils_base.py", line 2137, in batch_encode_plus
    **kwargs,
  File "/home/moian/devel/HuggingFace/transformers/src/transformers/tokenization_utils_fast.py", line 368, in _batch_encode_plus
    is_pretokenized=is_pretokenized,
  File "/home/moian/devel/HuggingFace/tokenizers/bindings/python/.env/lib/python3.7/site-packages/tokenizers-0.9.0.dev0-py3.7-linux-x86_64.egg/tokenizers/implementations/base_tokenizer.py", line 211, in encode
    return self._tokenizer.encode(sequence, pair, is_pretokenized, add_special_tokens)
pyo3_runtime.PanicException: assertion failed: max_len > 0

This is happening because we need one of the sequences to be totally truncated (since there are special tokens), and at the moment we don't support truncating to 0. What do you think about this?
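The arithmetic behind this: a BERT-style pair adds three special tokens ([CLS], [SEP], [SEP]), so max_length=4 leaves only one slot for actual content, which necessarily pushes one of the two sequences down to zero tokens. A hypothetical sketch of the budget (the constant and the helper name are illustrative, not the library's API):

```python
NUM_SPECIAL_PAIR = 3  # [CLS] seq_a [SEP] seq_b [SEP] in a BERT-style pair

def content_budget(max_length: int) -> int:
    # Tokens left for the two sequences once special tokens are reserved.
    return max(0, max_length - NUM_SPECIAL_PAIR)
```

With max_length=4 the budget is 1, so any non-trivial pair forces a zero-length truncation of one side.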

@thomlake
Contributor Author

thomlake commented Sep 4, 2020

@n1t0 Thanks for taking the time to look at this. I can certainly add the assertion back in. However, before doing so I'd like to make the case that the proposed behavior is inconsistent.

  1. The behavior does not appear to be consistent with BertTokenizer. I just cloned transformers master and can recreate the error you posted using BertTokenizerFast. However, if I use BertTokenizer I get different results.
>>> from transformers import BertTokenizer
>>> tokz = BertTokenizer.from_pretrained("bert-base-cased")
>>> tokz.batch_encode_plus([("ozeirgrezjgorjoij", "reoijgoirjg")], max_length=4, truncation="longest_first")
# {'input_ids': [[101, 24919, 102, 102]], 'token_type_ids': [[0, 0, 0, 1]], 'attention_mask': [[1, 1, 1, 1]]}
  2. Previously, the function was not self-consistent with respect to empty input. In particular, empty input was fine in all cases, unless one input was empty and the other had length > max_length. This is due to the check on line 84, total_length > params.max_length, which would cause the match, and therefore the subsequent assertions, to be skipped entirely. I mentioned this in Inconsistent handling of empty inputs when truncating using LongestFirst #381 and have copied an example here for visibility.
from transformers import BertTokenizer, BertTokenizerFast

tokz = BertTokenizer.from_pretrained('bert-base-uncased')
tokz_fast = BertTokenizerFast.from_pretrained('bert-base-uncased')

empty = ''
short = 'the ' * 509
long = 'the ' * 510

# Case 1: no truncation, no error
tokz(empty, empty, truncation='longest_first', max_length=512)
tokz_fast(empty, empty, truncation='longest_first', max_length=512)

# Case 2: no truncation, no error
tokz(empty, short, truncation='longest_first', max_length=512)
tokz_fast(empty, short, truncation='longest_first', max_length=512)

# Case 3: truncation, no error
tokz(long, long, truncation='longest_first', max_length=512)
tokz_fast(long, long, truncation='longest_first', max_length=512)

# Case 4: truncation, Truncation error from BertTokenizerFast only
tokz(empty, long, truncation='longest_first', max_length=512)
tokz_fast(empty, long, truncation='longest_first', max_length=512)

From this perspective I'd propose allowing empty input in all cases, as well as truncation to length zero. Let me know your thoughts, and if you still think the assertion should be added back in.
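To make the inconsistency in case 4 concrete, here is a rough Python paraphrase of the old control flow (names and return values are simplified stand-ins; the real code is Rust, and the guard is the line-84 check mentioned above):

```python
def old_behavior(len_a: int, len_b: int, max_length: int) -> str:
    # The truncation body, and hence its assertion, only ran when the
    # combined length exceeded max_length. Empty inputs slipped through
    # otherwise, which is why only case 4 panicked.
    if len_a + len_b > max_length:
        if min(len_a, len_b) == 0:
            return "panic: assertion failed: max_len > 0"
        return "truncated"
    return "ok"
```

Cases 1-3 never reach the assertion; case 4 (one empty input, combined length over the limit) is the only path that did.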

@n1t0
Contributor

n1t0 commented Sep 4, 2020

Yes, I totally agree with you, and the changes you are proposing.

When first writing Encoding::truncate I thought it didn't make sense to allow zero-length truncation, but it should be left to the caller to decide what makes sense. The part of the algorithm that handles the overflowing tokens will need to be updated too; in this case the overflow can probably be the whole initial Encoding.

@thomlake
Contributor Author

thomlake commented Sep 4, 2020

Yes, I totally agree with you, and the changes you are proposing.

Awesome! Glad we could come to a consensus.

The part of the algorithm that handles the overflowing tokens will need to be updated too; it can probably be the whole initial Encoding in this case.

If you'd like those changes to happen before merging this, I'm happy to look into it and roll them into this PR. For clarification, do you basically mean changing Encoding.truncate to eliminate the assert!(max_len > 0); and adjusting the code that follows to accommodate?

Also, wanted to take a moment to thank you @n1t0 and the rest of the huggingface team for building awesome stuff.

@n1t0
Contributor

n1t0 commented Sep 8, 2020

For clarification, do you basically mean changing Encoding.truncate to eliminate the assert!(max_len > 0);, and adjust the code that follows to accommodate?

Yes, exactly! All the code after the second assert is about cutting everything into chunks that each respect the provided max_len. That doesn't make sense when max_len == 0, though; in that case we can probably keep everything in one single piece.
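In list terms, ignoring the stride and the paired fields a real Encoding carries, the chunking with this zero-length special case might look like (a sketch of the idea, not the crate's actual code):

```python
def truncate_with_overflow(ids: list[int], max_len: int):
    """Keep the first max_len ids; split the rest into max_len-sized chunks."""
    if max_len == 0:
        # Nothing is kept: the whole original sequence becomes a single
        # overflowing piece rather than being cut into zero-sized chunks.
        return [], ([ids] if ids else [])
    kept = ids[:max_len]
    overflowing = [ids[i:i + max_len] for i in range(max_len, len(ids), max_len)]
    return kept, overflowing
```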

If you'd like those changes to happen before merging this, I'm happy to look into it, and roll them into this PR.

Awesome! Feel free to include everything in this PR.

Also, wanted to take a moment to thank you @n1t0 and the rest of the huggingface team for building awesome stuff.

Thank you very much, we appreciate the kind words!

@n1t0
Contributor

n1t0 commented Sep 16, 2020

Hi @thomlake! We will make a new release soon, probably next week, and I think it would be awesome to have this included.
If you are still down to make the last changes we talked about, that would be great, otherwise I'll try to do it before the release.

@thomlake
Contributor Author

Hey @n1t0, sorry for the delay. I'll push the requested changes today or tomorrow.

assert!(max_len > 0);

if max_len == 0 {
let o = Encoding {
Contributor Author


I'm not sure if this is the correct/best way to do this. I'm not very familiar with Rust, so I was mostly trying things and following the compiler's guidance.

@thomlake
Contributor Author

thomlake commented Sep 17, 2020

@n1t0 I've pushed a commit to support zero-length truncation. I left a comment on a piece of code I'm not sure about (this is my first time working with Rust).

I have to admit, after implementing the changes I'm a bit unsure how sensible the latest commit is. In particular, I think I was conflating supporting zero-length input with truncating to length zero. I can imagine cases where the former makes sense, and it would be easy for a user to catch and avoid if unintended by checking len(input_string) before tokenization. However, I can't really come up with a good use case for truncating to zero length when the input wasn't empty initially. It is also more difficult for a user to know a priori whether this will happen, due to special tokens. That being said, I believe this should help make these tokenizers more consistent with the transformers implementations.

Let me know what you think. I'm happy to make any changes if needed, or to simply have this merged in.

@n1t0
Contributor

n1t0 commented Sep 18, 2020

Thank you very much for taking care of this @thomlake!

I just pushed a quick fix for the part where you weren't sure: we can indeed avoid cloning the data in a new Encoding before pushing it in the overflowing, by using std::mem::replace.

I have to admit, after implementing the changes I'm a bit unsure how sensible the latest commit is. In particular, I think I was conflating supporting zero-length input with truncating to length zero. I can imagine cases where the former makes sense, and it would be easy for a user to catch and avoid if unintended by checking len(input_string) before tokenization. However, I can't really come up with a good use case for truncating to zero length when the input wasn't empty initially. It is also more difficult for a user to know a priori whether this will happen, due to special tokens. That being said, I believe this should help make these tokenizers more consistent with the transformers implementations.

I agree; at first I was also thinking there was no good use case for truncating to zero length. It shouldn't happen often though, and being consistent with transformers is important, so I'm happy with this.

@thomlake
Contributor Author

Awesome, thanks @n1t0.

using std::mem::replace

TIL!

being consistent with transformers is important

Yeah, I imagined consistency was most important.

Let me know if there is anything else you need from me before merging. I appreciate the efforts and time on this!

)
}

fn truncatate_and_assert(
Contributor Author


🤣

@n1t0
Contributor

n1t0 commented Sep 18, 2020

Was just waiting to make sure the checks were still good. Merging now. Thank you!
