
Make LongestFirst truncation constant time and consistent#389

Merged
n1t0 merged 3 commits into huggingface:master from thomlake:fix-longest-first
Sep 18, 2020

Conversation

@thomlake
Contributor

thomlake commented Sep 2, 2020

This PR addresses a consistency issue with the handling of empty input when truncating pairs with the TruncationStrategy::LongestFirst, #381 and huggingface/transformers#6669. It also makes the algorithm constant time and adds tests (previously there were none).
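For context, longest-first truncation is equivalent to repeatedly dropping one token from the currently longer sequence until the pair fits, and the kept lengths can be computed in one shot instead of looping. A minimal Python sketch of one such constant-time split (illustrative only; the tie-breaking details may differ from the algorithm actually merged here):

```python
def longest_first_keep(len_a: int, len_b: int, max_len: int) -> tuple[int, int]:
    """How many tokens to keep from each sequence of a pair, in O(1)."""
    if len_a + len_b <= max_len:
        return len_a, len_b  # nothing to truncate
    half = max_len // 2
    if len_a <= half:
        return len_a, max_len - len_a  # a fits in its half; b gets the rest
    if len_b <= half:
        return max_len - len_b, len_b  # b fits in its half; a gets the rest
    # both exceed half the budget: split it, extra token to the longer one
    return (max_len - half, half) if len_a >= len_b else (half, max_len - half)
```

Note that max_len == 0 falls out naturally as (0, 0), which matters for the zero-length truncation discussion below in this thread.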

@thomlake
Contributor Author

thomlake commented Sep 3, 2020

@n1t0, I noticed you've merged several recent PRs and was hoping you might be able to review or comment on this one? I've run into this issue a few times, and it would be nice to have it resolved. Thanks.

@n1t0
Contributor

n1t0 commented Sep 3, 2020

Hi @thomlake and thank you for taking care of this!

These modifications seem right to me, but I wanted to check whether the case that led to adding TruncationError::MaxLengthTooLow is still covered, and it appears it isn't in the current state.

You can check with this example using transformers:

>>> from transformers import BertTokenizerFast
>>> tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
>>> tokenizer.batch_encode_plus([("ozeirgrezjgorjoij", "reoijgoirjg")], max_length=4, truncation="longest_first")

thread '<unnamed>' panicked at 'assertion failed: max_len > 0', /home/moian/devel/HuggingFace/tokenizers/tokenizers/src/tokenizer/encoding.rs:217:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moian/devel/HuggingFace/transformers/src/transformers/tokenization_utils_base.py", line 2137, in batch_encode_plus
    **kwargs,
  File "/home/moian/devel/HuggingFace/transformers/src/transformers/tokenization_utils_fast.py", line 368, in _batch_encode_plus
    is_pretokenized=is_pretokenized,
  File "/home/moian/devel/HuggingFace/tokenizers/bindings/python/.env/lib/python3.7/site-packages/tokenizers-0.9.0.dev0-py3.7-linux-x86_64.egg/tokenizers/implementations/base_tokenizer.py", line 211, in encode
    return self._tokenizer.encode(sequence, pair, is_pretokenized, add_special_tokens)
pyo3_runtime.PanicException: assertion failed: max_len > 0

This is happening because we need one of the sequences to be totally truncated (since there are special tokens), and at the moment we don't support truncating to 0. What do you think about this?
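The arithmetic behind this: a BERT-style pair adds three special tokens ([CLS], [SEP], [SEP]), so max_length=4 leaves only one slot for actual content, which necessarily pushes one of the two sequences down to zero tokens. A hypothetical sketch of the budget (the constant and the helper name are illustrative, not the library's API):

```python
NUM_SPECIAL_PAIR = 3  # [CLS] seq_a [SEP] seq_b [SEP] in a BERT-style pair

def content_budget(max_length: int) -> int:
    # Tokens left for the two sequences once special tokens are reserved.
    return max(0, max_length - NUM_SPECIAL_PAIR)
```

With max_length=4 the budget is 1, so any non-trivial pair forces a zero-length truncation of one side.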

@thomlake
Contributor Author

thomlake commented Sep 4, 2020

@n1t0 Thanks for taking the time to look at this. I can certainly add the assertion back in. However, before doing so I'd like to make the case that the proposed behavior is inconsistent.

  1. The behavior does not appear to be consistent with BertTokenizer. I just cloned transformers master and can recreate the error you posted using BertTokenizerFast. However, if I use BertTokenizer I get different results.
>>> from transformers import BertTokenizer
>>> tokz = BertTokenizer.from_pretrained("bert-base-cased")
>>> tokz.batch_encode_plus([("ozeirgrezjgorjoij", "reoijgoirjg")], max_length=4, truncation="longest_first")
# {'input_ids': [[101, 24919, 102, 102]], 'token_type_ids': [[0, 0, 0, 1]], 'attention_mask': [[1, 1, 1, 1]]}
  2. Previously, the function was not self-consistent with respect to empty input. In particular, empty input was fine in all cases, unless one input was empty and the other had length > max_length. This is due to the check on line 84, total_length > params.max_length, which would cause the match, and therefore the subsequent assertions, to be skipped entirely. I mentioned this in Inconsistent handling of empty inputs when truncating using LongestFirst #381 and have copied an example here for visibility.
from transformers import BertTokenizer, BertTokenizerFast

tokz = BertTokenizer.from_pretrained('bert-base-uncased')
tokz_fast = BertTokenizerFast.from_pretrained('bert-base-uncased')

empty = ''
short = 'the ' * 509
long = 'the ' * 510

# Case 1: no truncation, no error
tokz(empty, empty, truncation='longest_first', max_length=512)
tokz_fast(empty, empty, truncation='longest_first', max_length=512)

# Case 2: no truncation, no error
tokz(empty, short, truncation='longest_first', max_length=512)
tokz_fast(empty, short, truncation='longest_first', max_length=512)

# Case 3: truncation, no error
tokz(long, long, truncation='longest_first', max_length=512)
tokz_fast(long, long, truncation='longest_first', max_length=512)

# Case 4: truncation, Truncation error from BertTokenizerFast only
tokz(empty, long, truncation='longest_first', max_length=512)
tokz_fast(empty, long, truncation='longest_first', max_length=512)

From this perspective I'd propose allowing empty input in all cases, as well as truncation to length zero. Let me know your thoughts, and if you still think the assertion should be added back in.
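To make the inconsistency in case 4 concrete, here is a rough Python paraphrase of the old control flow (names and return values are simplified stand-ins; the real code is Rust, and the guard is the line-84 check mentioned above):

```python
def old_behavior(len_a: int, len_b: int, max_length: int) -> str:
    # The truncation body, and hence its assertion, only ran when the
    # combined length exceeded max_length. Empty inputs slipped through
    # otherwise, which is why only case 4 panicked.
    if len_a + len_b > max_length:
        if min(len_a, len_b) == 0:
            return "panic: assertion failed: max_len > 0"
        return "truncated"
    return "ok"
```

Cases 1-3 never reach the assertion; case 4 (one empty input, combined length over the limit) is the only path that did.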

@n1t0
Contributor

n1t0 commented Sep 4, 2020

Yes, I totally agree with you, and the changes you are proposing.

When first writing Encoding::truncate I thought it didn't make sense to allow zero-length truncation, but it should be left to the caller to decide what makes sense. The part of the algorithm that handles the overflowing tokens will need to be updated too; in this case the overflow can probably be the whole initial Encoding.

@thomlake
Contributor Author

thomlake commented Sep 4, 2020

Yes, I totally agree with you, and the changes you are proposing.

Awesome! Glad we could come to a consensus.

The part of the algorithm that handles the overflowing tokens will need to be updated too; it can probably be the whole initial Encoding in this case.

If you'd like those changes to happen before merging this, I'm happy to look into it and roll them into this PR. For clarification, do you basically mean changing Encoding.truncate to eliminate the assert!(max_len > 0); and adjusting the code that follows to accommodate?

Also, wanted to take a moment to thank you @n1t0 and the rest of the huggingface team for building awesome stuff.

@n1t0
Contributor

n1t0 commented Sep 8, 2020

For clarification, do you basically mean changing Encoding.truncate to eliminate the assert!(max_len > 0);, and adjust the code that follows to accommodate?

Yes, exactly! All the code after the second assert is about cutting everything into chunks that each respect the provided max_len. That doesn't make sense when max_len == 0, though; in that case we can probably keep everything in one single piece.
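In list terms, ignoring the stride and the paired fields a real Encoding carries, the chunking with this zero-length special case might look like (a sketch of the idea, not the crate's actual code):

```python
def truncate_with_overflow(ids: list[int], max_len: int):
    """Keep the first max_len ids; split the rest into max_len-sized chunks."""
    if max_len == 0:
        # Nothing is kept: the whole original sequence becomes a single
        # overflowing piece rather than being cut into zero-sized chunks.
        return [], ([ids] if ids else [])
    kept = ids[:max_len]
    overflowing = [ids[i:i + max_len] for i in range(max_len, len(ids), max_len)]
    return kept, overflowing
```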

If you'd like those changes to happen before merging this, I'm happy to look into it, and roll them into this PR.

Awesome! Feel free to include everything in this PR.

Also, wanted to take a moment to thank you @n1t0 and the rest of the huggingface team for building awesome stuff.

Thank you very much, we appreciate the kind words!

@n1t0
Contributor

n1t0 commented Sep 16, 2020

Hi @thomlake! We will make a new release soon, probably next week, and I think it would be awesome to have this included.
If you are still down to make the last changes we talked about, that would be great, otherwise I'll try to do it before the release.

@thomlake
Contributor Author

Hey @n1t0, sorry for the delay. I'll push the requested changes today or tomorrow.

assert!(max_len > 0);

if max_len == 0 {
let o = Encoding {
Contributor Author


I'm not sure if this is the correct/best way to do this. I'm not very familiar with Rust, so I was mostly trying things and following the compiler's guidance.

@thomlake
Contributor Author

thomlake commented Sep 17, 2020

@n1t0 I've pushed a commit to support zero-length truncation. I left a comment on a piece of code I'm not sure about (this is my first time working with Rust).

I have to admit, after implementing the changes I'm a bit unsure how sensible the latest commit is. In particular, I think I was conflating supporting zero-length input with truncating to length zero. I can imagine cases where the former makes sense, and it would be easy for a user to catch and avoid if unintended by checking len(input_string) before tokenization. However, I can't really come up with a good use case for truncating to zero length when the input wasn't empty initially. It is also more difficult for a user to know a priori whether this will happen, due to special tokens. That being said, I believe this should help make these tokenizers more consistent with the transformers implementations.

Let me know what you think. I'm happy to make any changes if needed, or to simply have this merged in.

@n1t0
Contributor

n1t0 commented Sep 18, 2020

Thank you very much for taking care of this @thomlake!

I just pushed a quick fix for the part where you weren't sure: we can indeed avoid cloning the data in a new Encoding before pushing it in the overflowing, by using std::mem::replace.

I have to admit, after implementing the changes I'm a bit unsure how sensible the latest commit is. In particular, I think I was conflating supporting zero-length input with truncating to length zero. I can imagine cases where the former makes sense, and it would be easy for a user to catch and avoid if unintended by checking len(input_string) before tokenization. However, I can't really come up with a good use case for truncating to zero length when the input wasn't empty initially. It is also more difficult for a user to know a priori whether this will happen, due to special tokens. That being said, I believe this should help make these tokenizers more consistent with the transformers implementations.

I agree; at first I was also thinking there was no good use case for truncating to zero length. It shouldn't happen often though, and being consistent with transformers is important, so I'm happy with this.

@thomlake
Contributor Author

Awesome, thanks @n1t0.

using std::mem::replace

TIL!

being consistent with transformers is important

Yeah, I imagined consistency was most important.

Let me know if there is anything else you need from me before merging. I appreciate the efforts and time on this!

)
}

fn truncatate_and_assert(
Contributor Author


🤣

@n1t0
Contributor

n1t0 commented Sep 18, 2020

Was just waiting to make sure the checks were still good. Merging now. Thank you!
