Make LongestFirst truncation constant time and consistent #389
n1t0 merged 3 commits into huggingface:master from
Conversation
9695df2 to 120523d
@n1t0, I notice you've merged several recent PRs, and was hoping you might be able to review or comment on this one? I've run into this issue a few times, and it would be nice to have it resolved. Thanks.
Hi @thomlake and thank you for taking care of this! These modifications seem right to me, but I wanted to check whether the case that led to adding the assertion is handled correctly. You can check using this example:

```python
>>> from transformers import BertTokenizerFast
>>> tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
>>> tokenizer.batch_encode_plus([("ozeirgrezjgorjoij", "reoijgoirjg")], max_length=4, truncation="longest_first")
```

```
thread '<unnamed>' panicked at 'assertion failed: max_len > 0', /home/moian/devel/HuggingFace/tokenizers/tokenizers/src/tokenizer/encoding.rs:217:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/moian/devel/HuggingFace/transformers/src/transformers/tokenization_utils_base.py", line 2137, in batch_encode_plus
    **kwargs,
  File "/home/moian/devel/HuggingFace/transformers/src/transformers/tokenization_utils_fast.py", line 368, in _batch_encode_plus
    is_pretokenized=is_pretokenized,
  File "/home/moian/devel/HuggingFace/tokenizers/bindings/python/.env/lib/python3.7/site-packages/tokenizers-0.9.0.dev0-py3.7-linux-x86_64.egg/tokenizers/implementations/base_tokenizer.py", line 211, in encode
    return self._tokenizer.encode(sequence, pair, is_pretokenized, add_special_tokens)
pyo3_runtime.PanicException: assertion failed: max_len > 0
```

This is happening because we need one of the sequences to be totally truncated (since there are special tokens), and at the moment we don't support truncating to 0. What do you think about this?
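For intuition, the token budget here can be worked out directly (a hypothetical sketch; the constant and helper below are my own illustration, not part of tokenizers): a BERT pair encoding adds three special tokens, [CLS], [SEP], and [SEP], so max_length=4 leaves a single token to share between the two sequences, forcing one of them down to length zero.

```python
# Hypothetical sketch of the budget arithmetic behind the panic.
# A BERT pair is laid out as: [CLS] seq_a [SEP] seq_b [SEP]
NUM_SPECIAL_TOKENS_PAIR = 3

def sequence_budget(max_length: int) -> int:
    """Tokens left for both sequences once special tokens are counted."""
    return max_length - NUM_SPECIAL_TOKENS_PAIR

# With max_length=4 only one token remains for two non-empty sequences,
# so one sequence must be truncated to length 0 -- exactly the case the
# old `assert!(max_len > 0)` rejected.
print(sequence_budget(4))  # 1
```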
@n1t0 Thanks for taking the time to look at this. I can certainly add the assertion back in. However, before doing so I'd like to make the case that the proposed behavior is inconsistent.

```python
>>> from transformers import BertTokenizer
>>> tokz = BertTokenizer.from_pretrained("bert-base-cased")
>>> tokz.batch_encode_plus([("ozeirgrezjgorjoij", "reoijgoirjg")], max_length=4, truncation="longest_first")
# {'input_ids': [[101, 24919, 102, 102]], 'token_type_ids': [[0, 0, 0, 1]], 'attention_mask': [[1, 1, 1, 1]]}
```
```python
from transformers import BertTokenizer, BertTokenizerFast

tokz = BertTokenizer.from_pretrained('bert-base-uncased')
tokz_fast = BertTokenizerFast.from_pretrained('bert-base-uncased')

empty = ''
short = 'the ' * 509
long = 'the ' * 510

# Case 1: no truncation, no error
tokz(empty, empty, truncation='longest_first', max_length=512)
tokz_fast(empty, empty, truncation='longest_first', max_length=512)

# Case 2: no truncation, no error
tokz(empty, short, truncation='longest_first', max_length=512)
tokz_fast(empty, short, truncation='longest_first', max_length=512)

# Case 3: truncation, no error
tokz(long, long, truncation='longest_first', max_length=512)
tokz_fast(long, long, truncation='longest_first', max_length=512)

# Case 4: truncation, error from BertTokenizerFast only
tokz(empty, long, truncation='longest_first', max_length=512)
tokz_fast(empty, long, truncation='longest_first', max_length=512)
```

From this perspective I'd propose allowing empty input in all cases, as well as truncation to length zero. Let me know your thoughts, and whether you still think the assertion should be added back in.
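A minimal reference implementation of this proposal might look like the following sketch (my own illustration in Python, not the library's code): longest-first truncation that tolerates empty inputs and a budget of zero, so all four cases above terminate without error.

```python
def longest_first(seq_a, seq_b, max_len):
    """Drop tokens from the end of the currently longer sequence until
    the combined length fits within max_len. Either sequence may end up
    empty, and max_len == 0 simply empties both."""
    a, b = list(seq_a), list(seq_b)
    while len(a) + len(b) > max_len:
        if len(a) >= len(b):
            a.pop()
        else:
            b.pop()
    return a, b

longest_first([], [], 512)           # ([], []) -- empty input is fine
longest_first([1, 2, 3], [4, 5], 4)  # ([1, 2], [4, 5])
longest_first([1, 2], [3], 0)        # ([], []) -- truncation to length zero
```

Note that this loop is linear in the number of removed tokens; it mirrors the intended behavior only, not the constant-time algorithm this PR implements.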
Yes, I totally agree with you and the changes you are proposing. When first writing ...
Awesome! Glad we could come to a consensus.
If you'd like those changes to happen before merging this, I'm happy to look into it and roll them into this PR. For clarification, do you basically mean changing ...? Also, I wanted to take a moment to thank you @n1t0 and the rest of the Hugging Face team for building awesome stuff.
Yes, exactly! All the code that follows the second ...
Awesome! Feel free to include everything in this PR.
Thank you very much, we appreciate the kind words!
Hi @thomlake! We will make a new release soon, probably next week, and I think it would be awesome to have this included.

Hey @n1t0, sorry for the delay. I'll push the requested changes today or tomorrow.
6f644cd to 22a63de
tokenizers/src/tokenizer/encoding.rs (Outdated)

```rust
assert!(max_len > 0);
```

```rust
if max_len == 0 {
    let o = Encoding {
```
I'm not sure if this is the correct/best way to do this. I'm not very familiar with Rust, so I was mostly trying things and following the guidance from the compiler.
@n1t0 I've pushed a commit to support zero-length truncation. I left a comment on a piece of code I'm not sure of (this is my first time working with Rust). I have to admit, after implementing the changes I'm a bit unsure of how sensible the latest commit is. In particular, I think I was conflating supporting zero-length input and truncating to length zero. I can imagine cases where the former makes sense, and it would be easy for a user to catch and avoid if unintended by checking ... Let me know what you think. I'm happy to make any changes if needed, or to simply have this merged in.
Thank you very much for taking care of this @thomlake! I just pushed a quick fix for the part where you weren't sure: we can indeed avoid cloning the data in a new ...

I agree, and at first I was also thinking there was no good use case for truncating to zero length. It shouldn't happen often though, and being consistent with transformers is important, so I'm happy with this.
Awesome, thanks @n1t0.
TIL!
Yeah, I imagined consistency was most important. Let me know if there is anything else you need from me before merging. I appreciate the efforts and time on this!
tokenizers/src/utils/truncation.rs (Outdated)

```rust
)
}

fn truncatate_and_assert(
```
I was just waiting to make sure the checks were still good. Merging now. Thank you!
This PR addresses a consistency issue with the handling of empty input when truncating pairs with `TruncationStrategy::LongestFirst` (see #381 and huggingface/transformers#6669). It also makes the algorithm constant time and adds tests (previously there were none).
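The constant-time aspect can be illustrated with a sketch like this (illustrative Python of my own, not the PR's Rust code, and the tie-breaking choice is an assumption): rather than removing one token per loop iteration, the final lengths are computed directly from the input lengths and the budget.

```python
def longest_first_lengths(len_a: int, len_b: int, max_len: int):
    """Compute the truncated lengths (new_a, new_b) in O(1).

    If one sequence already fits within half the budget, it is kept
    whole and the other absorbs all the truncation; otherwise the
    budget is split between the two (extra token to the second).
    Handles empty inputs and max_len == 0 without special cases."""
    if len_a + len_b <= max_len:
        return len_a, len_b              # nothing to truncate
    half = max_len // 2
    if len_a <= half:
        return len_a, max_len - len_a    # a kept whole
    if len_b <= half:
        return max_len - len_b, len_b    # b kept whole
    return half, max_len - half          # both truncated to ~half

longest_first_lengths(0, 600, 512)    # (0, 512) -- the empty-input case
longest_first_lengths(300, 300, 512)  # (256, 256)
longest_first_lengths(5, 5, 0)        # (0, 0) -- truncation to zero
```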