
convert list to set in tokenize().split_on_tokens()#1881

Closed
578123043 wants to merge 1 commit into huggingface:master from 578123043:master

Conversation


@578123043 578123043 commented Nov 20, 2019

As in issue #1830, I ran into the same problem when adding special tokens to the Tokenizer. I believe the culprit is the property self.all_special_tokens: because it is a property, it is rebuilt on every access, and it is accessed many times once special tokens have been added.

An easy way to solve this is to build a temporary set once.

In my implementation, tokenization is roughly 10x faster with 207 special tokens added; I could not get a precise number because of multiprocessing : )
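The speed-up is easy to see with a stdlib-only micro-benchmark (illustrative only; the PR's actual diff is not shown in this thread, and the token names below are made up): membership tests against a list are O(n) while set lookups are O(1), so hoisting self.all_special_tokens into a local set before the tokenization loop avoids both the repeated property re-computation and the linear scans.

```python
import timeit

# 207 made-up special tokens, matching the count mentioned in the PR
special_tokens = [f"[Tableid={i}]" for i in range(207)]
special_tokens_set = set(special_tokens)  # the temporary set proposed above

probe = "[Tableid=206]"  # worst case for the list: the last element

# Compare repeated membership tests, as tokenize() would perform them
list_time = timeit.timeit(lambda: probe in special_tokens, number=100_000)
set_time = timeit.timeit(lambda: probe in special_tokens_set, number=100_000)

print(f"list lookup: {list_time:.3f}s, set lookup: {set_time:.3f}s")
```

On any machine the set lookup should be faster by an order of magnitude or more at this token count, consistent with the ~10x figure reported.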


codecov-io commented Nov 20, 2019

Codecov Report

Merging #1881 into master will increase coverage by 1.35%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #1881      +/-   ##
==========================================
+ Coverage   82.72%   84.08%   +1.35%     
==========================================
  Files          97       97              
  Lines       14316    14316              
==========================================
+ Hits        11843    12037     +194     
+ Misses       2473     2279     -194
Impacted Files Coverage Δ
transformers/tokenization_utils.py 92.14% <100%> (ø) ⬆️
transformers/modeling_openai.py 82% <0%> (+1.33%) ⬆️
transformers/modeling_ctrl.py 96.46% <0%> (+2.21%) ⬆️
transformers/modeling_xlnet.py 73.61% <0%> (+2.43%) ⬆️
transformers/modeling_roberta.py 71.76% <0%> (+12.35%) ⬆️
transformers/tests/modeling_tf_common_test.py 97.08% <0%> (+15.53%) ⬆️
transformers/modeling_tf_pytorch_utils.py 92.95% <0%> (+83.09%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f3386d9...821ba9e. Read the comment docs.

@LysandreJik
Member

Hi, thanks for looking into it! What's your use-case for adding 207 special tokens?

@578123043
Author

In the Kaggle TensorFlow 2 Natural Questions competition. I am trying to add some additional sequence embeddings, such as [Tableid=13], and to split short sentences.

@LysandreJik
Member

LysandreJik commented Nov 20, 2019

I may misunderstand, but why not use the add_tokens method rather than the add_special_tokens method, which is reserved for tokens like CLS or MASK?

@thomwolf
Member

thomwolf commented Dec 5, 2019

Yes, the add_special_tokens method is reserved for a limited number of tokens with special properties and usage, like CLS or MASK. For other uses, go with add_tokens.
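The distinction can be sketched with a toy class (a hypothetical illustration, not the transformers API): plain added tokens only extend the vocabulary, while special tokens are additionally tracked in a special-token list that tokenize() must consult on every call, which is why adding hundreds of them gets expensive.

```python
# Toy illustration (hypothetical class, not the transformers API) of the
# difference between add_tokens and add_special_tokens.
class ToyTokenizer:
    def __init__(self, vocab):
        self.vocab = list(vocab)
        self.special_tokens = []  # scanned by tokenize() on every call

    def add_tokens(self, tokens):
        # Extends the vocabulary only; no per-call overhead afterwards.
        new = [t for t in tokens if t not in self.vocab]
        self.vocab.extend(new)
        return len(new)

    def add_special_tokens(self, tokens):
        # Extends the vocabulary AND the special-token list.
        added = self.add_tokens(tokens)
        self.special_tokens.extend(tokens)
        return added

tok = ToyTokenizer(["hello", "world"])
tok.add_tokens(["[Tableid=13]"])             # grows the vocab only
tok.add_special_tokens(["[CLS]", "[MASK]"])  # also grows the special list
print(tok.special_tokens)  # ['[CLS]', '[MASK]']
```

In the real library the equivalent calls are tokenizer.add_tokens([...]) for domain vocabulary and tokenizer.add_special_tokens({...}) for the handful of structurally special tokens.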

@salmanmashayekh

Here is how we solved the performance issue when adding a custom vocabulary: in an overridden add_tokens method, we simply integrate new_tokens directly into self.vocab.

from transformers import BertTokenizer, WordpieceTokenizer
from collections import OrderedDict


class CustomVocabBertTokenizer(BertTokenizer):
    def add_tokens(self, new_tokens):
        # Keep only tokens not already in the vocab or among the special tokens
        new_tokens = [
            token for token in new_tokens
            if not (token in self.vocab or token in self.all_special_tokens)
        ]

        # Append the new tokens, continuing the existing id range
        self.vocab = OrderedDict([
            *self.vocab.items(),
            *[
                (token, i + len(self.vocab))
                for i, token in enumerate(new_tokens)
            ]
        ])

        # Rebuild the reverse mapping and the wordpiece tokenizer over the new vocab
        self.ids_to_tokens = OrderedDict([(ids, tok) for tok, ids in self.vocab.items()])
        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)

        return len(new_tokens)
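The id-assignment logic in that merge can be checked in isolation with plain OrderedDicts (stdlib only; the vocab entries below are made up): new tokens receive ids continuing from len(vocab), since the comprehension evaluates the length of the old vocab before the reassignment takes effect.

```python
from collections import OrderedDict

# Stand-in for an existing wordpiece vocab (made-up entries)
vocab = OrderedDict([("[PAD]", 0), ("hello", 1), ("world", 2)])
new_tokens = ["[Tableid=13]", "[Tableid=14]"]

# Same merge as in the add_tokens override: ids continue from len(vocab),
# which is still the OLD length while the new OrderedDict is being built
vocab = OrderedDict([
    *vocab.items(),
    *[(token, i + len(vocab)) for i, token in enumerate(new_tokens)],
])

# Reverse mapping, as rebuilt for ids_to_tokens
ids_to_tokens = OrderedDict((ids, tok) for tok, ids in vocab.items())

print(vocab["[Tableid=13]"], vocab["[Tableid=14]"])  # 3 4
print(ids_to_tokens[4])  # [Tableid=14]
```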
