
convert list to set in tokenize().split_on_tokens()#1881

Closed
578123043 wants to merge 1 commit into huggingface:master from 578123043:master

Conversation


@578123043 578123043 commented Nov 20, 2019

As in issue #1830, I ran into the same problem when adding special tokens to the Tokenizer. I believe the culprit is the property self.all_special_tokens: because it is a property, it is rebuilt on every access, and it is accessed many times once special tokens have been added.

An easy way to solve this is to build a temporary set once.

In my implementation, tokenization is roughly 10x faster with 207 special tokens added; I could not get a precise number because of multiprocessing : )
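The speed-up is easy to see with a stdlib-only micro-benchmark (illustrative only; the PR's actual diff is not shown in this thread, and the token names below are made up): membership tests against a list are O(n) while set lookups are O(1), so hoisting self.all_special_tokens into a local set before the tokenization loop avoids both the repeated property re-computation and the linear scans.

```python
import timeit

# 207 made-up special tokens, matching the count mentioned in the PR
special_tokens = [f"[Tableid={i}]" for i in range(207)]
special_tokens_set = set(special_tokens)  # the temporary set proposed above

probe = "[Tableid=206]"  # worst case for the list: the last element

# Compare repeated membership tests, as tokenize() would perform them
list_time = timeit.timeit(lambda: probe in special_tokens, number=100_000)
set_time = timeit.timeit(lambda: probe in special_tokens_set, number=100_000)

print(f"list lookup: {list_time:.3f}s, set lookup: {set_time:.3f}s")
```

On any machine the set lookup should be faster by an order of magnitude or more at this token count, consistent with the ~10x figure reported.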


codecov-io commented Nov 20, 2019

Codecov Report

Merging #1881 into master will increase coverage by 1.35%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #1881      +/-   ##
==========================================
+ Coverage   82.72%   84.08%   +1.35%     
==========================================
  Files          97       97              
  Lines       14316    14316              
==========================================
+ Hits        11843    12037     +194     
+ Misses       2473     2279     -194
Impacted Files Coverage Δ
transformers/tokenization_utils.py 92.14% <100%> (ø) ⬆️
transformers/modeling_openai.py 82% <0%> (+1.33%) ⬆️
transformers/modeling_ctrl.py 96.46% <0%> (+2.21%) ⬆️
transformers/modeling_xlnet.py 73.61% <0%> (+2.43%) ⬆️
transformers/modeling_roberta.py 71.76% <0%> (+12.35%) ⬆️
transformers/tests/modeling_tf_common_test.py 97.08% <0%> (+15.53%) ⬆️
transformers/modeling_tf_pytorch_utils.py 92.95% <0%> (+83.09%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f3386d9...821ba9e. Read the comment docs.

@LysandreJik
Member

Hi, thanks for looking into it! What's your use-case for adding 207 special tokens?

@578123043
Author

In the Kaggle TensorFlow 2 Natural Questions competition. I am trying to add some additional sequence embeddings, such as [Tableid=13], and to split short sentences.

@LysandreJik
Member

LysandreJik commented Nov 20, 2019

I may misunderstand, but why not use the add_tokens method rather than the add_special_tokens method, which is reserved for tokens like CLS or MASK?

@thomwolf
Member

thomwolf commented Dec 5, 2019

Yes, the add_special_tokens method is reserved for a limited number of tokens with special properties and usage, like CLS or MASK. For other uses, go with add_tokens.
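The distinction can be sketched with a toy class (a hypothetical illustration, not the transformers API): plain added tokens only extend the vocabulary, while special tokens are additionally tracked in a special-token list that tokenize() must consult on every call, which is why adding hundreds of them gets expensive.

```python
# Toy illustration (hypothetical class, not the transformers API) of the
# difference between add_tokens and add_special_tokens.
class ToyTokenizer:
    def __init__(self, vocab):
        self.vocab = list(vocab)
        self.special_tokens = []  # scanned by tokenize() on every call

    def add_tokens(self, tokens):
        # Extends the vocabulary only; no per-call overhead afterwards.
        new = [t for t in tokens if t not in self.vocab]
        self.vocab.extend(new)
        return len(new)

    def add_special_tokens(self, tokens):
        # Extends the vocabulary AND the special-token list.
        added = self.add_tokens(tokens)
        self.special_tokens.extend(tokens)
        return added

tok = ToyTokenizer(["hello", "world"])
tok.add_tokens(["[Tableid=13]"])             # grows the vocab only
tok.add_special_tokens(["[CLS]", "[MASK]"])  # also grows the special list
print(tok.special_tokens)  # ['[CLS]', '[MASK]']
```

In the real library the equivalent calls are tokenizer.add_tokens([...]) for domain vocabulary and tokenizer.add_special_tokens({...}) for the handful of structurally special tokens.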

@salmanmashayekh

Here is how we solved the performance issue when adding a custom vocabulary: in an overridden add_tokens method, we simply integrate new_tokens directly into self.vocab.

from transformers import BertTokenizer, WordpieceTokenizer
from collections import OrderedDict


class CustomVocabBertTokenizer(BertTokenizer):
    def add_tokens(self, new_tokens):
        # Keep only tokens not already in the vocab or among the special tokens
        new_tokens = [
            token for token in new_tokens
            if not (token in self.vocab or token in self.all_special_tokens)
        ]

        # Append the new tokens, continuing the existing id range
        self.vocab = OrderedDict([
            *self.vocab.items(),
            *[
                (token, i + len(self.vocab))
                for i, token in enumerate(new_tokens)
            ]
        ])

        # Rebuild the reverse mapping and the wordpiece tokenizer over the new vocab
        self.ids_to_tokens = OrderedDict([(ids, tok) for tok, ids in self.vocab.items()])
        self.wordpiece_tokenizer = WordpieceTokenizer(vocab=self.vocab, unk_token=self.unk_token)

        return len(new_tokens)
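The id-assignment logic in that merge can be checked in isolation with plain OrderedDicts (stdlib only; the vocab entries below are made up): new tokens receive ids continuing from len(vocab), since the comprehension evaluates the length of the old vocab before the reassignment takes effect.

```python
from collections import OrderedDict

# Stand-in for an existing wordpiece vocab (made-up entries)
vocab = OrderedDict([("[PAD]", 0), ("hello", 1), ("world", 2)])
new_tokens = ["[Tableid=13]", "[Tableid=14]"]

# Same merge as in the add_tokens override: ids continue from len(vocab),
# which is still the OLD length while the new OrderedDict is being built
vocab = OrderedDict([
    *vocab.items(),
    *[(token, i + len(vocab)) for i, token in enumerate(new_tokens)],
])

# Reverse mapping, as rebuilt for ids_to_tokens
ids_to_tokens = OrderedDict((ids, tok) for tok, ids in vocab.items())

print(vocab["[Tableid=13]"], vocab["[Tableid=14]"])  # 3 4
print(ids_to_tokens[4])  # [Tableid=14]
```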
