This repository was archived by the owner on Apr 8, 2025. It is now read-only.

WIP Add fast rust tokenizers #205

Closed

tholor wants to merge 3 commits into master from fast_tokenizers

Conversation

@tholor (Member) commented Jan 22, 2020

Let's see if we can get the new, fast tokenizers from huggingface integrated in a nice way :)
Speed seems promising and could help to solve #157

@stale (bot) commented Jun 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs.

@stale stale bot added the stale label Jun 6, 2020
@Timoeller (Contributor) commented
We will continue on this PR. Having faster data processing is relevant to us.

@stale stale bot removed the stale label Jun 9, 2020
@tholor (Member, Author) commented Jun 24, 2020

The serialization issue that prevented multiprocessing (and therefore integration in FARM) has been resolved in huggingface/tokenizers#272. The basic test of fast tokenization seems to work within this branch now 🎉 .
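As a stdlib-only illustration of the constraint (the toy whitespace encoder below is hypothetical, not the actual tokenizers API): multiprocessing ships its arguments to worker processes via pickle, so any state the workers need, such as a vocab, must survive a pickle round-trip.

```python
import pickle

vocab = {"[UNK]": 0, "hello": 1, "world": 2}

def encode(vocab, text):
    # toy whitespace "tokenizer": unknown tokens map to [UNK]
    return [vocab.get(t, vocab["[UNK]"]) for t in text.lower().split()]

# multiprocessing pickles arguments before shipping them to workers,
# so everything a worker needs must make it through a round-trip intact:
restored = pickle.loads(pickle.dumps(vocab))
print(encode(restored, "Hello world !"))  # [1, 2, 0]
```

The Rust-backed tokenizer objects initially failed exactly this round-trip, which is what huggingface/tokenizers#272 fixed.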

Let's move ahead once there's a new pypi release of tokenizers and the version is also upgraded in transformers.

@tholor (Member, Author) commented Jun 24, 2020

From a quick check, the offsets implemented in tokenizers can also deal with special chars / [UNK] (see #421).

```python
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
text = "Hello, y'all! How are you 😁 ?"
output = tokenizer.encode(text)
print(output.ids, output.tokens, output.offsets)
# the emoji is tokenized as [UNK], but its offsets still point at the
# original character:
print(output.tokens[10])   # [UNK]
print(output.offsets[10])  # (26, 27)
print(text[26:27])         # 😁
print(text[0:5])           # Hello
```
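The same idea can be sketched with the stdlib alone (the whitespace tokenizer below is a hypothetical stand-in, not the tokenizers API): recording a `(start, end)` character span per token means even a token replaced by `[UNK]` can be mapped back to the original text.

```python
import re

def encode_with_offsets(sample, vocab):
    """Toy whitespace tokenizer that records (start, end) character
    offsets for every token; out-of-vocab tokens become [UNK]."""
    tokens, offsets = [], []
    for m in re.finditer(r"\S+", sample):
        word = m.group(0).lower()
        tokens.append(word if word in vocab else "[UNK]")
        offsets.append((m.start(), m.end()))
    return tokens, offsets

sample = "Hello 😁 world"
tokens, offsets = encode_with_offsets(sample, {"hello", "world"})
print(tokens)  # ['hello', '[UNK]', 'world']
start, end = offsets[1]
print(sample[start:end])  # the emoji, recovered despite [UNK]
```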

@Timoeller (Contributor) commented
It seems that the Rust tokenizers are officially in transformers: https://github.com/huggingface/transformers/releases/tag/v3.0.0

Am I right in assuming we can use this functionality and fix #420 + #421? Would love to see some progress here.

@tholor (Member, Author) commented Jun 29, 2020

Yes, seems good to go. The tokenizers version in transformers is now pinned to 0.8.0rc4 (which should already have proper serialization).

@PhilipMay (Contributor) commented Aug 1, 2020

Damn. I just saw I had the same in mind with this PR: #482

My suggestion / offer: I can continue with PR #482 and will try to merge the work from here.

PhilipMay added a commit to PhilipMay/FARM that referenced this pull request Aug 1, 2020
@Timoeller (Contributor) commented
Yes, totally. Feel free to reuse the tests in your PR.

@tholor (Member, Author) commented Aug 26, 2020

Closing this one as we continued in #482

@tholor tholor closed this Aug 26, 2020
tholor added a commit that referenced this pull request Sep 2, 2020
* Add option to use fast HF tokenizer

* Hand merge tests from PR #205

* test_inferencer_with_fast_bert_tokenizer

* test_fast_bert_tokenizer

* test_fast_bert_tokenizer_strip_accents

* test_fast_electra_tokenizer

* Fix OOM issue of CI

- set num_processes=0 for Inferencer

* Extend test for fast tokenizer

- electra
- roberta

* test_fast_tokenizer for more model types

- electra
- roberta

* Fix tokenize_with_metadata

* Split tokenizer tests

* Fix pytest params bug in test_tok

* Fix fast tokenizer usage

* add missing newline eof

* Add test fast tok. doc_classif.

* Remove RobertaTokenizerFast

* Fix Tokenizer load and save.

* Fix typo

* Improve test test_embeddings_extraction

- add shape assert
- fix embedding assert

* Docstring for fast tokenizers improved

* tokenizer_args docstring

* Extend test_embeddings_extraction to fast tok.

* extend test_ner with fast tok.

* fix sample_to_features_ner for fast tokenizer

* temp fix for is_pretokenized until fixed upstream

* Make use of fast tokenizer possible + fix bug in offset calculation

* Make fast tokenization possible with NER, LM and QA

* Change error messages

* Add tests

* update error messages, comments and truncation arg in tokenizer

Co-authored-by: Malte Pietsch <[email protected]>
Co-authored-by: Bogdan Kostić <[email protected]>
Timoeller pushed a commit that referenced this pull request Dec 23, 2020
