This repository was archived by the owner on Apr 8, 2025. It is now read-only.

WIP Add fast rust tokenizers #205

Closed

tholor wants to merge 3 commits into master from fast_tokenizers

Conversation

@tholor (Member) commented Jan 22, 2020

Let's see if we can get the new, fast tokenizers from huggingface integrated in a nice way :)
Speed seems promising and could help to solve #157

@stale (bot) commented Jun 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs.

@stale stale bot added the stale label Jun 6, 2020
@Timoeller (Contributor) commented
We will continue on this PR. Having faster data processing is relevant to us.

@stale stale bot removed the stale label Jun 9, 2020
@tholor (Member, Author) commented Jun 24, 2020

The serialization issue that prevented multiprocessing (and therefore integration in FARM) has been resolved in huggingface/tokenizers#272. The basic test of fast tokenization seems to work within this branch now 🎉 .
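As a stdlib-only illustration of the constraint (the toy whitespace encoder below is hypothetical, not the actual tokenizers API): multiprocessing ships its arguments to worker processes via pickle, so any state the workers need, such as a vocab, must survive a pickle round-trip.

```python
import pickle

vocab = {"[UNK]": 0, "hello": 1, "world": 2}

def encode(vocab, text):
    # toy whitespace "tokenizer": unknown tokens map to [UNK]
    return [vocab.get(t, vocab["[UNK]"]) for t in text.lower().split()]

# multiprocessing pickles arguments before shipping them to workers,
# so everything a worker needs must make it through a round-trip intact:
restored = pickle.loads(pickle.dumps(vocab))
print(encode(restored, "Hello world !"))  # [1, 2, 0]
```

The Rust-backed tokenizer objects initially failed exactly this round-trip, which is what huggingface/tokenizers#272 fixed.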

Let's move ahead once there's a new pypi release of tokenizers and the version is also upgraded in transformers.

@tholor (Member, Author) commented Jun 24, 2020

From a quick check, the offsets implemented in tokenizers can also deal with special chars / [UNK] (see #421).

```python
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
text = "Hello, y'all! How are you 😁 ?"
output = tokenizer.encode(text)
print(output.ids, output.tokens, output.offsets)
# the emoji is tokenized as [UNK], but its offsets still point at the
# original character:
print(output.tokens[10])   # [UNK]
print(output.offsets[10])  # (26, 27)
print(text[26:27])         # 😁
print(text[0:5])           # Hello
```
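The same idea can be sketched with the stdlib alone (the whitespace tokenizer below is a hypothetical stand-in, not the tokenizers API): recording a `(start, end)` character span per token means even a token replaced by `[UNK]` can be mapped back to the original text.

```python
import re

def encode_with_offsets(sample, vocab):
    """Toy whitespace tokenizer that records (start, end) character
    offsets for every token; out-of-vocab tokens become [UNK]."""
    tokens, offsets = [], []
    for m in re.finditer(r"\S+", sample):
        word = m.group(0).lower()
        tokens.append(word if word in vocab else "[UNK]")
        offsets.append((m.start(), m.end()))
    return tokens, offsets

sample = "Hello 😁 world"
tokens, offsets = encode_with_offsets(sample, {"hello", "world"})
print(tokens)  # ['hello', '[UNK]', 'world']
start, end = offsets[1]
print(sample[start:end])  # the emoji, recovered despite [UNK]
```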

@Timoeller (Contributor) commented
It seems that the Rust tokenizers are officially in transformers: https://github.com/huggingface/transformers/releases/tag/v3.0.0

Am I right in assuming we can use this functionality and fix #420 + #421? Would love to see some progress here.

@tholor (Member, Author) commented Jun 29, 2020

Yes, seems good to go. The tokenizers version in transformers is now pinned to 0.8.0rc4 (which should already have proper serialization).

@PhilipMay (Contributor) commented Aug 1, 2020

Damn. I just saw I had the same in mind with this PR: #482

My suggestion / offer: I can continue with PR #482 and will try to merge the work from here.

PhilipMay added a commit to PhilipMay/FARM that referenced this pull request Aug 1, 2020
@Timoeller (Contributor) commented
Yes, totally. Feel free to reuse the tests in your PR.

@tholor (Member, Author) commented Aug 26, 2020

Closing this one as we continued in #482

@tholor tholor closed this Aug 26, 2020
tholor added a commit that referenced this pull request Sep 2, 2020
* Add option to use fast HF tokenizer

* Hand merge tests from PR #205

* test_inferencer_with_fast_bert_tokenizer

* test_fast_bert_tokenizer

* test_fast_bert_tokenizer_strip_accents

* test_fast_electra_tokenizer

* Fix OOM issue of CI

- set num_processes=0 for Inferencer

* Extend test for fast tokenizer

- electra
- roberta

* test_fast_tokenizer for more model types

- electra
- roberta

* Fix tokenize_with_metadata

* Split tokenizer tests

* Fix pytest params bug in test_tok

* Fix fast tokenizer usage

* add missing newline eof

* Add test fast tok. doc_classif.

* Remove RobertaTokenizerFast

* Fix Tokenizer load and save.

* Fix typo

* Improve test test_embeddings_extraction

- add shape assert
- fix embedding assert

* Docstring for fast tokenizers improved

* tokenizer_args docstring

* Extend test_embeddings_extraction to fast tok.

* extend test_ner with fast tok.

* fix sample_to_features_ner for fast tokenizer

* temp fix for is_pretokenized until fixed upstream

* Make use of fast tokenizer possible + fix bug in offset calculation

* Make fast tokenization possible with NER, LM and QA

* Change error messages

* Add tests

* update error messages, comments and truncation arg in tokenizer

Co-authored-by: Malte Pietsch <[email protected]>
Co-authored-by: Bogdan Kostić <[email protected]>
Timoeller pushed a commit that referenced this pull request Dec 23, 2020
