Conversation
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. |
We will continue on this PR. Having faster data processing is relevant to us. |
The serialization issue that prevented multiprocessing (and therefore integration in FARM) has been resolved in huggingface/tokenizers#272. The basic test of fast tokenization seems to work within this branch now 🎉 . Let's move ahead once there's a new pypi release of |
From a quick check, the offsets implemented in |
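For readers unfamiliar with the offsets mentioned here: fast tokenizers return, for each token, the character span it covers in the original text. As a rough pure-Python illustration of that idea (this is a naive sketch, not the library's actual implementation; the helper name and the WordPiece `##` handling are assumptions):

```python
def char_offsets(text, tokens):
    """Naive offset computation: find each token's (start, end)
    character span in the original text, scanning left to right."""
    offsets = []
    cursor = 0
    for tok in tokens:
        piece = tok.lstrip("#")  # drop WordPiece continuation marker "##"
        start = text.find(piece, cursor)
        offsets.append((start, start + len(piece)))
        cursor = start + len(piece)
    return offsets

print(char_offsets("Hello world", ["Hello", "world"]))  # [(0, 5), (6, 11)]
```

Such spans are what make it possible to map model predictions back onto the raw text, e.g. for QA answer extraction or NER.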
It seems that Rust tokenizers are officially in transformers: https://github.com/huggingface/transformers/releases/tag/v3.0.0 Am I right in assuming we can use this functionality and fix #420 + #421? Would love to see some progress here.
Yes, seems good to go. The tokenizers version in transformers is now pinned to 0.8.0rc4 (which should already have proper serialization) |
Yes, totally. Feel free to reuse the tests in your PR.
Closing this one as we continued in #482 |
* Add option to use fast HF tokenizer
* Hand merge tests from PR #205
* test_inferencer_with_fast_bert_tokenizer
* test_fast_bert_tokenizer
* test_fast_bert_tokenizer_strip_accents
* test_fast_electra_tokenizer
* Fix OOM issue of CI - set num_processes=0 for Inferencer
* Extend test for fast tokenizer - electra - roberta
* test_fast_tokenizer for more model types - electra - roberta
* Fix tokenize_with_metadata
* Split tokenizer tests
* Fix pytest params bug in test_tok
* Fix fast tokenizer usage
* Add missing newline eof
* Add test fast tok. doc_classif.
* Remove RobertaTokenizerFast
* Fix Tokenizer load and save
* Fix typo
* Improve test test_embeddings_extraction - add shape assert - fix embedding assert
* Docstring for fast tokenizers improved
* tokenizer_args docstring
* Extend test_embeddings_extraction to fast tok.
* Extend test_ner with fast tok.
* Fix sample_to_features_ner for fast tokenizer
* Temp fix for is_pretokenized until fixed upstream
* Make use of fast tokenizer possible + fix bug in offset calculation
* Make fast tokenization possible with NER, LM and QA
* Change error messages
* Add tests
* Update error messages, comments and truncation arg in tokenizer

Co-authored-by: Malte Pietsch <[email protected]>
Co-authored-by: Bogdan Kostić <[email protected]>
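One recurring theme in the commits above is mapping word-level NER labels onto the sub-word tokens a fast tokenizer produces. A minimal sketch of that alignment, assuming the `word_ids` sequence a fast tokenizer can report (one entry per token, `None` for special tokens); the function name and the `-100` ignore-index convention are illustrative assumptions:

```python
def align_labels(word_labels, word_ids):
    """Map word-level labels to token-level labels.
    Special tokens and continuation sub-tokens get -100
    (commonly ignored by the loss function)."""
    labels = []
    prev = None
    for wid in word_ids:
        if wid is None:
            labels.append(-100)               # special token, e.g. [CLS]/[SEP]
        elif wid != prev:
            labels.append(word_labels[wid])   # first sub-token of the word
        else:
            labels.append(-100)               # continuation sub-token
        prev = wid
    return labels

# Two words with labels 1 and 2; the second word splits into two sub-tokens:
print(align_labels([1, 2], [None, 0, 1, 1, None]))  # [-100, 1, 2, -100, -100]
```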
Let's see if we can get the new, fast tokenizers from huggingface integrated in a nice way :)
Speed seems promising and could help to solve #157