This repository was archived by the owner on Apr 8, 2025. It is now read-only.

Slow Tokenizer if custom vocab was added #157

@tholor

Description


Describe the bug
The tokenizer becomes very slow once a large custom vocab has been added.

Additional context
This was introduced after switching to the tokenizers from the transformers repo.

There are related issues reported in the transformers repo.

To Reproduce

  • Add custom vocab to tokenizer via tokenizer.add_tokens()
  • Load some data into the data silo, e.g. run examples/lm_finetuning.py
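The slowdown is consistent with how non-fast tokenizers typically handle added tokens: the input text is split on each added token in turn, so tokenization cost grows with the size of the custom vocab. The sketch below is illustrative only (not FARM's or transformers' actual code, and `split_on_added_tokens` is a hypothetical helper); it shows the per-added-token splitting pattern that makes a large `add_tokens()` vocab expensive.

```python
# Illustrative sketch of why a large added vocab is slow: one full pass
# over the text pieces is made per added token, i.e. O(len(added_tokens)).
def split_on_added_tokens(text, added_tokens):
    """Split `text` so every added token becomes its own piece."""
    pieces = [text]
    for tok in added_tokens:  # one pass per added token
        next_pieces = []
        for piece in pieces:
            if piece in added_tokens:
                next_pieces.append(piece)  # already isolated, keep as-is
                continue
            parts = piece.split(tok)
            for i, part in enumerate(parts):
                if part:
                    next_pieces.append(part)
                if i < len(parts) - 1:
                    next_pieces.append(tok)  # re-insert the matched token
        pieces = next_pieces
    return pieces

print(split_on_added_tokens("hello [CUSTOM1] world", ["[CUSTOM1]", "[CUSTOM2]"]))
# → ['hello ', '[CUSTOM1]', ' world']
```

With thousands of added tokens, this per-token scan dominates preprocessing, which matches the slowdown observed when loading data into the data silo.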

System:

  • OS: Ubuntu 18.04
  • GPU/CPU: Both
  • FARM version: master @ 484d26c

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working), stale
