This repository was archived by the owner on Apr 8, 2025. It is now read-only.

Slow Tokenizer if custom vocab was added #157

@tholor

Description


Describe the bug
The tokenizer becomes very slow once a large custom vocab has been added.

Additional context
This was introduced after switching to the tokenizers from the transformers repo.

There are related issues reported in the transformers repo.

To Reproduce

  • Add custom vocab to tokenizer via tokenizer.add_tokens()
  • Load some data into the data silo, e.g. run examples/lm_finetuning.py
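The slowdown is consistent with how non-fast tokenizers typically handle added tokens: the input text is split on each added token in turn, so tokenization cost grows with the size of the custom vocab. The sketch below is illustrative only (not FARM's or transformers' actual code, and `split_on_added_tokens` is a hypothetical helper); it shows the per-added-token splitting pattern that makes a large `add_tokens()` vocab expensive.

```python
# Illustrative sketch of why a large added vocab is slow: one full pass
# over the text pieces is made per added token, i.e. O(len(added_tokens)).
def split_on_added_tokens(text, added_tokens):
    """Split `text` so every added token becomes its own piece."""
    pieces = [text]
    for tok in added_tokens:  # one pass per added token
        next_pieces = []
        for piece in pieces:
            if piece in added_tokens:
                next_pieces.append(piece)  # already isolated, keep as-is
                continue
            parts = piece.split(tok)
            for i, part in enumerate(parts):
                if part:
                    next_pieces.append(part)
                if i < len(parts) - 1:
                    next_pieces.append(tok)  # re-insert the matched token
        pieces = next_pieces
    return pieces

print(split_on_added_tokens("hello [CUSTOM1] world", ["[CUSTOM1]", "[CUSTOM2]"]))
# → ['hello ', '[CUSTOM1]', ' world']
```

With thousands of added tokens, this per-token scan dominates preprocessing, which matches the slowdown observed when loading data into the data silo.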

System:

  • OS: Ubuntu 18.04
  • GPU/CPU: Both
  • FARM version: master @ 484d26c

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working), stale
