tokenization slow #1621

@EndruK

Description

❓ Questions & Help

Hi,
I want to fine-tune the gpt2 model on a very large corpus (~9 GB of text data).
However, the tokenization step of run_lm_finetuning.py takes forever (which is not surprising with a 9 GB text file).
My question is: is there any way to speed up the tokenization, e.g. with multiprocessing, or do I have to break up my training file and train on a sample?
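For context, one common pattern (not from the issue itself) is to split the input into chunks of lines and tokenize them in parallel with Python's `multiprocessing.Pool`. A minimal sketch follows; the whitespace split in `tokenize_chunk` is only a placeholder for a real tokenizer call (e.g. `GPT2Tokenizer.encode` from transformers), and the worker/chunk-size values are illustrative assumptions:

```python
from multiprocessing import Pool

def tokenize_chunk(lines):
    # Placeholder tokenizer: swap in a real call such as
    # tokenizer.encode(line) from transformers' GPT2Tokenizer.
    return [line.split() for line in lines]

def chunked(lines, size):
    # Yield successive slices of `size` lines.
    for i in range(0, len(lines), size):
        yield lines[i:i + size]

def parallel_tokenize(lines, workers=4, chunk_size=10000):
    # Tokenize chunks of lines across a pool of worker processes.
    with Pool(workers) as pool:
        results = pool.map(tokenize_chunk, chunked(lines, chunk_size))
    # Flatten the per-chunk results back into one list of token sequences.
    return [tokens for chunk in results for tokens in chunk]
```

Because process startup and inter-process pickling add overhead, larger chunks generally amortize better than tokenizing line by line; for a 9 GB file you would also want to stream the file rather than load all lines into memory at once.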

Best regards
