❓ Questions & Help
Hi,
I want to fine-tune the GPT-2 model on a very large corpus (~9 GB of text data).
However, the tokenization in run_lm_finetuning.py takes forever (which is not surprising with a 9 GB text file).
My question is: is there any way to speed up the tokenization, e.g. with multiprocessing, or do I have to break up my training file and train on a sample?
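For illustration, parallel tokenization along the lines asked about could look like the sketch below: split the file into chunks of lines and tokenize the chunks in a `multiprocessing` pool. The `tokenize_chunk` stand-in is hypothetical (a real run would call the GPT-2 tokenizer's `encode` there); chunk size and worker count are assumptions to tune.

```python
import multiprocessing as mp

def tokenize_chunk(chunk):
    # Stand-in for the real tokenizer call (e.g. tokenizer.encode(chunk));
    # here we just split on whitespace so the sketch is self-contained.
    return chunk.split()

def tokenize_file(path, workers=4, lines_per_chunk=10000):
    # Read the corpus and group lines into chunks so each worker
    # gets a reasonably large piece of text.
    chunks, buf = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            buf.append(line)
            if len(buf) >= lines_per_chunk:
                chunks.append("".join(buf))
                buf = []
    if buf:
        chunks.append("".join(buf))
    # Tokenize the chunks in parallel across worker processes.
    with mp.Pool(workers) as pool:
        token_lists = pool.map(tokenize_chunk, chunks)
    # Flatten the per-chunk token lists into one token stream.
    return [tok for toks in token_lists for tok in toks]
```

Note that splitting at line boundaries keeps each chunk independently tokenizable, which is what makes the parallelism safe.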
Best regards