🚀 Feature
A DataLoader for large corpus files, supporting multi-process and multi-thread loading and memory optimization.
Motivation
Thanks a lot for the PyTorch framework; I've benefited from it a lot in my work. However, I have some confusion about data loading and processing.
In NLP use cases, suppose I have a very large corpus file of labeled data, "corpus.csv", which is too large to load into memory at once, or is even infinitely large (e.g. a stream). I want to train my model on this dataset in mini-batches, reading from top to bottom: in each batch I yield a certain number of text lines, from the beginning of the file to the end (so shuffling is not supported). I can do this easily with a Python generator. However, I don't know how to write a subclass of torch.utils.data.Dataset and use torch.utils.data.DataLoader to wrap my own dataset over "corpus.csv". Since "corpus.csv" cannot be loaded into memory in full, I cannot implement the __getitem__ and __len__ methods.
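A minimal sketch of the generator approach described above — streaming fixed-size batches of lines from a file without ever holding the whole file in memory. The function name `iter_batches` is mine, not part of any PyTorch API:

```python
from itertools import islice

def iter_batches(path, batch_size):
    """Yield lists of text lines from `path`, `batch_size` lines at a time.

    Only one batch is held in memory at once, so the file can be
    arbitrarily large (or even an unbounded stream of lines).
    """
    with open(path, encoding="utf-8") as f:
        while True:
            # islice pulls at most `batch_size` lines from the file iterator
            batch = list(islice(f, batch_size))
            if not batch:
                return  # end of file
            yield [line.rstrip("\n") for line in batch]
```

This is exactly the pattern that is hard to combine with the map-style Dataset/DataLoader API, since there is no random access and no known length.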
I would be grateful if you could develop a module that satisfies the cases above. With a plain Python generator alone, it is too hard for me to write a DataLoader that supports multi-process and multi-thread loading and memory optimization.
Alternatives
With multiprocessing, one could train the model on the current batch while loading the next batch at the same time.
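One way to sketch this overlap — loading the next batch while the current one is consumed — is a small prefetching wrapper around any generator. This uses a background thread and a bounded queue for simplicity (a worker process would follow the same pattern); `prefetch` and `max_prefetch` are my names, not an existing API:

```python
import threading
from queue import Queue

_SENTINEL = object()  # marks the end of the wrapped generator

def prefetch(generator, max_prefetch=2):
    """Consume `generator` in a background thread, buffering up to
    `max_prefetch` items, so the next batch is being loaded while the
    caller is busy with the current one (e.g. running a training step).
    """
    q = Queue(maxsize=max_prefetch)

    def worker():
        for item in generator:
            q.put(item)       # blocks when the buffer is full
        q.put(_SENTINEL)      # signal exhaustion to the consumer

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is _SENTINEL:
            return
        yield item
```

Usage would be `for batch in prefetch(iter_batches("corpus.csv", 64)): ...`, where batch N+1 is read from disk while batch N is used for training.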