DataLoader for Large Corpus File #19946

@zegzag

Description

🚀 Feature

  A DataLoader for large corpus files, supporting multi-process/multi-thread loading and memory optimization.

Motivation

  Thanks a lot for the PyTorch framework; I've benefited a lot from using it in my work. But I also have some confusion regarding data loading and processing.
  In NLP cases, suppose I have a very large labeled corpus file, "corpus.csv", which is too large to load into memory at once, or may even be infinitely large. I want to train my model on this dataset from top to bottom in mini-batches: in each batch, I yield a certain number of text lines, from the beginning of the file to the end (so shuffling is not supported). I can do this easily with a Python generator. However, I don't know how to subclass torch.utils.data.Dataset and use torch.utils.data.DataLoader to build a custom dataset over "corpus.csv": since the file cannot be fully loaded into memory, I cannot implement the __getitem__ and __len__ methods.
  I would be grateful if you could develop a module that satisfies this use case. With a plain Python generator alone, it is too hard for me to write a DataLoader that supports multi-process and multi-thread loading and memory optimization.
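  For concreteness, here is a minimal sketch of this streaming pattern, assuming the torch.utils.data.IterableDataset API (available since PyTorch 1.2), which lets a dataset define __iter__ instead of __getitem__/__len__. The file "corpus.csv" is assumed to hold one example per line, and CorpusDataset is an illustrative name, not a library class:

  ```python
  # Minimal sketch, assuming a line-oriented "corpus.csv" (one example per
  # line). CorpusDataset is a hypothetical name for illustration.
  import torch
  from torch.utils.data import IterableDataset


  class CorpusDataset(IterableDataset):
      """Yields lines lazily, so the file is never fully loaded into memory."""

      def __init__(self, path):
          self.path = path

      def __iter__(self):
          worker_info = torch.utils.data.get_worker_info()
          with open(self.path, encoding="utf-8") as f:
              for i, line in enumerate(f):
                  # With num_workers > 0, each worker process gets its own
                  # copy of the dataset; shard by line index so every line
                  # is yielded exactly once across workers.
                  if worker_info is not None and i % worker_info.num_workers != worker_info.id:
                      continue
                  yield line.rstrip("\n")
  ```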

Alternatives

  With multiprocessing, the model can be training on the current batch while the next batch is loaded at the same time.
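  Under the same assumptions as the sketch above, a DataLoader with num_workers > 0 gives exactly this overlap: worker processes prefetch upcoming batches while the main process trains on the current one. CorpusDataset and "corpus.csv" are the illustrative names from that sketch:

  ```python
  # Hedged usage sketch: background workers prepare the next batches while
  # the main process consumes the current one.
  from torch.utils.data import DataLoader

  loader = DataLoader(CorpusDataset("corpus.csv"), batch_size=32, num_workers=2)
  for batch in loader:
      # `batch` is a list of up to 32 raw text lines; tokenize and train here.
      pass
  ```

  Note that with more than one worker the batches interleave across workers, so strict top-to-bottom order is only guaranteed with num_workers=0 or 1.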

Metadata

Labels

  feature: A request for a proper, new feature.
  module: dataloader: Related to torch.utils.data.DataLoader and Sampler.
  triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module.
