DataLoader for Large Corpus File #19946

@zegzag

Description

🚀 Feature

  A DataLoader for large corpus files, supporting multi-process/multi-thread loading and memory optimization.

Motivation

  Thanks a lot for the PyTorch framework; I've benefited a lot from using it in my work. But I also have some confusion regarding data loading and processing.
  In NLP cases, suppose I have a very large labeled corpus file, "corpus.csv", which is too large to load into memory at once, or may even be infinitely large. I want to train my model on this dataset from top to bottom in mini-batches: in each batch, I yield a certain number of text lines, from the beginning of the file to the end (so shuffling is not supported). I can do this easily with a Python generator. However, I don't know how to subclass torch.utils.data.Dataset and use torch.utils.data.DataLoader to build a custom dataset over "corpus.csv": since the file cannot be fully loaded into memory, I cannot implement the __getitem__ and __len__ methods.
  I would be grateful if you could develop a module that satisfies this use case. With a plain Python generator alone, it is too hard for me to write a DataLoader that supports multi-process and multi-thread loading and memory optimization.
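  For concreteness, here is a minimal sketch of this streaming pattern, assuming the torch.utils.data.IterableDataset API (available since PyTorch 1.2), which lets a dataset define __iter__ instead of __getitem__/__len__. The file "corpus.csv" is assumed to hold one example per line, and CorpusDataset is an illustrative name, not a library class:

  ```python
  # Minimal sketch, assuming a line-oriented "corpus.csv" (one example per
  # line). CorpusDataset is a hypothetical name for illustration.
  import torch
  from torch.utils.data import IterableDataset


  class CorpusDataset(IterableDataset):
      """Yields lines lazily, so the file is never fully loaded into memory."""

      def __init__(self, path):
          self.path = path

      def __iter__(self):
          worker_info = torch.utils.data.get_worker_info()
          with open(self.path, encoding="utf-8") as f:
              for i, line in enumerate(f):
                  # With num_workers > 0, each worker process gets its own
                  # copy of the dataset; shard by line index so every line
                  # is yielded exactly once across workers.
                  if worker_info is not None and i % worker_info.num_workers != worker_info.id:
                      continue
                  yield line.rstrip("\n")
  ```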

Alternatives

  With multiprocessing, the model can be training on the current batch while the next batch is loaded at the same time.
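  Under the same assumptions as the sketch above, a DataLoader with num_workers > 0 gives exactly this overlap: worker processes prefetch upcoming batches while the main process trains on the current one. CorpusDataset and "corpus.csv" are the illustrative names from that sketch:

  ```python
  # Hedged usage sketch: background workers prepare the next batches while
  # the main process consumes the current one.
  from torch.utils.data import DataLoader

  loader = DataLoader(CorpusDataset("corpus.csv"), batch_size=32, num_workers=2)
  for batch in loader:
      # `batch` is a list of up to 32 raw text lines; tokenize and train here.
      pass
  ```

  Note that with more than one worker the batches interleave across workers, so strict top-to-bottom order is only guaranteed with num_workers=0 or 1.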

Metadata

Labels

  feature: A request for a proper, new feature.
  module: dataloader: Related to torch.utils.data.DataLoader and Sampler.
  triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module.
