IterableDataset should be able to provide a length #30184

@PCerles

I'm using the IterableDataset class to iterate through a list of tar files. Roughly, it works like

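import tarfile

import torch


class TarChunkDataset(torch.utils.data.IterableDataset):  # class name is illustrative
    def __init__(self, chunk_list):
        self.chunk_list = chunk_list
        NUM_FILES_IN_TAR_FILE = 4096  # known a priori for every chunk
        self.length = NUM_FILES_IN_TAR_FILE * len(chunk_list)

    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is None:
            # Single-process loading: this iterator reads every chunk.
            chunk_list = self.chunk_list
        else:
            # Each worker takes every num_workers-th chunk, starting at its own id.
            start_index = worker_info.id
            num_workers = worker_info.num_workers
            chunk_list = self.chunk_list[start_index::num_workers]
        while True:  # cycle over the chunks indefinitely
            for chunk in chunk_list:  # e.g. 000009.tar
                with tarfile.open(chunk) as tar:
                    for file in tar.getmembers():
                        # do something
                        data = file  # placeholder so the sketch runs
                        yield data

    def __len__(self):
        return self.length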
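
To make the sharding in __iter__ concrete, here is a small torch-free sketch of the same strided slice. The chunk names and the helper shard_chunks are made up for illustration and are not part of my actual code. The point is only that every chunk goes to exactly one worker, so the per-worker counts add up to the length computed in __init__.

```python
NUM_FILES_IN_TAR_FILE = 4096  # same a priori chunk size as in the dataset


def shard_chunks(chunk_list, worker_id, num_workers):
    """Hypothetical helper: the chunks a single worker would read."""
    return chunk_list[worker_id::num_workers]


# Illustrative chunk names, not real files.
chunk_list = [f"{i:06d}.tar" for i in range(10)]
num_workers = 4

shards = [shard_chunks(chunk_list, w, num_workers) for w in range(num_workers)]
per_worker_length = [NUM_FILES_IN_TAR_FILE * len(s) for s in shards]
total_length = NUM_FILES_IN_TAR_FILE * len(chunk_list)

for worker_id, shard in enumerate(shards):
    print(worker_id, shard)
```

Because the slices are disjoint and cover the whole list, sum(per_worker_length) equals total_length even when the chunks do not divide evenly among the workers.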

Since I know the chunk size a priori, I know the length of my dataset. I want to use that length to keep track of how far I am through an epoch, and because it plays nicely with PyTorch Ignite. In my opinion, whether an IterableDataset implements __len__ should be left to the user.
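
As a rough illustration of what I mean by tracking progress, the sketch below uses a known length to cut one epoch out of an endless iterator. It is plain Python with toy chunk sizes; endless_samples and run_epoch are hypothetical stand-ins, not PyTorch or Ignite APIs.

```python
import itertools


def endless_samples(chunk_sizes):
    """Stand-in for __iter__: cycles over the chunks forever."""
    while True:
        for chunk_index, size in enumerate(chunk_sizes):
            for member_index in range(size):
                yield (chunk_index, member_index)


def run_epoch(sample_iter, length, report_every):
    """Consume exactly `length` samples and return the progress fractions seen."""
    progress = []
    bounded = itertools.islice(sample_iter, length)
    for seen, _sample in enumerate(bounded, start=1):
        if seen % report_every == 0:
            progress.append(seen / length)
    return progress


chunk_sizes = [4, 4, 4]    # toy sizes instead of 4096 files per tar
length = sum(chunk_sizes)  # what __len__ would return
samples = endless_samples(chunk_sizes)
print(run_epoch(samples, length, report_every=3))
```

Without a length, the consumer has no way to tell where an epoch ends on an iterator like this; with it, the next call to run_epoch simply resumes at the first chunk again.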

Labels: enhancement, triaged
