Closed
Labels
- module: dataloader (Related to torch.utils.data.DataLoader and Sampler)
- triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Description
For TarArchiveReader and ZipArchiveReader, we attach the archive file stream to each file stream here:
pytorch/torch/utils/data/datapipes/iter/tararchivereader.py
Lines 55 to 57 in 0ef8760
    # Add a reference of the source tarfile into extracted_fobj, so the source
    # tarfile handle won't be released until all the extracted file objs are destroyed.
    extracted_fobj.source_ref = tar  # type: ignore[attr-defined]
The main reason is to prevent the archive file stream from being closed by gc:
If we deplete the archive reader DataPipe using list(dp) without attaching the archive file stream, all the file streams in the resulting list are already closed. Here is the test that validates this:
Lines 225 to 232 in 0ef8760
    # read extracted files after reaching the end of the tarfile
    data_refs = list(datapipe3)
    self.assertEqual(len(data_refs), len(self.temp_files))
    for data_ref, temp_file in zip(data_refs, self.temp_files):
        self.assertEqual(os.path.basename(data_ref[0]), os.path.basename(temp_file))
        with open(temp_file, 'rb') as f:
            self.assertEqual(data_ref[1].read(), f.read())
        data_ref[1].close()
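The dependency between an extracted file stream and its archive stream can be sketched with the standard library alone. This is not the DataPipe code: gc timing is hard to reproduce deterministically, so the sketch closes the archive stream explicitly as a stand-in for gc releasing it, and the archive contents (`a.txt`) are made up for illustration.

```python
import io
import tarfile

# Build a tiny in-memory tar archive (hypothetical contents).
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as writer:
    info = tarfile.TarInfo(name="a.txt")
    info.size = 5
    writer.addfile(info, io.BytesIO(b"alpha"))
buf.seek(0)

# Extract a member lazily: the returned file object reads from `buf` on demand.
tar = tarfile.open(fileobj=buf, mode="r:")
extracted = tar.extractfile(tar.getmember("a.txt"))

# Stand-in for the archive stream being released before the member is read.
buf.close()

read_error = None
try:
    extracted.read()
except ValueError as exc:
    # The member stream is unusable once the underlying archive stream is closed.
    read_error = exc
```

Here `read_error` ends up holding a `ValueError`, which is the kind of failure the `source_ref` attribute is meant to avoid by keeping the archive handle reachable.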
As a result, in tests like the one here:
Line 207 in 0ef8760
    # TODO(VitalyFedyunin): Generates unclosed buffer warning, need to investigate
a warning is raised because we rely on gc to close the archive stream at the end.
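For context, the kind of warning described above can be reproduced outside the DataPipe tests. The following is a minimal sketch assuming CPython, where a file object that is finalized without an explicit close emits a `ResourceWarning`; the helper name `count_resource_warnings` is hypothetical and a temporary file stands in for the archive.

```python
import gc
import os
import tempfile
import warnings


def count_resource_warnings(close_explicitly):
    """Open a temp file, optionally close it, drop it, and count ResourceWarnings."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            stream = open(path, "rb")
            if close_explicitly:
                stream.close()
            # Drop the last reference; without an explicit close, the
            # interpreter has to close the file during finalization.
            del stream
            gc.collect()
        return sum(issubclass(w.category, ResourceWarning) for w in caught)
    finally:
        os.remove(path)


left_to_gc = count_resource_warnings(close_explicitly=False)
closed_by_hand = count_resource_warnings(close_explicitly=True)
```

Only the run that leaves cleanup to the interpreter records a warning, which matches the situation in the tests marked with the TODO.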
Two options:
- Remove the reference to the archive stream from each file stream, but somehow still make list(dp) work
- Find a way to eliminate the opened file stream warning in these tests (TODOs)
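One possible direction for the second option is to close the archive explicitly through the attached reference once every extracted stream has been consumed. The sketch below uses only the standard library rather than the actual DataPipe classes; `make_tar_bytes` and `read_tar` are hypothetical helpers, and only the `source_ref` attribute name is taken from the snippet above.

```python
import io
import tarfile


def make_tar_bytes(files):
    """Build an in-memory tar archive from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_tar(archive_stream):
    """Yield (name, file object) pairs, attaching the archive handle to each."""
    tar = tarfile.open(fileobj=archive_stream, mode="r:*")
    for member in tar:
        if not member.isfile():
            continue
        extracted = tar.extractfile(member)
        # Same idea as in the issue: keep the archive handle reachable.
        extracted.source_ref = tar
        yield member.name, extracted


archive_stream = io.BytesIO(make_tar_bytes({"a.txt": b"alpha", "b.txt": b"beta"}))

# Deplete the reader first, then read the extracted streams afterwards.
items = list(read_tar(archive_stream))
contents = {name: fobj.read() for name, fobj in items}

# Explicit cleanup instead of relying on gc: close each member stream,
# then the archive handle it references, then the underlying stream.
for _, fobj in items:
    fobj.close()
    fobj.source_ref.close()  # closing an already closed TarFile is a no-op
archive_stream.close()
```

Whether this is acceptable depends on who owns the archive stream; in this sketch the caller owns it and closes it last.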