Investigate warning of unclosed ArchiveStream in DataPipe #64281

@ejguan

For TarArchiveReader and ZipArchiveReader, we attach the archive file stream to each file stream here:

# Add a reference of the source tarfile into extracted_fobj, so the source
# tarfile handle won't be released until all the extracted file objs are destroyed.
extracted_fobj.source_ref = tar # type: ignore[attr-defined]

The main reason is to prevent the archive file stream from being closed by gc.
If we deplete the archive reader DataPipe using list(dp) without attaching the archive file stream, all the file streams within the list are actually closed. Here is the test that validates this:

# read extracted files after reaching the end of the tarfile
data_refs = list(datapipe3)
self.assertEqual(len(data_refs), len(self.temp_files))
for data_ref, temp_file in zip(data_refs, self.temp_files):
    self.assertEqual(os.path.basename(data_ref[0]), os.path.basename(temp_file))
    with open(temp_file, 'rb') as f:
        self.assertEqual(data_ref[1].read(), f.read())
    data_ref[1].close()
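
To make the dependency between an extracted file object and its archive concrete, here is a minimal standard-library sketch. It is not the DataPipe code: `make_tar`, `read_after_archive_closed` and `read_with_source_ref` are hypothetical helpers, and an explicit `tar.close()` stands in for whatever closes the archive stream in the real pipeline. It only shows that a member cannot be read once the archive's handle is closed, and that the `source_ref` pattern keeps the archive reachable from the member.

```python
import os
import tarfile
import tempfile


def make_tar(directory, files):
    """Write a small tar archive from a {name: bytes} mapping and return its path."""
    tar_path = os.path.join(directory, "sample.tar")
    with tarfile.open(tar_path, "w") as tar:
        for name, payload in files.items():
            src = os.path.join(directory, name)
            with open(src, "wb") as f:
                f.write(payload)
            tar.add(src, arcname=name)
    return tar_path


def read_after_archive_closed(tar_path, name):
    """Try to read a member after its archive was closed; return None on failure."""
    tar = tarfile.open(tar_path)
    member = tar.extractfile(name)
    tar.close()  # assumption: stands in for the archive stream being closed elsewhere
    try:
        return member.read()
    except (ValueError, OSError):
        # The member reads lazily from the archive's file handle, which is now closed.
        return None
    finally:
        member.close()


def read_with_source_ref(tar_path, name):
    """Keep the archive reachable through the member, as in the snippet above."""
    tar = tarfile.open(tar_path)
    member = tar.extractfile(name)
    member.source_ref = tar  # the member now holds a reference to its archive
    del tar  # the local name is gone, the archive is still reachable
    try:
        return member.read()
    finally:
        member.close()
        member.source_ref.close()  # someone still has to close the archive explicitly
```

Note that the last line of `read_with_source_ref` is exactly the step the DataPipe currently leaves to gc.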

As a result, tests like the one marked here:

# TODO(VitalyFedyunin): Generates unclosed buffer warning, need to investigate

raise a warning because we rely on gc to close the archive stream at the end.
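
The kind of warning involved can be reproduced outside of DataPipe. The sketch below assumes CPython, where a file object that is destroyed without being closed emits a `ResourceWarning`. `count_resource_warnings` is a hypothetical helper, and a plain temporary file stands in for the archive stream.

```python
import gc
import os
import tempfile
import warnings


def count_resource_warnings(close_explicitly):
    """Open a file, optionally close it, drop it, and count ResourceWarnings."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with warnings.catch_warnings(record=True) as caught:
            # ResourceWarning is ignored by default, so enable it for this block.
            warnings.simplefilter("always")
            stream = open(path, "rb")
            if close_explicitly:
                stream.close()
            del stream    # drop the last reference to the stream
            gc.collect()  # for interpreters that do not free objects immediately
        return sum(issubclass(w.category, ResourceWarning) for w in caught)
    finally:
        os.remove(path)
```

Calling `count_resource_warnings(False)` mirrors the situation in the tests: nothing closes the stream, so the interpreter reports it when the object is finalized.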

Two options:

  • Remove the reference to the archive stream from each file stream, but still make list(dp) work
  • Find a way to eliminate the unclosed file stream warning in these tests (TODOs)
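
One possible direction that touches both options, sketched with the standard library only: count the member streams that have been handed out and close the archive explicitly once iteration has finished and the last member is closed. Everything here (`SharedArchive`, `_Member`, `iter_members`) is hypothetical and not part of `torch.utils.data`. It also assumes the archive is opened from a seekable path, so members can still be read after iteration ends.

```python
import tarfile


class SharedArchive:
    """Hypothetical owner that closes the archive after its last member is closed."""

    def __init__(self, archive):
        self._archive = archive
        self._open_members = 0
        self._exhausted = False
        self.closed = False

    def wrap(self, fobj):
        self._open_members += 1
        return _Member(fobj, self)

    def mark_exhausted(self):
        # Called once no further members will be handed out.
        self._exhausted = True
        self._maybe_close()

    def _release(self):
        self._open_members -= 1
        self._maybe_close()

    def _maybe_close(self):
        if self._exhausted and self._open_members == 0 and not self.closed:
            self._archive.close()
            self.closed = True


class _Member:
    """Thin wrapper around an extracted file object that reports its close()."""

    def __init__(self, fobj, owner):
        self._fobj = fobj
        self._owner = owner
        self.closed = False

    def read(self, *args):
        return self._fobj.read(*args)

    def close(self):
        if not self.closed:
            self.closed = True
            self._fobj.close()
            self._owner._release()


def iter_members(path):
    """Yield (name, stream) pairs; the archive is closed with the last stream."""
    tar = tarfile.open(path)
    owner = SharedArchive(tar)
    try:
        for info in tar:
            if info.isfile():
                yield info.name, owner.wrap(tar.extractfile(info))
    finally:
        owner.mark_exhausted()
```

With this shape, `list(iter_members(path))` still returns readable streams, and the archive is closed deterministically when the caller closes the last one, as the test above already does with `data_ref[1].close()`. A member that is never closed would still leave the archive to gc, so this narrows the problem rather than removing it.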

cc @ssnl @VitalyFedyunin @ejguan

Labels

module: dataloader, triaged