Skip to content

Unify the output pathname of archive reader and extractor#65424

Closed
ejguan wants to merge 1 commit intogh/ejguan/92/basefrom
gh/ejguan/92/head
Closed

Unify the output pathname of archive reader and extractor#65424
ejguan wants to merge 1 commit intogh/ejguan/92/basefrom
gh/ejguan/92/head

Conversation

@ejguan
Copy link
Copy Markdown
Contributor

@ejguan ejguan commented Sep 21, 2021

@facebook-github-bot
Copy link
Copy Markdown
Contributor

facebook-github-bot commented Sep 21, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit de83421 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

ejguan added a commit that referenced this pull request Sep 21, 2021
@ejguan
Copy link
Copy Markdown
Contributor Author

ejguan commented Sep 21, 2021

@ejguan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@ejguan merged this pull request in 96383ca.

@facebook-github-bot facebook-github-bot deleted the gh/ejguan/92/head branch September 25, 2021 14:19
NivekT pushed a commit that referenced this pull request Sep 30, 2021
Summary:
Pull Request resolved: #65424

This PR is re-implementation for https://github.com/facebookexternal/torchdata/pull/93
Same PR has landed into torchdata https://github.com/facebookexternal/torchdata/pull/157

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D31090447

Pulled By: ejguan

fbshipit-source-id: 45af1ad9b24310bebfd6e010f41cff398946ba65
malfet pushed a commit that referenced this pull request Oct 6, 2021
…5932)

* Unify the output pathname of archive reader and extractor (#65424)

Summary:
Pull Request resolved: #65424

This PR is re-implementation for https://github.com/facebookexternal/torchdata/pull/93
Same PR has landed into torchdata https://github.com/facebookexternal/torchdata/pull/157

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D31090447

Pulled By: ejguan

fbshipit-source-id: 45af1ad9b24310bebfd6e010f41cff398946ba65

* [DatePipe] add deprecation warnings for DataPipes that will solely exist in TorchData (#65827)

Summary: Pull Request resolved: #65827

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31272794

Pulled By: NivekT

fbshipit-source-id: 8da8266184b4df050422904cbc5fca6d7c3d2e02

* [DataPipe] Fixes an issue where TarArchiveReader closes stream when read into a buffer (#65877)

Summary:
Pull Request resolved: #65877

Fixes #65808

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31296041

Pulled By: NivekT

fbshipit-source-id: cdcad3a333ae9781d6063678a122a128955b0ff4

Co-authored-by: Erjia Guan <[email protected]>
@pmeier
Copy link
Copy Markdown
Collaborator

pmeier commented Oct 18, 2021

I would consider the new behavior a bug or at least a bad UX. Let's say I have the following setup

foo/
└── bar.tar
    └── spam.txt

Using TarArchiveReader, I now get foo/spam.txt as path. This is misleading, since this file does not exist. Of course foo/bar.tar/spam.txt also doesn't exists, but it at least gives you the indication, that the file is inside the archive. Furthermore, we lose the name of the archive which might be needed for further processing.

This gets even worse if we have nested archives:

foo
└── bar.tar
    └── baz.tar
        └── spam.txt

This will now also result foo/spam.txt, whereas before we would get foo/bar.tar/baz.tar/spam.txt acknowledging the nested archives.

@ejguan
Copy link
Copy Markdown
Contributor Author

ejguan commented Oct 18, 2021

@pmeier

Using TarArchiveReader, I now get foo/spam.txt as path.

It should be foo/bar/spam.txt. Without this PR, the returned path is foo/bar.tar/bar/spam.txt.

@pmeier
Copy link
Copy Markdown
Collaborator

pmeier commented Oct 18, 2021

It should be foo/bar/spam.txt.

  1. That would still be misleading, since the directory foo/bar doesn't exist.
  2. That is not the case:
tmp/foo/
└── bar.tar
    └── spam.txt
from torch.utils.data.datapipes.iter import TarArchiveReader, FileLister, FileLoader

dp = FileLister("/tmp/foo")
dp = FileLoader(dp)
dp = TarArchiveReader(dp)

print(tuple(dp))
(('/tmp/foo/spam.txt', <ExFileObject name='/tmp/foo/bar.tar'>),)

tmp/foo/
└── bar.tar
    └── baz.tar
        └── spam.txt
from torch.utils.data.datapipes.iter import TarArchiveReader, FileLister, FileLoader

dp = FileLister("/tmp/foo")
dp = FileLoader(dp)
dp = TarArchiveReader(dp)
dp = TarArchiveReader(dp)

print(tuple(dp))
(('/tmp/foo/spam.txt', <ExFileObject name='/tmp/foo/bar.tar'>),)

To get the your path, we need the following setup:

/tmp/foo
└── bar.tar
    └── bar
        └── spam.txt
from torch.utils.data.datapipes.iter import TarArchiveReader, FileLister, FileLoader

dp = FileLister("/tmp/foo")
dp = FileLoader(dp)
dp = TarArchiveReader(dp)

print(tuple(dp))
(('/tmp/foo/bar/spam.txt', <ExFileObject name='/tmp/foo/bar.tar'>),)

Without this PR, the returned path is foo/bar.tar/bar/spam.txt.

Nope. Rolling back the changes of this PR, we get the following output for the three setups above:

  1. (('/tmp/foo/bar.tar/spam.txt', <ExFileObject name='/tmp/foo/bar.tar'>),)
  2. (('/tmp/foo/bar.tar/baz.tar/spam.txt', <ExFileObject name='/tmp/foo/bar.tar'>),)
  3. (('/tmp/foo/bar.tar/bar/spam.txt', <ExFileObject name='/tmp/foo/bar.tar'>),)

@ejguan
Copy link
Copy Markdown
Contributor Author

ejguan commented Oct 18, 2021

I hope this is not something that varies across python version. Let me take a deeper look.

@ejguan
Copy link
Copy Markdown
Contributor Author

ejguan commented Oct 19, 2021

@pmeier
You are right. We should not do that. I was testing against the dataset from TorchText. The archive is created from a folder rather directly from multiple files like you mentioned.

/tmp/foo
└── bar.tar
    └── bar
        └── spam.txt

Thanks for that. I will send a patch shortly and fix the tests for TorchText

ejguan added a commit to meta-pytorch/data that referenced this pull request Oct 20, 2021
ejguan added a commit to meta-pytorch/data that referenced this pull request Oct 20, 2021

Previous implementation of `ArchiveReader` has a bug. Please take a reference from pytorch/pytorch#65424 (comment)

cc: @Nayef211

Differential Revision: [D31797765](https://our.internmc.facebook.com/intern/diff/D31797765)

[ghstack-poisoned]
ejguan added a commit to meta-pytorch/data that referenced this pull request Oct 20, 2021
ejguan added a commit to meta-pytorch/data that referenced this pull request Oct 20, 2021

Previous implementation of `ArchiveReader` has a bug. Please take a reference from pytorch/pytorch#65424 (comment)

cc: @Nayef211

Differential Revision: [D31797765](https://our.internmc.facebook.com/intern/diff/D31797765)

[ghstack-poisoned]
ejguan added a commit to meta-pytorch/data that referenced this pull request Oct 20, 2021
ejguan added a commit to meta-pytorch/data that referenced this pull request Oct 20, 2021

Previous implementation of `ArchiveReader` has a bug. Please take a reference from pytorch/pytorch#65424 (comment)

cc: @Nayef211

Differential Revision: [D31797765](https://our.internmc.facebook.com/intern/diff/D31797765)

[ghstack-poisoned]
ejguan added a commit to meta-pytorch/data that referenced this pull request Oct 20, 2021
ejguan added a commit to meta-pytorch/data that referenced this pull request Oct 20, 2021

Previous implementation of `ArchiveReader` has a bug. Please take a reference from pytorch/pytorch#65424 (comment)

cc: @Nayef211

Differential Revision: [D31797765](https://our.internmc.facebook.com/intern/diff/D31797765)

[ghstack-poisoned]
ejguan added a commit to meta-pytorch/data that referenced this pull request Oct 20, 2021
ejguan added a commit to meta-pytorch/data that referenced this pull request Oct 20, 2021

Previous implementation of `ArchiveReader` has a bug. Please take a reference from pytorch/pytorch#65424 (comment)

cc: @Nayef211

Differential Revision: [D31797765](https://our.internmc.facebook.com/intern/diff/D31797765)

[ghstack-poisoned]
ejguan added a commit to meta-pytorch/data that referenced this pull request Oct 20, 2021
ejguan added a commit to meta-pytorch/data that referenced this pull request Oct 20, 2021

Previous implementation of `ArchiveReader` has a bug. Please take a reference from pytorch/pytorch#65424 (comment)

cc: @Nayef211

Differential Revision: [D31797765](https://our.internmc.facebook.com/intern/diff/D31797765)

[ghstack-poisoned]
ejguan added a commit to meta-pytorch/data that referenced this pull request Oct 20, 2021
ejguan added a commit to meta-pytorch/data that referenced this pull request Oct 20, 2021

Previous implementation of `ArchiveReader` has a bug. Please take a reference from pytorch/pytorch#65424 (comment)

cc: @Nayef211

Differential Revision: [D31797765](https://our.internmc.facebook.com/intern/diff/D31797765)

[ghstack-poisoned]
facebook-github-bot pushed a commit to meta-pytorch/data that referenced this pull request Oct 20, 2021
Summary:
Pull Request resolved: #73

Previous implementation of `ArchiveReader` has a bug. Please take a reference from pytorch/pytorch#65424 (comment)

Reviewed By: NivekT

Differential Revision: D31797765

fbshipit-source-id: 494e1a49b43d5a846de971a67586089e6d7ebafc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants