Skip to content

[DataLoader] Fix collation logic#97789

Closed
NivekT wants to merge 2 commits intogh/nivekt/57/basefrom
gh/nivekt/57/head
Closed

[DataLoader] Fix collation logic#97789
NivekT wants to merge 2 commits intogh/nivekt/57/basefrom
gh/nivekt/57/head

Conversation

@NivekT
Copy link
Copy Markdown
Contributor

@NivekT NivekT commented Mar 28, 2023

Stack from ghstack (oldest at bottom):

Similar to #97737, a previous auto-refactor changed how bytes are handled during collation, which can potentially lead to performance regression. This PR undoes that.

@pytorch-bot pytorch-bot bot added the release notes: dataloader release notes category label Mar 28, 2023
@pytorch-bot
Copy link
Copy Markdown

pytorch-bot bot commented Mar 28, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/97789

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit bc0d041:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@NivekT NivekT requested a review from albanD March 28, 2023 15:40
@NivekT NivekT added the topic: improvements topic category label Mar 28, 2023
Copy link
Copy Markdown
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
Should we have a one off test for this?

FYI @malfet

@NivekT
Copy link
Copy Markdown
Contributor Author

NivekT commented Mar 28, 2023

@pytorchbot merge -r

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 28, 2023
@NivekT NivekT added this to the 2.0.1 milestone Mar 28, 2023
@pytorchmergebot
Copy link
Copy Markdown
Collaborator

@pytorchbot successfully started a rebase job. Check the current status here

Similar to #97737, a previous auto-refactor changed how `bytes` are handled during collation, which can potentially lead to performance regression. This PR undoes that.

[ghstack-poisoned]
@pytorchmergebot
Copy link
Copy Markdown
Collaborator

Successfully rebased gh/nivekt/57/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/97789)

pytorchmergebot pushed a commit that referenced this pull request Mar 28, 2023
ghstack-source-id: 126f4d7
Pull Request resolved: #97789
@pytorchmergebot
Copy link
Copy Markdown
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

pytorchmergebot pushed a commit that referenced this pull request Mar 30, 2023
Revisit `torch._six.string_classes` (which is `(str, bytes)`) removal: `isinstance(obj, string_classes) -> isinstance(obj, str)`.

Both `str` and `bytes` are `Sequence` classes.

```python
In [1]: from typing import Sequence

In [2]: issubclass(bytes, Sequence)
Out[2]: True

In [3]: issubclass(str, Sequence)
Out[3]: True
```

Re-add `bytes` to type guards like:

```python
def is_seq(obj):
    return isinstance(obj, Sequence) and not isinstance(obj, (str, bytes))
```

Ref:

- #94709 (comment)
- #97737
- #97789
Pull Request resolved: #97863
Approved by: https://github.com/Skylion007, https://github.com/albanD
XuehaiPan pushed a commit to XuehaiPan/pytorch that referenced this pull request Mar 31, 2023
Similar to pytorch#97737, a previous auto-refactor changed how `bytes` are handled during collation, which can potentially lead to performance regression. This PR undoes that.
Pull Request resolved: pytorch#97789
Approved by: https://github.com/albanD
XuehaiPan added a commit to XuehaiPan/pytorch that referenced this pull request Mar 31, 2023
…97863)

Revisit `torch._six.string_classes` (which is `(str, bytes)`) removal: `isinstance(obj, string_classes) -> isinstance(obj, str)`.

Both `str` and `bytes` are `Sequence` classes.

```python
In [1]: from typing import Sequence

In [2]: issubclass(bytes, Sequence)
Out[2]: True

In [3]: issubclass(str, Sequence)
Out[3]: True
```

Re-add `bytes` to type guards like:

```python
def is_seq(obj):
    return isinstance(obj, Sequence) and not isinstance(obj, (str, bytes))
```

Ref:

- pytorch#94709 (comment)
- pytorch#97737
- pytorch#97789
Pull Request resolved: pytorch#97863
Approved by: https://github.com/Skylion007, https://github.com/albanD
NivekT added a commit that referenced this pull request Apr 4, 2023
Similar to #97737, a previous auto-refactor changed how `bytes` are handled during collation, which can potentially lead to performance regression. This PR undoes that.
Pull Request resolved: #97789
Approved by: https://github.com/albanD
atalman pushed a commit that referenced this pull request Apr 5, 2023
…97789, #97863) (#98055)

* [DataLoader] Short circuit pin_memory recursion when operating on bytes (#97737)

Slack thread: https://pytorch.slack.com/archives/GEEQ2K4MD/p1679962409906099

I was seeing some massive (~2x) slowdowns on a job after running it on PyTorch 2.0. From some profiling in `py-spy` it looked like the pin_memory thread was doing a lot more work than before. Looking at a trace in `nsys` I saw the thread doing the forward pass having a bunch of `pthread_cond_timedwait` with GIL reacquire calls in it’s call stack, and it seemed like the thread doing the forward pass was getting blocked (waiting for the GIL) by the pin memory thread (which was holding the GIL).

After some debugging I found out the issue. If a `bytes` was passed into `pin_memory`, previously in 1.13 (before #94709) it would short-circuit and return here
https://github.com/pytorch/pytorch/blob/d922c29a22e4bf0fba49526f7536395eb8cd66f4/torch/utils/data/_utils/pin_memory.py#L54-L55
since `bytes` was in `torch._six.string_classes`:
```
>>> from torch._six import string_classes
>>> string_classes
(<class 'str'>, <class 'bytes'>)
>>>
```

However after #94709, if a `bytes` was passed into `pin_memory` it would fall into here instead
https://github.com/pytorch/pytorch/blob/c263bd43e8e8502d4726643bc6fd046f0130ac0e/torch/utils/data/_utils/pin_memory.py#L68-L73
because the previous check is now doing `isinstance(data, str)` instead of `isinstance(data, (str, bytes))`!
https://github.com/pytorch/pytorch/blob/c263bd43e8e8502d4726643bc6fd046f0130ac0e/torch/utils/data/_utils/pin_memory.py#L56-L57

As a result, `pin_memory` gets called recursively for each element in the `bytes` leading to a ton of wasted recursion. This also explains the slowdown / GIL contention I was seeing.

This PR simply changes `isinstance(data, str)` to `isinstance(data, (str, bytes))` to match the behavior before #94709

Pull Request resolved: #97737
Approved by: https://github.com/albanD, https://github.com/NivekT

* [DataLoader] Fix  collation logic (#97789)

Similar to #97737, a previous auto-refactor changed how `bytes` are handled during collation, which can potentially lead to performance regression. This PR undoes that.
Pull Request resolved: #97789
Approved by: https://github.com/albanD

* Revisit `torch._six.string_classes` removal (#94709) (#97863)

Revisit `torch._six.string_classes` (which is `(str, bytes)`) removal: `isinstance(obj, string_classes) -> isinstance(obj, str)`.

Both `str` and `bytes` are `Sequence` classes.

```python
In [1]: from typing import Sequence

In [2]: issubclass(bytes, Sequence)
Out[2]: True

In [3]: issubclass(str, Sequence)
Out[3]: True
```

Re-add `bytes` to type guards like:

```python
def is_seq(obj):
    return isinstance(obj, Sequence) and not isinstance(obj, (str, bytes))
```

Ref:

- #94709 (comment)
- #97737
- #97789
Pull Request resolved: #97863
Approved by: https://github.com/Skylion007, https://github.com/albanD

---------

Co-authored-by: Eric Zhang <[email protected]>
Co-authored-by: Kevin Tse <[email protected]>
@facebook-github-bot facebook-github-bot deleted the gh/nivekt/57/head branch June 8, 2023 18:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged release notes: dataloader release notes category topic: improvements topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants