Reduce needless copying when returning lists of tensors in the JIT interpreter. #21690
Conversation
facebook-github-bot
left a comment
@resistor has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
torch/csrc/jit/register_prim_ops.cpp
Outdated
Does moving all the elements of a and b create a use-after-move issue if a or b is accessed later?
Is it possible for someone to have another reference to them?
oh, yes it is. Add is a functional operation.
Force-pushed from fbefa94 to 5b0e8b9
driazati
left a comment
Looks good, though the resulting ops could be a little simpler by doing the dispatch on use_count at compile time by returning an Operation instead
This currently conflicts with #21170, which I think might have the same use-after-free bug as the original version of this...

I assume you meant to link to the …

I'd rather have the optimization inside the List class, so that more code could potentially profit from it. Afaik, #21896 is already doing that?
Is this going in soon? I have a PR to rebase past it.
Summary: In talks with smessmer, we decided that it'd be better to put the logic in `list`, as optimal behavior requires knowing `.capacity()`.

Results on my CPU (for the benchmark here: https://twitter.com/VahidK/status/1138674536679821312) now look like this:

```
Pytorch batch_gather took 0.018311 seconds.
Pytorch batch_gather jit took 0.013921 seconds.
Pytorch vectorized batch_gather took 0.001384 seconds.
```

Previously, `batch_gather jit` took 3x as long as `batch_gather`. Some logic taken from #21690. Note that these two PRs are somewhat orthogonal: that PR handles this benchmark by looking at the alias analysis, while this PR specializes for `+=`. Note that we can't jit the vectorized version as we think `torch.arange` returns a float tensor.

Pull Request resolved: #21896
Differential Revision: D15998628
Pulled By: Chillee
fbshipit-source-id: b0085960da4613578b94deb98ac62c0a4532a8c3
smessmer
left a comment
looks good, thanks
The comment below is attached to this hunk:

```cpp
for (T b_element : b) {
  ret.push_back(std::move(b_element));
}
```

```cpp
if (a.use_count() == 1) {
```
Wondering if we should have a List::make_unshared_copy() that ensures that after calling it, use_count == 1 and does a copy if it needs to, or something for this. Feels like this could be useful in other ops as well.
This pull request has been merged in 7cc8f37.
Reduce needless copying when returning lists of tensors in the JIT interpreter. (pytorch#21690)

Summary: This fixes the JIT performance gap reported in https://twitter.com/VahidK/status/1138677898439561216

Pull Request resolved: pytorch#21690
Differential Revision: D15783709
fbshipit-source-id: 23bb4acda6b60c27e95667e1d53c7d261a87167d