Partially parallelize randperm on CPU. #21529
Conversation
This could use a more detailed description, if there is more context around this change.
@jeffreyksmithjr Thanks, I updated the description. I don't have much more to say -- this is a pretty general performance improvement that is self-explanatory.
VitalyFedyunin left a comment:
Generally looks good, but consider:
- moving the CPU code into the native/cpu folder, as that will enable AVX support (see the sketch after this list);
- implementing a special code branch for the case when r__stride_0 == 1, since that will almost always be the case and the kernel will run much faster.
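For concreteness, here is a minimal sketch of what the native/cpu move could look like, using ATen's DispatchStub mechanism (files under aten/src/ATen/native/cpu are compiled multiple times with different ISA flags, which is what enables AVX). The names randperm_fn, randperm_stub, and randperm_cpu_kernel are hypothetical, for illustration only:

// In a shared header under aten/src/ATen/native: declare the stub.
#include <ATen/native/DispatchStub.h>

namespace at { namespace native {
using randperm_fn = void (*)(Tensor& result, int64_t n);  // hypothetical signature
DECLARE_DISPATCH(randperm_fn, randperm_stub);
}} // namespace at::native

// In a regular .cpp under aten/src/ATen/native: define the stub once,
// then invoke it as randperm_stub(kCPU, result, n).
namespace at { namespace native {
DEFINE_DISPATCH(randperm_stub);
}} // namespace at::native

// In a .cpp under aten/src/ATen/native/cpu: register the kernel. This file
// is built once per CPU capability (scalar/AVX/AVX2), so the loop can vectorize.
namespace at { namespace native {
static void randperm_cpu_kernel(Tensor& result, int64_t n) {
  // ... parallel initialization + shuffle ...
}
REGISTER_DISPATCH(randperm_stub, &randperm_cpu_kernel);
}} // namespace at::native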
@VitalyFedyunin Thanks, this is really good to know. I will try to look into how to move more CPU functions into the native/cpu folder. Do you happen to know where I can systematically learn this kind of information about PyTorch internals? It would be quite helpful if such a document were available.
We are working on better docs now; for the moment you can look at aten/src/ATen/native/cpu/README.md. It is also reasonable to attach your benchmark methodology and results, especially when working on performance-related tasks.
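For instance, a minimal timing sketch (an illustration of one possible methodology, not the benchmark actually attached to this PR) that measures randperm through ATen:

#include <ATen/ATen.h>
#include <chrono>
#include <cstdio>

int main() {
  const int64_t n = 10000000;  // large enough for threading to matter
  at::randperm(n);             // warm-up: pay one-time allocator/thread-pool costs
  constexpr int kIters = 10;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; ++i) {
    at::randperm(n);
  }
  std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  std::printf("randperm(%lld): %.4f s/iter\n",
              static_cast<long long>(n), elapsed.count() / kIters);
  return 0;
}

Running the same binary before and after the change, with identical thread settings, gives a rough per-call average.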
Force-pushed 4d0e5a0 to f226768.
@VitalyFedyunin I looked into the code again. I kind of want to change the implementation to make use of the special branch for r__stride_0 == 1, but so far I only did this for this one function.
Force-pushed f226768 to bbcf7d7.
@VitalyFedyunin Would you mind taking a look at the recent commits and landing the PR? Thanks.
Sure, you can go ahead with it. Even the previous state was fine. I even recommend removing this ==1 check, as it is implemented incorrectly. What I meant by the previous comment is something like:

if (r__stride_0 == 1) {
  // Fast path: contiguous output, index directly without the stride multiply.
  at::parallel_for(0, n, internal::GRAIN_SIZE,
      [&r__data](int64_t p_begin, int64_t p_end) {
        for (int64_t i = p_begin; i < p_end; i++)
          r__data[i] = static_cast<scalar_t>(i);
      });
} else {
  // General path: honor the output stride.
  at::parallel_for(0, n, internal::GRAIN_SIZE,
      [&r__data, &r__stride_0](int64_t p_begin, int64_t p_end) {
        for (int64_t i = p_begin; i < p_end; i++)
          r__data[i * r__stride_0] = static_cast<scalar_t>(i);
      });
}

which helps to avoid the multiplication in the common contiguous case. But the previous version, with just:

at::parallel_for(0, n, internal::GRAIN_SIZE,
    [&r__data, &r__stride_0](int64_t p_begin, int64_t p_end) {
      for (int64_t i = p_begin; i < p_end; i++)
        r__data[i * r__stride_0] = static_cast<scalar_t>(i);
    });

is also good.
Please revert to the previous state, without the if check.
Force-pushed 8698320 to 9f4cbd7.
Sure, restored.
Force-pushed 9f4cbd7 to 3ce374a.
Force-pushed 3ce374a to 6345936.
facebook-github-bot left a comment:
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pytorchbot rebase this please
facebook-github-bot left a comment:
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: This commit parallelizes the variable initialization (from 1 to n) step on CPU.
Pull Request resolved: pytorch/pytorch#21529
Differential Revision: D15855402
Pulled By: VitalyFedyunin
fbshipit-source-id: f1ba54587451f9cb0eb5e542c3c5b458b48e1a3d
@VitalyFedyunin merged this pull request in 0702b5f.
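For context on why this is only a partial parallelization: filling the result with the indices 0..n-1 is embarrassingly parallel, but the Fisher-Yates shuffle that produces the random permutation depends on the outcome of every earlier swap, so it remains sequential. A simplified sketch under those assumptions (illustrative code, not the PR's exact implementation; the real kernel uses TH's own random generator rather than std::mt19937_64):

#include <ATen/Parallel.h>
#include <cstdint>
#include <random>
#include <utility>

template <typename scalar_t>
void randperm_sketch(scalar_t* r__data, int64_t r__stride_0, int64_t n,
                     std::mt19937_64& gen) {
  // Step 1 (parallelized by this PR): each index is written independently,
  // so the range can be split across threads.
  at::parallel_for(0, n, at::internal::GRAIN_SIZE,
      [&](int64_t p_begin, int64_t p_end) {
        for (int64_t i = p_begin; i < p_end; i++)
          r__data[i * r__stride_0] = static_cast<scalar_t>(i);
      });
  // Step 2 (still sequential): Fisher-Yates shuffle. Each swap depends on
  // the state left by all previous swaps, so this loop is not parallelized.
  for (int64_t i = 0; i < n - 1; i++) {
    std::uniform_int_distribution<int64_t> dist(i, n - 1);
    int64_t j = dist(gen);
    std::swap(r__data[i * r__stride_0], r__data[j * r__stride_0]);
  }
}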