Conversation

@xuhdev
Collaborator

@xuhdev xuhdev commented Jun 7, 2019

This commit parallelizes the variable initialization (from 1 to n) step on CPU.
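For reference, the parallelized initialization boils down to a kernel like the sketch below (based on the code discussed later in this thread; r__data and r__stride_0 are the result tensor's data pointer and first stride):

// Sketch: write the initial sequence r_[i] = i in parallel,
// letting parallel_for split [0, n) into chunks of at least GRAIN_SIZE.
at::parallel_for(0, n, internal::GRAIN_SIZE,
    [&r__data, &r__stride_0](int64_t p_begin, int64_t p_end) {
      for (int64_t i = p_begin; i < p_end; i++)
        r__data[i * r__stride_0] = static_cast<scalar_t>(i);
    });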

@xuhdev xuhdev force-pushed the parallel/randperm branch from 6520020 to 25ea431 Compare June 7, 2019 18:42
@xuhdev xuhdev changed the title Partially parallelize randperm. Partially parallelize randperm on CPU. Jun 7, 2019
@jeffreyksmithjr
Contributor

This could use a more detailed description, if there is more context around this change.

@jeffreyksmithjr jeffreyksmithjr added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Jun 10, 2019
@xuhdev
Collaborator Author

xuhdev commented Jun 10, 2019

@jeffreyksmithjr Thanks, I updated the description. I don't have much more to say -- this is a pretty general performance improvement that is self-explanatory.

VitalyFedyunin
VitalyFedyunin previously approved these changes Jun 11, 2019
Contributor

@VitalyFedyunin VitalyFedyunin left a comment

Generally looks good, but consider:

  1. moving the CPU code into the native/cpu folder, as that will enable AVX support (a sketch of this pattern follows below);
  2. implementing a special code branch for the case when r__stride_0 == 1, since that will almost always be the case and the kernel will run much faster.
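For readers unfamiliar with the native/cpu layout, the usual DispatchStub pattern looks roughly like the sketch below (names such as randperm_stub are hypothetical; this is not code from this PR). Files under aten/src/ATen/native/cpu are compiled once per supported SIMD level, which is what enables AVX:

// In a native header / .cpp (sketch):
using randperm_fill_fn = void (*)(Tensor& result, int64_t n);
DECLARE_DISPATCH(randperm_fill_fn, randperm_stub);  // in a header
DEFINE_DISPATCH(randperm_stub);                     // in the native .cpp
// ... inside the operator implementation:
randperm_stub(kCPU, result, n);

// In aten/src/ATen/native/cpu/RandpermKernel.cpp (sketch):
static void randperm_fill_kernel(Tensor& result, int64_t n) {
  // parallel fill of the initial sequence, vectorizable per SIMD level
}
REGISTER_DISPATCH(randperm_stub, &randperm_fill_kernel);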

@xuhdev
Collaborator Author

xuhdev commented Jun 11, 2019

@VitalyFedyunin Thanks, this is really good to know. I will try to look into how to move more CPU functions into the native/cpu folder.

Do you happen to know where I can systematically learn this kind of information about PyTorch internals? It would be quite helpful if such a document were available.

@VitalyFedyunin
Contributor

We are working on better docs now; for the time being, you can look at aten/src/ATen/native/cpu/README.md.

It is also reasonable to attach your benchmark methodology and results, especially when you are working on performance-related tasks.
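For instance, a minimal timing sketch along these lines (purely illustrative; the size and repeat count are arbitrary, and it assumes the libtorch C++ API rather than any particular benchmark harness):

#include <torch/torch.h>
#include <chrono>
#include <iostream>

int main() {
  const int64_t n = 100000000;   // large enough for threading to matter
  torch::randperm(n);            // warm-up
  const int reps = 10;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < reps; ++i)
    torch::randperm(n);
  auto stop = std::chrono::steady_clock::now();
  std::cout << std::chrono::duration<double>(stop - start).count() / reps
            << " s per randperm(" << n << ")\n";
}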

@xuhdev xuhdev force-pushed the parallel/randperm branch 2 times, most recently from 4d0e5a0 to f226768 Compare June 13, 2019 23:15
@xuhdev
Collaborator Author

xuhdev commented Jun 13, 2019

@VitalyFedyunin I looked into the code again. I kind of want to change the implementation to make use of arange_out instead of implementing it once again, and leave all possible future optimization (if it ever happens) to arange_out. I have changed the current code to follow this thought. What do you think?

But I only did this for stride == 1 -- I looked into the code of range_cpu_out, and it seems to assume that the stride is always 1. I'm a bit confused: is it safe to assume the stride is 1, or does arange_out not handle it correctly?
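For context, the idea is roughly the sketch below (illustrative only, assuming an arange_out(Tensor& result, Scalar end) overload; not necessarily the final code in this PR):

// Sketch: delegate the initial fill to arange_out when the result is
// contiguous, so future optimizations of arange_out carry over; otherwise
// fall back to the strided parallel fill.
if (result.is_contiguous()) {
  at::arange_out(result, n);
} else {
  at::parallel_for(0, n, internal::GRAIN_SIZE,
      [&r__data, &r__stride_0](int64_t p_begin, int64_t p_end) {
        for (int64_t i = p_begin; i < p_end; i++)
          r__data[i * r__stride_0] = static_cast<scalar_t>(i);
      });
}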

@xuhdev xuhdev force-pushed the parallel/randperm branch from f226768 to bbcf7d7 Compare June 14, 2019 00:17
@zhangguanheng66
Contributor

@VitalyFedyunin Would you mind taking a look at the recent commits and landing the PR? Thanks.

@VitalyFedyunin
Contributor

Sure, go ahead with it. Even the previous state was fine. In fact, I recommend removing this ==1 check, as it is implemented incorrectly. What I meant by my previous comment was something like:

if (r__stride_0 == 1) {
  // Contiguous case: write to consecutive elements directly.
  at::parallel_for(0, n, internal::GRAIN_SIZE,
      [&r__data](int64_t p_begin, int64_t p_end) {
        for (int64_t i = p_begin; i < p_end; i++)
          r__data[i] = static_cast<scalar_t>(i);
      });
} else {
  // Strided case: scale the index by the stride.
  at::parallel_for(0, n, internal::GRAIN_SIZE,
      [&r__data, &r__stride_0](int64_t p_begin, int64_t p_end) {
        for (int64_t i = p_begin; i < p_end; i++)
          r__data[i * r__stride_0] = static_cast<scalar_t>(i);
      });
}

This helps avoid the multiplication in the contiguous case.

But the previous version, with just:

at::parallel_for(0, n, internal::GRAIN_SIZE,
    [&r__data, &r__stride_0](int64_t p_begin, int64_t p_end) {
      for (int64_t i = p_begin; i < p_end; i++)
        r__data[i * r__stride_0] = static_cast<scalar_t>(i);
    });

is also good.

@VitalyFedyunin VitalyFedyunin dismissed their stale review June 14, 2019 18:27

Please revert to the previous state, without the if check.

@xuhdev xuhdev force-pushed the parallel/randperm branch 2 times, most recently from 8698320 to 9f4cbd7 Compare June 14, 2019 19:42
@xuhdev
Collaborator Author

xuhdev commented Jun 14, 2019

Sure, restored.

@xuhdev xuhdev force-pushed the parallel/randperm branch from 9f4cbd7 to 3ce374a Compare June 14, 2019 22:45
@xuhdev xuhdev force-pushed the parallel/randperm branch from 3ce374a to 6345936 Compare June 15, 2019 16:21
Contributor

@facebook-github-bot facebook-github-bot left a comment

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@umanwizard
Contributor

@pytorchbot rebase this please

Contributor

@facebook-github-bot facebook-github-bot left a comment

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@xuhdev xuhdev deleted the parallel/randperm branch June 20, 2019 18:18
zdevito pushed a commit to zdevito/ATen that referenced this pull request Jun 20, 2019
Summary:
This commit parallelizes the variable initialization (from 1 to n) step on CPU.
Pull Request resolved: pytorch/pytorch#21529

Differential Revision: D15855402

Pulled By: VitalyFedyunin

fbshipit-source-id: f1ba54587451f9cb0eb5e542c3c5b458b48e1a3d
@facebook-github-bot
Contributor

@VitalyFedyunin merged this pull request in 0702b5f.

iotamudelta pushed a commit to ROCm/pytorch that referenced this pull request Jun 21, 2019
Summary:
This commit parallelizes the variable initialization (from 1 to n) step on CPU.
Pull Request resolved: pytorch#21529

Differential Revision: D15855402

Pulled By: VitalyFedyunin

fbshipit-source-id: f1ba54587451f9cb0eb5e542c3c5b458b48e1a3d