
Conversation

@peterbell10
Collaborator

Fixes #32008

This is similar to @CaoZhongZ's patch which runs on all OpenMP threads in the team and selectively exits early to scale the number of threads active. I have also restored the if clause from before #26963 so that running on 1 thread should still avoid additional synchronisation.

One comment is that this does slightly change the meaning of at::get_num_threads inside a parallel_for loop, since it's not guaranteed that the function was called on that many threads. I've looked at the uses within ATen and couldn't see anything that would be problematic. There are a few places in quantized that seem to make this assumption, but they always use a grain size of 1, so they should be safe:

const int num_tasks = at::get_num_threads();
at::parallel_for(0, num_tasks, 1, [&](int64_t begin, int64_t end) {
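
For illustration, here is a minimal sketch of the scheme described above. The names (parallel_for_sketch and the local divup helper) are hypothetical and this is not the actual ATen implementation: every thread in the OpenMP team enters the region, surplus threads exit early so the number of active threads scales with the available work, and the if clause skips the parallel region when a single thread is enough.

#include <omp.h>
#include <algorithm>
#include <cstdint>
#include <functional>

// Hypothetical helper mirroring ATen's divup: integer division, rounded up.
inline int64_t divup(int64_t x, int64_t y) { return (x + y - 1) / y; }

void parallel_for_sketch(int64_t begin, int64_t end, int64_t grain_size,
                         const std::function<void(int64_t, int64_t)>& f) {
  int64_t range = end - begin;
  if (range <= 0) return;
  // The `if` clause avoids a parallel region (and its synchronisation) when
  // the range fits in a single grain or only one thread is available.
#pragma omp parallel if (range > grain_size && omp_get_max_threads() > 1)
  {
    // Cap the number of active threads by the amount of work available.
    int64_t num_threads = std::min<int64_t>(
        omp_get_num_threads(), divup(range, std::max<int64_t>(grain_size, 1)));
    int64_t tid = omp_get_thread_num();
    int64_t chunk_size = divup(range, num_threads);
    int64_t chunk_begin = begin + tid * chunk_size;
    if (chunk_begin < end) {  // surplus threads in the team simply exit early
      f(chunk_begin, std::min(end, chunk_begin + chunk_size));
    }
  }
}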

@ezyang
Contributor

ezyang commented Jan 31, 2020

@VitalyFedyunin do you think you could take a look? I added some other folks who might be interested too.

@ezyang added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Feb 3, 2020
Contributor

@facebook-github-bot left a comment

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

@ilia-cher left a comment

Could you add a comment for at::get_num_threads in ATen/Parallel.h noting the behavior there? Alternatively, we could probably change get_num_threads to get_max_threads to emphasize the upper bound.

@peterbell10
Collaborator Author

@ilia-cher I've added the comment, but noticed that at::get_num_threads actually calls omp_get_max_threads, so it was never guaranteed to be the actual number of threads launched anyway.
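
As a side note on the distinction mentioned here, the following small illustration (plain OpenMP, not ATen code) shows that omp_get_max_threads() is only an upper bound, while omp_get_num_threads() reports how many threads are actually executing the current parallel region:

#include <omp.h>
#include <cstdio>

int main() {
  // Upper bound on the team size for future parallel regions.
  printf("omp_get_max_threads(): %d\n", omp_get_max_threads());
#pragma omp parallel num_threads(2)
  {
#pragma omp single
    // Actual team size of the region we are currently in (typically 2 here).
    printf("omp_get_num_threads(): %d\n", omp_get_num_threads());
  }
  return 0;
}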

@dr-ci

dr-ci bot commented Feb 7, 2020

💊 CircleCI build failures summary and remediations

As of commit f8fc17a:

Commit f8fc17a was recently pushed. Waiting for builds...


This comment was automatically generated by Dr. CI.

Contributor

@facebook-github-bot left a comment

@VitalyFedyunin is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

@facebook-github-bot left a comment

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@VitalyFedyunin merged this pull request in 0808485.

ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020
Summary:
Fixes pytorch#32008

This is similar to CaoZhongZ's patch which runs on all OpenMP threads in the team and selectively exits early to scale the number of threads active. I have also restored the `if` clause from before pytorch#26963 so that running on 1 thread should still avoid additional synchronisation.

One comment is that this does slightly change the meaning of `at::get_num_threads` inside of a `parallel_for` loop since it's not guaranteed that the function was called on that many threads. I've looked at the uses within ATen and couldn't see anything that would be problematic. There are a few places in `quantized` that seem to make this assumption but they always use a grain size of 1 so should be safe:
https://github.com/pytorch/pytorch/blob/d9e99ab544cceaf346605db1af4a862197a107cd/aten/src/ATen/native/quantized/cpu/qconv.cpp#L436-L437
Pull Request resolved: pytorch#32875

Differential Revision: D19775823

Pulled By: VitalyFedyunin

fbshipit-source-id: 4f843b78cdb9e2766339590d728923786a00af6d
}

int64_t tid = omp_get_thread_num();
int64_t chunk_size = divup((end - begin), omp_get_num_threads());
Contributor

@imaginary-person Feb 27, 2021

@peterbell10, this line in the original code was incorrect.
Since the work is to be distributed among num_threads threads, this line should have been:
int64_t chunk_size = divup((end - begin), num_threads);

That's why @CaoZhongZ's workaround patch also delivered correct results.

Collaborator Author

@peterbell10

omp_get_num_threads() corresponds to the number of threads in the active parallel region. See these docs:
https://gcc.gnu.org/onlinedocs/libgomp/omp_005fget_005fnum_005fthreads.html

Specifically, note the reference to the NUM_THREADS clause:

At runtime, the size of the current team may be set either by the NUM_THREADS clause or by omp_set_num_threads.
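
A small worked illustration of this point, not taken from the ATen sources: when a region is launched with a num_threads clause, omp_get_num_threads() inside that region normally equals the requested team size, so dividing by it is equivalent to dividing by num_threads. The divup helper below is a hypothetical stand-in for the one in the snippet.

#include <omp.h>
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Hypothetical stand-in for ATen's divup: integer division, rounded up.
inline int64_t divup(int64_t x, int64_t y) { return (x + y - 1) / y; }

int main() {
  const int64_t begin = 0, end = 10;
  const int requested_threads = 4;
#pragma omp parallel num_threads(requested_threads)
  {
    int64_t tid = omp_get_thread_num();
    // With the num_threads clause above, this is normally divup(10, 4) == 3.
    int64_t chunk_size = divup(end - begin, omp_get_num_threads());
    int64_t chunk_begin = begin + tid * chunk_size;
    if (chunk_begin < end) {
      printf("thread %lld handles [%lld, %lld)\n",
             (long long)tid, (long long)chunk_begin,
             (long long)std::min(end, chunk_begin + chunk_size));
    }
  }
  return 0;
}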

Contributor

@imaginary-person Feb 27, 2021

Oh, thanks a lot! I'm learning OpenMP & this is helping a lot. I mistook it for omp_get_max_threads.


Labels

Merged · open source · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)


Development

Successfully merging this pull request may close these issues.

Memory Leak & Performance Decrease on AMD CPU with PyTorch >1.3

7 participants