Workaround performance bug / memory leak in GOMP #32875
Conversation
@VitalyFedyunin do you think you could take a look? I added some other folks who might be interested too.
facebook-github-bot left a comment:
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
ilia-cher left a comment:
Could you add a comment for `at::get_num_threads` in `ATen/Parallel.h` noting the behavior there? Alternatively, we could probably change `get_num_threads` to `get_max_threads` to emphasize that it is an upper bound.
@ilia-cher I've added the comment but noticed that
Force-pushed from 0c7397d to e050d85.
Force-pushed from e050d85 to f8fc17a.
facebook-github-bot left a comment:
@VitalyFedyunin is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
facebook-github-bot left a comment:
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@VitalyFedyunin merged this pull request in 0808485.
Summary: Fixes pytorch#32008

This is similar to CaoZhongZ's patch which runs on all OpenMP threads in the team and selectively exits early to scale the number of threads active. I have also restored the `if` clause from before pytorch#26963 so that running on 1 thread should still avoid additional synchronisation.

One comment is that this does slightly change the meaning of `at::get_num_threads` inside of a `parallel_for` loop since it's not guaranteed that the function was called on that many threads. I've looked at the uses within ATen and couldn't see anything that would be problematic. There are a few places in `quantized` that seem to make this assumption but they always use a grain size of 1 so should be safe:
https://github.com/pytorch/pytorch/blob/d9e99ab544cceaf346605db1af4a862197a107cd/aten/src/ATen/native/quantized/cpu/qconv.cpp#L436-L437

Pull Request resolved: pytorch#32875
Differential Revision: D19775823
Pulled By: VitalyFedyunin
fbshipit-source-id: 4f843b78cdb9e2766339590d728923786a00af6d
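The early-exit scheme described in the summary can be sketched serially. This is an illustrative model, not the actual PR code: `divup` mirrors ATen's ceiling-division helper, and `plan_chunks` is a hypothetical function showing how a team larger than `num_threads` still yields the intended chunk assignment because surplus team threads find their start index past `end` and bail out immediately.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper for illustration; mirrors ATen's divup (ceiling division).
static int64_t divup(int64_t x, int64_t y) { return (x + y - 1) / y; }

// Serial model of the early-exit scheme: every thread in the OpenMP team
// enters the parallel region, but the work is split into only `num_threads`
// chunks, so surplus team threads see no work and exit early.
std::vector<std::pair<int64_t, int64_t>> plan_chunks(
    int64_t begin, int64_t end, int64_t num_threads, int64_t team_size) {
  std::vector<std::pair<int64_t, int64_t>> chunks;
  const int64_t chunk_size = divup(end - begin, num_threads);
  for (int64_t tid = 0; tid < team_size; ++tid) {
    const int64_t begin_tid = begin + tid * chunk_size;
    if (begin_tid >= end) continue;  // surplus team threads exit early
    chunks.emplace_back(begin_tid, std::min(end, begin_tid + chunk_size));
  }
  return chunks;
}
```

For example, splitting a range of 100 elements among 4 workers on a team of 8 threads produces 4 chunks of 25 elements; threads 4 through 7 do nothing, which is the "selectively exits early" behavior the patch relies on.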
The review thread below refers to these lines from the diff:

```cpp
int64_t tid = omp_get_thread_num();
int64_t chunk_size = divup((end - begin), omp_get_num_threads());
```
@peterbell10, this line in the original code was incorrect. Since the work is to be distributed among `num_threads` threads, this line should have been:
`int64_t chunk_size = divup((end - begin), num_threads);`
That's why @CaoZhongZ's workaround patch also delivered correct results.
`omp_get_num_threads()` corresponds to the number of threads in the active parallel region. See the libgomp docs:
https://gcc.gnu.org/onlinedocs/libgomp/omp_005fget_005fnum_005fthreads.html
Specifically, note the reference to the `num_threads` clause: "At runtime, the size of the current team may be set either by the NUM_THREADS clause or by omp_set_num_threads."
Oh, thanks a lot! I'm learning OpenMP and this is helping a lot. I mistook it for `omp_get_max_threads`.
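To make the distinction concrete, here is a small sketch (hypothetical names, not PR code) that counts how much of the range gets processed when the chunk divisor does or does not match the number of active workers. Because the original code sized its parallel region with the `num_threads` clause, `omp_get_num_threads()` equals the worker count and the range is fully covered; had it returned a larger value, as `omp_get_max_threads()` could, part of the range would be silently skipped.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper for illustration; mirrors ATen's divup (ceiling division).
static int64_t divup(int64_t x, int64_t y) { return (x + y - 1) / y; }

// Count how many elements of [begin, end) actually get processed when
// `num_threads` workers run but the chunk size is derived from `divisor`.
int64_t covered(int64_t begin, int64_t end, int64_t num_threads,
                int64_t divisor) {
  const int64_t chunk_size = divup(end - begin, divisor);
  int64_t total = 0;
  for (int64_t tid = 0; tid < num_threads; ++tid) {
    const int64_t b = begin + tid * chunk_size;
    if (b >= end) break;
    total += std::min(end, b + chunk_size) - b;
  }
  return total;
}
```

With a range of 100 elements and 4 workers, a matching divisor of 4 covers all 100 elements, while a divisor of 8 (a hypothetical oversized team) yields chunks of 13 and covers only 52 of them.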
Fixes #32008

This is similar to @CaoZhongZ's patch which runs on all OpenMP threads in the team and selectively exits early to scale the number of threads active. I have also restored the `if` clause from before #26963 so that running on 1 thread should still avoid additional synchronisation.

One comment is that this does slightly change the meaning of `at::get_num_threads` inside of a `parallel_for` loop since it's not guaranteed that the function was called on that many threads. I've looked at the uses within ATen and couldn't see anything that would be problematic. There are a few places in `quantized` that seem to make this assumption but they always use a grain size of 1 so should be safe:

pytorch/aten/src/ATen/native/quantized/cpu/qconv.cpp, lines 436 to 437 in d9e99ab