Use aten's GRAIN_SIZE for TH Tensor ops #28770
Conversation
I'm OK with merging this, but I'm giving this some more time due to uncertainty about how the old code works.
I'm also OK with merging, but hopefully the DLRM regression will be solved even better by #27980, which will eliminate the fill operation from mm/addmm completely.
It looks like the thresholds were set based on this benchmark: of the operations they tested, the fast ones that required high thresholds were labelled
facebook-github-bot left a comment
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Fixes pytorch/pytorch#28198 in my tests on a 24-core AMD Threadripper.

Profiling the benchmark showed that most of the slowdown in pytorch/pytorch#28198 was from `THFloatTensor_fill` not being distributed across threads. It internally uses `TH_TENSOR_APPLY_CONTIG`, which is a thin wrapper around `at::parallel_for` and uses `TH_OMP_OVERHEAD_THRESHOLD` (100,000) as the grain size. Here I've changed it to use `at::internal::GRAIN_SIZE`, which is 32,768, so ~1/3 of the old value. I think it makes sense to unify these two values so any future tuning in `ATen` will apply to `TH` as well. It's not entirely clear to me what the "uncertain", "ordin" and "hyper" variants are meant to represent, but I've kept them at roughly the same ratio to `TH_OMP_OVERHEAD_THRESHOLD` as before.

Here are the timing results I get:

| Version    | Full iteration time | `index_select` | `mm`     | `addmm`  |
|:----------:|--------------------:|---------------:|---------:|---------:|
| master     | 3505.85 ms/it       | 184.302 ms     | 9.520 ms | 8.494 ms |
| no scaling | 3453.18 ms/it       | 184.456 ms     | 5.810 ms | 5.069 ms |
| this PR    | 3453.23 ms/it       | 184.526 ms     | 5.824 ms | 5.202 ms |

Pull Request resolved: pytorch/pytorch#28770

Differential Revision: D18202646

Pulled By: ezyang

fbshipit-source-id: ab30e5ef24e62213f9bd3abace5c6442c75c9854
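For context on the grain-size mechanism the summary describes, here is a minimal sketch of how a contiguous fill gets dispatched through `at::parallel_for`. This is not the actual `TH_TENSOR_APPLY_CONTIG` macro; `fill_contig` is a hypothetical helper written only to show where the grain-size constant enters.

```cpp
// Minimal sketch (assumption: simplified stand-in for TH_TENSOR_APPLY_CONTIG).
#include <ATen/Parallel.h>
#include <cstdint>

void fill_contig(float* data, int64_t numel, float value) {
  // The grain size is the minimum number of elements handed to each thread.
  // This PR swaps the TH-specific constant (TH_OMP_OVERHEAD_THRESHOLD, 100,000)
  // for at::internal::GRAIN_SIZE (32,768), so smaller fills still get split
  // across threads.
  at::parallel_for(0, numel, at::internal::GRAIN_SIZE,
                   [&](int64_t begin, int64_t end) {
    for (int64_t i = begin; i < end; ++i) {
      data[i] = value;
    }
  });
}
```

With a grain size of 32,768 a fill is eligible to be split across threads at roughly a third of the tensor size that the old 100,000 threshold required, which is where the `mm`/`addmm` improvement in the table comes from.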
@peterbell10 it looks like this regressed some internal workload, so I'm unlanding it. Hopefully I'll get you more information soon.
It was a false alarm, and we reverted the revert. |
Fixes #28198 in my tests on a 24-core AMD Threadripper.

Profiling the benchmark showed that most of the slowdown in #28198 was from `THFloatTensor_fill` not being distributed across threads. It internally uses `TH_TENSOR_APPLY_CONTIG`, which is a thin wrapper around `at::parallel_for` and uses `TH_OMP_OVERHEAD_THRESHOLD` (100,000) as the grain size. Here I've changed it to use `at::internal::GRAIN_SIZE`, which is 32,768, so ~1/3 of the old value. I think it makes sense to unify these two values so any future tuning in `ATen` will apply to `TH` as well. It's not entirely clear to me what the "uncertain", "ordin" and "hyper" variants are meant to represent, but I've kept them at roughly the same ratio to `TH_OMP_OVERHEAD_THRESHOLD` as before.

Here are the timing results I get:

| Version    | Full iteration time | `index_select` | `mm`     | `addmm`  |
|:----------:|--------------------:|---------------:|---------:|---------:|
| master     | 3505.85 ms/it       | 184.302 ms     | 9.520 ms | 8.494 ms |
| no scaling | 3453.18 ms/it       | 184.456 ms     | 5.810 ms | 5.069 ms |
| this PR    | 3453.23 ms/it       | 184.526 ms     | 5.824 ms | 5.202 ms |
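As a rough way to observe the effect on a single op, one could time a large contiguous fill directly from C++. This is not the DLRM benchmark from #28198; the tensor size and thread count below are arbitrary illustrative values, and a real measurement would warm up and average over many iterations.

```cpp
// Rough timing sketch (assumption: illustrative only, not the PR's benchmark).
#include <ATen/ATen.h>
#include <ATen/Parallel.h>
#include <chrono>
#include <iostream>

int main() {
  at::set_num_threads(24);  // e.g. one thread per physical core
  auto t = at::empty({64 * 1024 * 1024}, at::kFloat);

  auto start = std::chrono::steady_clock::now();
  t.fill_(1.0f);  // exercises the contiguous fill path whose grain size this PR changes
  auto end = std::chrono::steady_clock::now();

  std::cout << "fill: "
            << std::chrono::duration<double, std::milli>(end - start).count()
            << " ms" << std::endl;
  return 0;
}
```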