
Conversation

@peterbell10
Collaborator

Fixes #28198 in my tests on a 24-core AMD Threadripper.

Profiling the benchmark showed that most of the slowdown in #28198 was from `THFloatTensor_fill` not being distributed across threads. It internally uses `TH_TENSOR_APPLY_CONTIG`, which is a thin wrapper around `at::parallel_for` and uses `TH_OMP_OVERHEAD_THRESHOLD` (100,000) as the grain size.

Here I've changed it to use `at::internal::GRAIN_SIZE`, which is 32,768, roughly 1/3 of the old value. I think it makes sense to unify these two values so any future tuning in `ATen` will apply to `TH` as well. It's not entirely clear to me what the "uncertain", "ordin" and "hyper" variants are meant to represent, but I've kept them at roughly the same ratio to `TH_OMP_OVERHEAD_THRESHOLD` as before.
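
For context, here is a minimal sketch (not the actual `TH_TENSOR_APPLY_CONTIG` expansion; `fill_contig` is a made-up name) of how a contiguous fill ends up distributed through `at::parallel_for`, and why the grain size decides when smaller tensors get multi-threaded:

```cpp
#include <ATen/Parallel.h>
#include <cstdint>

// Sketch of a contiguous fill distributed via at::parallel_for.
// at::parallel_for only splits the range across threads once the element
// count exceeds the grain size, so lowering the grain size from
// TH_OMP_OVERHEAD_THRESHOLD (100,000) to at::internal::GRAIN_SIZE (32,768)
// lets smaller fills run multi-threaded.
void fill_contig(float* data, int64_t numel, float value) {
  at::parallel_for(0, numel, at::internal::GRAIN_SIZE,
                   [&](int64_t begin, int64_t end) {
    for (int64_t i = begin; i < end; ++i) {
      data[i] = value;
    }
  });
}
```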

Here are the timing results I get:

| Version    | Full iteration time | `index_select` | `mm`     | `addmm`  |
|:----------:|--------------------:|---------------:|---------:|---------:|
| master     | 3505.85 ms/it       | 184.302 ms     | 9.520 ms | 8.494 ms |
| no scaling | 3453.18 ms/it       | 184.456 ms     | 5.810 ms | 5.069 ms |
| this PR    | 3453.23 ms/it       | 184.526 ms     | 5.824 ms | 5.202 ms |

@ezyang
Contributor

ezyang commented Oct 28, 2019

These thresholds were added in #5584. @zy97140, do you want to take a look here?

@ezyang
Contributor

ezyang commented Oct 28, 2019

I'm OK with merging this, but I'm giving this some more time due to uncertainty about how the old code works.

@ngimel
Collaborator

ngimel commented Oct 28, 2019

I'm also OK with merging, but hopefully the DLRM regression will be solved even better by #27980, which will eliminate the fill operation from `mm`/`addmm` completely.

@peterbell10
Collaborator Author

peterbell10 commented Oct 28, 2019

It looks like the thresholds were set based on this benchmark:
https://github.com/zy97140/omp-benchmark-for-pytorch

Of the operations they tested, the fast ones that required high thresholds were labelled ORDIN, and the slower operations like `sqrt` and `cos`, which get a lower threshold, were labelled HYPER. UNCERTAIN means they didn't test it. I can't see why they chose those exact values, though; the numbers don't appear anywhere in the repo's README.
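
To make the intuition concrete, here is a sketch (not the benchmark's code; the grain-size name and value below are placeholders) of why a costly per-element op like `sqrt` can justify a smaller threshold than a cheap fill or copy: the more work each element does, the fewer elements are needed before threading overhead is amortized.

```cpp
#include <ATen/Parallel.h>
#include <cmath>
#include <cstdint>

// Placeholder grain size for illustration only; the real TH thresholds go
// by the "ordin"/"hyper" names discussed above. A costly per-element op
// such as sqrt amortizes threading overhead on fewer elements than a cheap
// fill/copy, which is why the slower ops can use a lower threshold.
constexpr int64_t kCostlyOpGrain = 4096;

void sqrt_contig(float* out, const float* in, int64_t numel) {
  at::parallel_for(0, numel, kCostlyOpGrain,
                   [&](int64_t begin, int64_t end) {
    for (int64_t i = begin; i < end; ++i) {
      out[i] = std::sqrt(in[i]);
    }
  });
}
```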

@facebook-github-bot
Contributor

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ifedan added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Oct 29, 2019
@facebook-github-bot
Contributor

@ezyang merged this pull request in fe88046.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 31, 2019
Pull Request resolved: pytorch/pytorch#28770

Differential Revision: D18202646

Pulled By: ezyang

fbshipit-source-id: ab30e5ef24e62213f9bd3abace5c6442c75c9854
@ezyang
Contributor

ezyang commented Oct 31, 2019

@peterbell10 it looks like this regressed some internal workload, so I'm unlanding it. Hopefully I'll have more information for you soon.

@ezyang
Contributor

ezyang commented Nov 6, 2019

It was a false alarm, and we reverted the revert.
