
Conversation

@peterbell10
Collaborator

Fixes #28198 in my tests on a 24-core AMD Threadripper.

Profiling the benchmark showed that most of the slowdown in #28198 was from `THFloatTensor_fill` not being distributed across threads. It internally uses `TH_TENSOR_APPLY_CONTIG`, which is a thin wrapper around `at::parallel_for` and uses `TH_OMP_OVERHEAD_THRESHOLD` (100,000) as the grain size.

Here I've changed it to use `at::internal::GRAIN_SIZE`, which is 32,768, roughly 1/3 of the old value. I think it makes sense to unify these two values so any future tuning in `ATen` will apply to `TH` as well. It's not entirely clear to me what the "uncertain", "ordin" and "hyper" variants are meant to represent, but I've kept them at roughly the same ratio to `TH_OMP_OVERHEAD_THRESHOLD` as before.
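
For context, here is a minimal sketch (not the actual `TH_TENSOR_APPLY_CONTIG` expansion; `fill_contig` is a made-up name) of how a contiguous fill ends up distributed through `at::parallel_for`, and why the grain size decides when smaller tensors get multi-threaded:

```cpp
#include <ATen/Parallel.h>
#include <cstdint>

// Sketch of a contiguous fill distributed via at::parallel_for.
// at::parallel_for only splits the range across threads once the element
// count exceeds the grain size, so lowering the grain size from
// TH_OMP_OVERHEAD_THRESHOLD (100,000) to at::internal::GRAIN_SIZE (32,768)
// lets smaller fills run multi-threaded.
void fill_contig(float* data, int64_t numel, float value) {
  at::parallel_for(0, numel, at::internal::GRAIN_SIZE,
                   [&](int64_t begin, int64_t end) {
    for (int64_t i = begin; i < end; ++i) {
      data[i] = value;
    }
  });
}
```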

Here are the timing results I get:

| Version    | Full iteration time | `index_select` | `mm`     | `addmm`  |
|:----------:|--------------------:|---------------:|---------:|---------:|
| master     | 3505.85 ms/it       | 184.302 ms     | 9.520 ms | 8.494 ms |
| no scaling | 3453.18 ms/it       | 184.456 ms     | 5.810 ms | 5.069 ms |
| this PR    | 3453.23 ms/it       | 184.526 ms     | 5.824 ms | 5.202 ms |

@ezyang
Contributor

ezyang commented Oct 28, 2019

These thresholds were added in #5584. @zy97140, do you want to take a look here?

@ezyang
Contributor

ezyang commented Oct 28, 2019

I'm OK with merging this, but I'm giving this some more time due to uncertainty about how the old code works.

@ngimel
Collaborator

ngimel commented Oct 28, 2019

I'm also OK with merging, but hopefully the DLRM regression will be solved even better by #27980, which will eliminate the fill operation from `mm`/`addmm` completely.

@peterbell10
Collaborator Author

peterbell10 commented Oct 28, 2019

It looks like the thresholds were set based on this benchmark:
https://github.com/zy97140/omp-benchmark-for-pytorch

Of the operations they tested, the fast ones that required high thresholds were labelled ORDIN, and the slower operations like `sqrt` and `cos`, which get a lower threshold, were labelled HYPER. UNCERTAIN means they didn't test it. I can't see why they chose those exact values, though; the numbers don't appear anywhere in the repo's README.
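
To make the intuition concrete, here is a sketch (not the benchmark's code; the grain-size name and value below are placeholders) of why a costly per-element op like `sqrt` can justify a smaller threshold than a cheap fill or copy: the more work each element does, the fewer elements are needed before threading overhead is amortized.

```cpp
#include <ATen/Parallel.h>
#include <cmath>
#include <cstdint>

// Placeholder grain size for illustration only; the real TH thresholds go
// by the "ordin"/"hyper" names discussed above. A costly per-element op
// such as sqrt amortizes threading overhead on fewer elements than a cheap
// fill/copy, which is why the slower ops can use a lower threshold.
constexpr int64_t kCostlyOpGrain = 4096;

void sqrt_contig(float* out, const float* in, int64_t numel) {
  at::parallel_for(0, numel, kCostlyOpGrain,
                   [&](int64_t begin, int64_t end) {
    for (int64_t i = begin; i < end; ++i) {
      out[i] = std::sqrt(in[i]);
    }
  });
}
```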

@facebook-github-bot
Contributor

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ifedan added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Oct 29, 2019
@facebook-github-bot
Contributor

@ezyang merged this pull request in fe88046.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 31, 2019
Pull Request resolved: pytorch/pytorch#28770

Differential Revision: D18202646

Pulled By: ezyang

fbshipit-source-id: ab30e5ef24e62213f9bd3abace5c6442c75c9854
@ezyang
Contributor

ezyang commented Oct 31, 2019

@peterbell10 it looks like this regressed some internal workload, so I'm unlanding it. Hopefully I'll have more information for you soon.

@ezyang
Contributor

ezyang commented Nov 6, 2019

It was a false alarm, and we reverted the revert.
