Speed up threshold on CPU. #27155

xuhdev · 2019-10-01T19:14:27Z

This is a small fix, but the runtime improvement does seem consistent (a bit less than 10%):

Benchmark (no turbo, Release build, gcc 8.3, RHEL 7.7, Intel(R) Core(TM) i7-8850H):

import timeit

for dtype in ('torch.double', 'torch.float', 'torch.int16', 'torch.int32', 'torch.int64'):
    print(f'dtype={dtype}')
    for n, t in [(70_000, 200000),
                (700_000, 20000)]:
        print(f'torch.nn.Threshold(0.1, 20)(a), numel() == {n} for {t} times')
        print(timeit.timeit(f'm(a)', setup=f'import torch; m=torch.nn.Threshold(0.1, 20); a = torch.arange({n}, dtype={dtype})', number=t))
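
For readers unfamiliar with the operator being timed: `torch.nn.Threshold(threshold, value)` keeps each element that is strictly greater than `threshold` and replaces every other element with `value`. The sketch below is a pure-Python reference of that rule only, not the ATen CPU kernel this PR touches; the helper name `threshold_reference` is made up for illustration.

```python
def threshold_reference(values, threshold, value):
    """Pure-Python reference: keep x where x > threshold, else substitute value."""
    return [x if x > threshold else value for x in values]


# Same arguments as the benchmark (threshold=0.1, value=20), on a tiny
# arange-like input instead of a 70_000-element tensor.
sample = list(range(5))
print(threshold_reference(sample, 0.1, 20))
```

With the benchmark's `torch.arange` inputs, only the leading `0` falls at or below the threshold, so nearly every element takes the "keep" path.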

Before:

dtype=torch.double
torch.nn.Threshold(0.1, 20)(a), numel() == 70000 for 200000 times
8.88117562699972
torch.nn.Threshold(0.1, 20)(a), numel() == 700000 for 20000 times
9.525143070000013
dtype=torch.float
torch.nn.Threshold(0.1, 20)(a), numel() == 70000 for 200000 times
5.673380930000349
torch.nn.Threshold(0.1, 20)(a), numel() == 700000 for 20000 times
3.677610996000112
dtype=torch.int16
torch.nn.Threshold(0.1, 20)(a), numel() == 70000 for 200000 times
3.957677209999929
torch.nn.Threshold(0.1, 20)(a), numel() == 700000 for 20000 times
1.8512293700005102
dtype=torch.int32
torch.nn.Threshold(0.1, 20)(a), numel() == 70000 for 200000 times
5.624350482999944
torch.nn.Threshold(0.1, 20)(a), numel() == 700000 for 20000 times
3.670380037000541
dtype=torch.int64
torch.nn.Threshold(0.1, 20)(a), numel() == 70000 for 200000 times
8.86375758200029
torch.nn.Threshold(0.1, 20)(a), numel() == 700000 for 20000 times
9.468234717999621

After:

dtype=torch.double
torch.nn.Threshold(0.1, 20)(a), numel() == 70000 for 200000 times
8.64173036200009
torch.nn.Threshold(0.1, 20)(a), numel() == 700000 for 20000 times
9.456986365000375
dtype=torch.float
torch.nn.Threshold(0.1, 20)(a), numel() == 70000 for 200000 times
5.431988049000211
torch.nn.Threshold(0.1, 20)(a), numel() == 700000 for 20000 times
3.446968590000324
dtype=torch.int16
torch.nn.Threshold(0.1, 20)(a), numel() == 70000 for 200000 times
3.743787463999979
torch.nn.Threshold(0.1, 20)(a), numel() == 700000 for 20000 times
1.823233144000369
dtype=torch.int32
torch.nn.Threshold(0.1, 20)(a), numel() == 70000 for 200000 times
5.42801834400052
torch.nn.Threshold(0.1, 20)(a), numel() == 700000 for 20000 times
3.4600211680008215
dtype=torch.int64
torch.nn.Threshold(0.1, 20)(a), numel() == 70000 for 200000 times
8.562551314000302
torch.nn.Threshold(0.1, 20)(a), numel() == 700000 for 20000 times
9.37924196699987
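
To make the size of the improvement easier to read off, the following sketch computes the relative reduction in wall time for each dtype and size from the timings posted above. The numbers are copied from the Before/After output; the dictionary layout and the helper name `percent_faster` are assumptions for illustration. Each configuration was timed once, so small differences may be within noise.

```python
# Timings in seconds, copied from the Before/After output above.
# Keys are (dtype, numel).
before = {
    ("double", 70_000): 8.88117562699972,
    ("double", 700_000): 9.525143070000013,
    ("float", 70_000): 5.673380930000349,
    ("float", 700_000): 3.677610996000112,
    ("int16", 70_000): 3.957677209999929,
    ("int16", 700_000): 1.8512293700005102,
    ("int32", 70_000): 5.624350482999944,
    ("int32", 700_000): 3.670380037000541,
    ("int64", 70_000): 8.86375758200029,
    ("int64", 700_000): 9.468234717999621,
}
after = {
    ("double", 70_000): 8.64173036200009,
    ("double", 700_000): 9.456986365000375,
    ("float", 70_000): 5.431988049000211,
    ("float", 700_000): 3.446968590000324,
    ("int16", 70_000): 3.743787463999979,
    ("int16", 700_000): 1.823233144000369,
    ("int32", 70_000): 5.42801834400052,
    ("int32", 700_000): 3.4600211680008215,
    ("int64", 70_000): 8.562551314000302,
    ("int64", 700_000): 9.37924196699987,
}


def percent_faster(t_before, t_after):
    """Relative reduction in wall time, as a percentage of the old time."""
    return (t_before - t_after) / t_before * 100


improvements = {key: percent_faster(before[key], after[key]) for key in before}

for (dtype, numel), pct in improvements.items():
    print(f"{dtype:>6} numel={numel:>7}: {pct:.2f}% less time")
```

On these numbers every configuration gets faster, with reductions ranging from under 1% to a little over 6%.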


facebook-github-bot

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

xuhdev · 2019-10-22T00:05:57Z

@pytorchbot merge this please

Summary: identical to the pull request description above (benchmark script with Before/After timings).

Pull Request resolved: pytorch/pytorch#27155

Differential Revision: D17790768

Pulled By: VitalyFedyunin

fbshipit-source-id: 3281eaff77ddddd658048c9e73824dd68c548591

facebook-github-bot · 2019-11-01T05:45:54Z

@VitalyFedyunin merged this pull request in 8a1f42b.

xuhdev requested review from colesbury and xiaomengy October 1, 2019 19:14

pytorchbot added module: cpu CPU specific problem (e.g., perf, algorithm) module: operators labels Oct 1, 2019

ezyang added the open source label Oct 1, 2019

soumith requested a review from VitalyFedyunin October 4, 2019 17:27

VitalyFedyunin approved these changes Oct 7, 2019

facebook-github-bot reviewed Oct 7, 2019

pytorchbot added the merge-this-please Was marked for merge with @pytorchbot merge this please label Oct 22, 2019

facebook-github-bot closed this in 8a1f42b Nov 1, 2019

facebook-github-bot added the merged label Nov 1, 2019

mruberry added the Merged label Oct 28, 2020
