
Conversation

@xuhdev (Collaborator) commented Oct 22, 2019

This speedup is achieved by replacing Vec<uint16_t> with Vec<int16_t>, for which AVX-optimized implementations are available.
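
For illustration only, here is a minimal sketch of why an int16 SIMD path helps, written with raw AVX2 intrinsics rather than ATen's actual Vec class (the function name and the intrinsics-level formulation are assumptions of the sketch, not code from this PR): half and bfloat16 values are just 16-bit payloads, so filling them reduces to broadcasting one bit pattern and doing wide integer stores instead of a scalar loop.

```cpp
// Illustrative sketch, not the ATen kernel from this PR: fill a buffer of
// 16-bit payloads (half/bfloat16 reinterpreted as 16-bit integers) with AVX2.
// Compile with -mavx2.
#include <immintrin.h>
#include <cstdint>

void fill_fp16_bits_avx2(uint16_t* dst, uint16_t bits, int64_t n) {
  // Broadcast the 16-bit pattern into all 16 lanes of a 256-bit register.
  const __m256i v = _mm256_set1_epi16(static_cast<int16_t>(bits));
  int64_t i = 0;
  for (; i + 16 <= n; i += 16) {
    // Unaligned 256-bit store: 16 elements per iteration.
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst + i), v);
  }
  for (; i < n; ++i) {  // scalar tail for the remaining elements
    dst[i] = bits;
  }
}
```

Treating the payload as a signed 16-bit integer is harmless here because fill_ only copies a bit pattern; per the description above, the Vec<int16_t> specialization has such vectorized paths available while Vec<uint16_t> does not.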

Benchmark (Debian 10, Release build, gcc 8.3, no turbo, Intel(R) Xeon(R) E-2136):

```python
import timeit
for dtype in ('torch.bfloat16', 'torch.half'):
    for n, t in [(40_000, 600000),
                 (400_000, 60000)]:
        print(f'a.fill_(10) for {t} times, a=torch.empty({n}, dtype={dtype})')
        print(timeit.timeit('a.fill_(10)', setup=f'import torch; a=torch.empty({n}, dtype={dtype})', number=t))
```

Before:

```
a.fill_(10) for 600000 times, a=torch.empty(40000, dtype=torch.bfloat16)
11.064065577999827
a.fill_(10) for 60000 times, a=torch.empty(400000, dtype=torch.bfloat16)
10.618151295000189
a.fill_(10) for 600000 times, a=torch.empty(40000, dtype=torch.half)
10.989039544000207
a.fill_(10) for 60000 times, a=torch.empty(400000, dtype=torch.half)
10.602233665999847
```

After:

```
a.fill_(10) for 600000 times, a=torch.empty(40000, dtype=torch.bfloat16)
1.530125006000162
a.fill_(10) for 60000 times, a=torch.empty(400000, dtype=torch.bfloat16)
1.4807136570002513
a.fill_(10) for 600000 times, a=torch.empty(40000, dtype=torch.half)
1.3946152990001792
a.fill_(10) for 60000 times, a=torch.empty(400000, dtype=torch.half)
1.457788402999995
```

@xuhdev xuhdev requested review from ifedan, izdeby and nairbv and removed request for ifedan October 22, 2019 02:09
@xuhdev xuhdev force-pushed the speedup-fill branch 3 times, most recently from 7e383b3 to 87f31ea on October 22, 2019 20:20
@xuhdev xuhdev added the module: cpu and module: operators labels Oct 22, 2019
@xuhdev (Collaborator, Author) commented Oct 24, 2019

@pytorchbot merge this please

@pytorchbot pytorchbot added the merge-this-please Was marked for merge with @pytorchbot merge this please label Oct 24, 2019
@facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)

@ezyang merged this pull request in 5cf6441.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 24, 2019
Pull Request resolved: pytorch/pytorch#28397

Differential Revision: D18125171

Pulled By: ezyang

fbshipit-source-id: bfb2da13f10bc582e9848073e428af9e36656b13
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020
Pull Request resolved: pytorch#28397

Differential Revision: D18125171

Pulled By: ezyang

fbshipit-source-id: bfb2da13f10bc582e9848073e428af9e36656b13