
Conversation

@xuhdev (Collaborator) commented Oct 22, 2019

This speedup is achieved by replacing Vec<uint16_t> with Vec<int16_t>, for which AVX-optimized implementations are available.
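
For illustration only, here is a minimal sketch of why an int16 SIMD path helps, written with raw AVX2 intrinsics rather than ATen's actual Vec class (the function name and the intrinsics-level formulation are assumptions of the sketch, not code from this PR): half and bfloat16 values are just 16-bit payloads, so filling them reduces to broadcasting one bit pattern and doing wide integer stores instead of a scalar loop.

```cpp
// Illustrative sketch, not the ATen kernel from this PR: fill a buffer of
// 16-bit payloads (half/bfloat16 reinterpreted as 16-bit integers) with AVX2.
// Compile with -mavx2.
#include <immintrin.h>
#include <cstdint>

void fill_fp16_bits_avx2(uint16_t* dst, uint16_t bits, int64_t n) {
  // Broadcast the 16-bit pattern into all 16 lanes of a 256-bit register.
  const __m256i v = _mm256_set1_epi16(static_cast<int16_t>(bits));
  int64_t i = 0;
  for (; i + 16 <= n; i += 16) {
    // Unaligned 256-bit store: 16 elements per iteration.
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(dst + i), v);
  }
  for (; i < n; ++i) {  // scalar tail for the remaining elements
    dst[i] = bits;
  }
}
```

Treating the payload as a signed 16-bit integer is harmless here because fill_ only copies a bit pattern; per the description above, the Vec<int16_t> specialization has such vectorized paths available while Vec<uint16_t> does not.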

Benchmark (Debian 10, Release build, gcc 8.3, no turbo, Intel(R) Xeon(R) E-2136):

```python
import timeit
for dtype in ('torch.bfloat16', 'torch.half'):
    for n, t in [(40_000, 600000),
                 (400_000, 60000)]:
        print(f'a.fill_(10) for {t} times, a=torch.empty({n}, dtype={dtype})')
        print(timeit.timeit('a.fill_(10)', setup=f'import torch; a=torch.empty({n}, dtype={dtype})', number=t))
```

Before:

```
a.fill_(10) for 600000 times, a=torch.empty(40000, dtype=torch.bfloat16)
11.064065577999827
a.fill_(10) for 60000 times, a=torch.empty(400000, dtype=torch.bfloat16)
10.618151295000189
a.fill_(10) for 600000 times, a=torch.empty(40000, dtype=torch.half)
10.989039544000207
a.fill_(10) for 60000 times, a=torch.empty(400000, dtype=torch.half)
10.602233665999847
```

After:

```
a.fill_(10) for 600000 times, a=torch.empty(40000, dtype=torch.bfloat16)
1.530125006000162
a.fill_(10) for 60000 times, a=torch.empty(400000, dtype=torch.bfloat16)
1.4807136570002513
a.fill_(10) for 600000 times, a=torch.empty(40000, dtype=torch.half)
1.3946152990001792
a.fill_(10) for 60000 times, a=torch.empty(400000, dtype=torch.half)
1.457788402999995
```

@xuhdev xuhdev requested review from ifedan, izdeby and nairbv and removed request for ifedan October 22, 2019 02:09
@xuhdev xuhdev force-pushed the speedup-fill branch 3 times, most recently from 7e383b3 to 87f31ea on October 22, 2019 20:20
@xuhdev xuhdev added the module: cpu and module: operators labels Oct 22, 2019
@xuhdev (Collaborator, Author) commented Oct 24, 2019

@pytorchbot merge this please

@pytorchbot pytorchbot added the merge-this-please Was marked for merge with @pytorchbot merge this please label Oct 24, 2019
@facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)

@ezyang merged this pull request in 5cf6441.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 24, 2019
Pull Request resolved: pytorch/pytorch#28397

Differential Revision: D18125171

Pulled By: ezyang

fbshipit-source-id: bfb2da13f10bc582e9848073e428af9e36656b13
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020
Pull Request resolved: pytorch#28397

Differential Revision: D18125171

Pulled By: ezyang

fbshipit-source-id: bfb2da13f10bc582e9848073e428af9e36656b13