Conversation

@zasdfgbnm (Collaborator) commented Nov 8, 2019

  • Building BinaryOpsKernel.cu takes extremely long. Split the original file into three pieces and copy the code into them.
  • Remove some useless logic.
  • Change some incorrect op names: *_cpu -> *_cuda.

@zasdfgbnm changed the title from "Building BinaryOpsKernel.cu takes a long time, parallelize it" to "[WIP] Building BinaryOpsKernel.cu takes a long time, parallelize it" Nov 8, 2019
@zasdfgbnm changed the title from "[WIP] Building BinaryOpsKernel.cu takes a long time, parallelize it" to "Improving BinaryOpsKernel.cu" Nov 8, 2019
@zasdfgbnm changed the title from "Improving BinaryOpsKernel.cu" to "[WIP] Improving BinaryOpsKernel.cu" Nov 8, 2019
}

void logical_xor_kernel_cuda(TensorIterator& iter) {
if (iter.common_dtype() == ScalarType::Bool) {
@zasdfgbnm (Collaborator, Author):

This logic is useless.

Collaborator:

Seems like so. Perhaps deleting this logic for all comparison ops will speed things up sufficiently without the need to split.

@zasdfgbnm (Collaborator, Author):

@xuhdev Even after the split, BinaryCompareKernel.cu still takes 2 min 44 s to compile.

Collaborator:

If every function in BinaryCompareKernel.cu were cut in half, the compile time might come down to something reasonable. I believe the functions in BinaryCompareKernel.cu might be the bottleneck.

@zasdfgbnm (Collaborator, Author):

Hmm, this PR already cuts it in half, but it still takes more than 2 minutes...


void lt_kernel_cuda(TensorIterator& iter) {
if (iter.common_dtype() == ScalarType::Bool) {
AT_DISPATCH_ALL_TYPES_AND2(kHalf, kBool, iter.input_dtype(), "lt_cpu", [&]() {
@zasdfgbnm (Collaborator, Author):

This should be lt_cuda, not lt_cpu.

}

void lt_kernel_cuda(TensorIterator& iter) {
if (iter.common_dtype() == ScalarType::Bool) {
@zasdfgbnm (Collaborator, Author):

This logic is useless as well. With the dynamic-casting approach in TensorIterator, the computation is always done in the common dtype, the result is stored in the common dtype, and it is then dynamically cast to bool.
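To make the point concrete, here is a rough sketch (not the exact diff in this PR) of what a comparison kernel looks like once the Bool branch is dropped: a single dispatch on the common dtype is enough, and the dispatch label is also fixed to name the CUDA kernel. The gpu_kernel helper and the lambda body below are illustrative assumptions; the actual code in the PR may use a different helper (for example, a scalar-aware variant).

// Illustrative sketch only -- not the exact code from this PR.
#include <ATen/Dispatch.h>
#include <ATen/native/TensorIterator.h>
#include <ATen/native/cuda/Loops.cuh>

namespace at { namespace native {

void lt_kernel_cuda(TensorIterator& iter) {
  // No ScalarType::Bool special case: TensorIterator already computes in the
  // common dtype and dynamically casts the boolean result into the output.
  // The dispatch label names the CUDA kernel ("lt_cuda", not "lt_cpu").
  AT_DISPATCH_ALL_TYPES_AND2(kHalf, kBool, iter.common_dtype(), "lt_cuda", [&]() {
    gpu_kernel(iter, [] GPU_LAMBDA (scalar_t a, scalar_t b) -> bool {
      return a < b;
    });
  });
}

}} // namespace at::native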

@zasdfgbnm changed the title from "[WIP] Improving BinaryOpsKernel.cu" to "Improving BinaryOpsKernel.cu" Nov 8, 2019
@zasdfgbnm (Collaborator, Author):

@ngimel @VitalyFedyunin Could you please take a look at this? You reviewed the dynamic casting of TensorIterator.

@VitalyFedyunin previously approved these changes Nov 8, 2019

@VitalyFedyunin (Contributor) left a comment:

It would be nice to add benchmark results for the changed operators, such as logical_xor_kernel_cuda.

@VitalyFedyunin dismissed their stale review November 8, 2019 22:12 ("wrong button pressed")

@zasdfgbnm (Collaborator, Author):

@VitalyFedyunin The benchmarks show that the performance changes very little:

import torch
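# Note: the %timeit calls below are IPython magics, so run this in IPython or Jupyter.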
print(torch.__version__)
print(torch.version.git_version)
print()
print('=' * 20)


for size in [10, 1000000, 100000000]:
    for dtype in [torch.float, torch.bool]:
        print('size:', size, ', dtype:', dtype)
        a = torch.randn(size, device='cuda').to(dtype)
        print('compare ops')
        torch.cuda.synchronize()
        %timeit a < a; torch.cuda.synchronize()
        print('logical_xor')
        torch.cuda.synchronize()
        %timeit torch.logical_xor(a, a); torch.cuda.synchronize()
        print()
    print('-' * 20)

before

1.4.0a0+1dd3c8e
1dd3c8e53909d6cf35ade5cf85cd7430e5c655f9

====================
size: 10 , dtype: torch.float32
compare ops
20.9 µs ± 315 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
logical_xor
20.2 µs ± 133 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

size: 10 , dtype: torch.bool
compare ops
21 µs ± 192 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
logical_xor
19.8 µs ± 97.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

--------------------
size: 1000000 , dtype: torch.float32
compare ops
23.7 µs ± 421 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
logical_xor
23.9 µs ± 296 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

size: 1000000 , dtype: torch.bool
compare ops
20.7 µs ± 110 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
logical_xor
21.3 µs ± 190 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

--------------------
size: 100000000 , dtype: torch.float32
compare ops
709 µs ± 354 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
logical_xor
713 µs ± 128 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

size: 100000000 , dtype: torch.bool
compare ops
471 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
logical_xor
471 µs ± 446 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

--------------------

after

1.4.0a0+309f6d6
309f6d6a9c53e9c0c091e8ea809b7582af9d185d

====================
size: 10 , dtype: torch.float32
compare ops
20.4 µs ± 59.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
logical_xor
19 µs ± 59.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

size: 10 , dtype: torch.bool
compare ops
20.5 µs ± 94.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
logical_xor
19 µs ± 54 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

--------------------
size: 1000000 , dtype: torch.float32
compare ops
23.6 µs ± 218 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
logical_xor
23.1 µs ± 66.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

size: 1000000 , dtype: torch.bool
compare ops
20.3 µs ± 40.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
logical_xor
20.5 µs ± 74 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

--------------------
size: 100000000 , dtype: torch.float32
compare ops
707 µs ± 212 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
logical_xor
712 µs ± 533 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

size: 100000000 , dtype: torch.bool
compare ops
472 µs ± 283 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
logical_xor
472 µs ± 210 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

--------------------

@facebook-github-bot (Contributor) left a comment:

@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@zasdfgbnm (Collaborator, Author) commented Nov 11, 2019

@VitalyFedyunin Is the internal failure real?

@zasdfgbnm deleted the split branch November 11, 2019 22:48
zdevito pushed a commit to zdevito/ATen that referenced this pull request Nov 11, 2019
Summary:
- Building `BinaryOpsKernel.cu` takes extremely long. Split the original file into three pieces and copy the code into them.
- Remove some useless logic.
- Change some incorrect op names: `*_cpu` -> `*_cuda`.
Pull Request resolved: pytorch/pytorch#29428

Differential Revision: D18408858

Pulled By: VitalyFedyunin

fbshipit-source-id: 29323b0bc40a928ae698345ad1ffe46c5851b012
@facebook-github-bot (Contributor):

@VitalyFedyunin merged this pull request in 01ad2bc.
