Refactor CUDA copy and general copy dispatch #20685

Conversation
Copy.cu goes from 308 to 190 lines of code. In general it uses the same copy strategy: cudaMemcpyAsync, a pointwise kernel, or a copy using temporary buffers. The pointwise kernel has slightly improved performance when broadcasting due to faster index calculation. This deletes "s_copy_", "_s_copy_from", and "_copy_same_type_". The only entry point now is "copy_". A mini-benchmark is here: https://gist.github.com/colesbury/706de1d4e8260afe046020988410b992 Before: https://gist.github.com/colesbury/ab454b6fe3791bff420d7bcf8c041f18 After: https://gist.github.com/colesbury/9024d242b56ab09a9ec985fa6d1620bc Results were measured on a 2.2 GHz Broadwell; no turbo; one thread; compiled with GCC 7.3.0. (Results are slower than typical usage due to turbo being off.) The only significant difference is in the CUDA [1024] -> [1024, 1024] broadcasting copy, which is ~25% faster. I don't expect a noticeable difference in real programs. CPU copy overhead is a tiny bit (~200 ns) faster, but I don't expect anyone to notice that.
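For illustration, a minimal sketch of the three-way strategy described above, in CUDA C++. The names `copy_sketch` and `pointwise_copy` are hypothetical, not the actual Copy.cu code; the real implementation also handles dtype conversion, cross-device copies, and arbitrary strides.

```cpp
// Hypothetical sketch of the copy dispatch: contiguous same-type copies go
// through cudaMemcpyAsync; strided/broadcast copies go through a pointwise
// kernel; everything else would be staged through a temporary buffer.
#include <cuda_runtime.h>
#include <cstddef>

__global__ void pointwise_copy(float* dst, const float* src, std::size_t n) {
  std::size_t i = blockIdx.x * static_cast<std::size_t>(blockDim.x) + threadIdx.x;
  if (i < n) dst[i] = src[i];  // the real kernel also maps strided/broadcast indices
}

void copy_sketch(float* dst, const float* src, std::size_t n,
                 bool both_contiguous, cudaStream_t stream) {
  if (both_contiguous) {
    // Fast path: one flat device-to-device memcpy on the current stream.
    cudaMemcpyAsync(dst, src, n * sizeof(float), cudaMemcpyDeviceToDevice, stream);
  } else {
    // Strided or broadcast layout: launch an element-wise kernel instead.
    const int threads = 256;
    const int blocks = static_cast<int>((n + threads - 1) / threads);
    pointwise_copy<<<blocks, threads, 0, stream>>>(dst, src, n);
  }
  // A third path (not shown) copies via a temporary contiguous buffer when
  // neither of the above applies, e.g. when a dtype conversion is needed.
}
```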
soumith left a comment:

lgtm
facebook-github-bot left a comment:

@colesbury has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Pull Request resolved: pytorch/pytorch#20685 Differential Revision: D15414819 Pulled By: colesbury fbshipit-source-id: d3c6e04a5020470e3bef15b1fc09503cae5df440
@colesbury merged this pull request in 320c385.

This appears to have broken XLA CI, cc @ailzhang

Oh, the XLA contbuild "runs" on PRs, but it only checks out the code (it does not build or run tests)...

This has broken XLA, in two ways. But there is some more trouble in there. Essentially,

@dlibenzi I'm confused by the cpu()/to() code path you describe. The PR changes copy, but not cpu() or to(). The error in the code path you describe sounds like it happens before copy_ is called -- that to() is creating a CPU tensor instead of an XLA tensor. What am I missing?

I am not sure whether this PR's changes are the culprit. I have not looked at the details. As described above, the
```cpp
Tensor& _s_copy__cpu(Tensor& self, const Tensor& src, bool non_blocking) {
  if (src.type_id() != CPUTensorId()) {
    _s_copy_from(src, self, non_blocking);
```
I believe this might be the path that broke XLA.
Was self an XLA tensor? Or did aten_xla_type hack it to work by copying into a CPU tensor?
src was an XLA tensor
PR pytorch#20685 incorrectly only enabled P2P access for non-contiguous copies. This can make cudaMemcpy slow for inter-GPU copies, especially on ROCm devices.
Summary: PR #20685 incorrectly only enabled P2P access for non-contiguous copies. This can make cudaMemcpy slow for inter-GPU copies, especially on ROCm devices. I didn't notice a difference on CUDA 10, but ngimel says it's important for CUDA too. Pull Request resolved: #21872 Differential Revision: D15863965 Pulled By: colesbury fbshipit-source-id: 0a858f3c338fa2a5d05949d7f65fc05a70a9dfe1
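For illustration, a hedged sketch of what "enable P2P access for all inter-GPU copies, contiguous ones included" can look like with the CUDA runtime API. The helper names below are hypothetical and are not the follow-up PR's actual code.

```cpp
// Enable peer access before every inter-GPU copy so cudaMemcpy does not
// fall back to a slow path (e.g. staging through host memory).
#include <cuda_runtime.h>
#include <cstddef>

void maybe_enable_p2p(int dst_device, int src_device) {
  if (dst_device == src_device) return;
  int can_access = 0;
  cudaDeviceCanAccessPeer(&can_access, dst_device, src_device);
  if (can_access) {
    // Simplification: does not restore the previously current device.
    cudaSetDevice(dst_device);
    // Returns cudaErrorPeerAccessAlreadyEnabled if already enabled; harmless.
    cudaDeviceEnablePeerAccess(src_device, 0);
  }
}

void copy_between_gpus(void* dst, int dst_device,
                       const void* src, int src_device,
                       std::size_t nbytes, cudaStream_t stream) {
  // Enable P2P in both directions before the copy, regardless of contiguity.
  maybe_enable_p2p(dst_device, src_device);
  maybe_enable_p2p(src_device, dst_device);
  cudaMemcpyPeerAsync(dst, dst_device, src, src_device, nbytes, stream);
}
```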