
Conversation

@colesbury (Member):

Copy.cu goes from 308 to 190 lines of code. In general it uses the same
copy strategy: cudaMemcpyAsync, a pointwise kernel, or a copy using
temporary buffers. The pointwise kernel has slightly improved
performance when broadcasting due to faster index calculation.

This deletes `s_copy_`, `_s_copy_from`, and `_copy_same_type_`. The only
entry point now is `copy_`.

A mini-benchmark is here:
https://gist.github.com/colesbury/706de1d4e8260afe046020988410b992

Before:
https://gist.github.com/colesbury/ab454b6fe3791bff420d7bcf8c041f18
After:
https://gist.github.com/colesbury/9024d242b56ab09a9ec985fa6d1620bc

Results were measured on 2.2 GHz Broadwell; no-turbo; one thread;
compiled with GCC 7.3.0. (Results are slower than typical usage due to
turbo being off.)

The only significant difference is in the CUDA [1024] -> [1024, 1024]
broadcasting copy, which is ~25% faster. I don't expect a noticeable
difference in real programs.

CPU copy overhead is a tiny bit lower (~200 ns), but I don't expect
anyone to notice that.
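
A minimal sketch of the three strategies mentioned above (assumed predicate and helper names, not the actual Copy.cu code):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Sketch only: launch_pointwise_copy_kernel and copy_via_temporary_buffer
// are placeholders, not real ATen symbols.
void cuda_copy_sketch(void* dst, const void* src, std::size_t nbytes,
                      bool same_dtype, bool contiguous, bool direct_access,
                      cudaStream_t stream) {
  if (same_dtype && contiguous && direct_access) {
    // 1. Bulk copy: a single cudaMemcpyAsync on the given stream.
    cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToDevice, stream);
  } else if (direct_access) {
    // 2. Pointwise kernel: handles dtype conversion, non-contiguous
    //    layouts, and broadcasting via per-element index calculation.
    // launch_pointwise_copy_kernel(dst, src, nbytes, stream);
  } else {
    // 3. Staged copy: go through a contiguous temporary buffer when the
    //    source and destination cannot address each other directly.
    // copy_via_temporary_buffer(dst, src, nbytes, stream);
  }
}
```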

@pytorchbot added labels: module: cpu, module: cuda, module: operators, oncall: quantization (May 19, 2019)
@soumith (Contributor) left a comment:

lgtm

@facebook-github-bot (Contributor) left a comment:

@colesbury has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request May 21, 2019
Pull Request resolved: pytorch/pytorch#20685

Differential Revision: D15414819

Pulled By: colesbury

fbshipit-source-id: d3c6e04a5020470e3bef15b1fc09503cae5df440
@facebook-github-bot (Contributor): @colesbury merged this pull request in 320c385.

@ezyang (Contributor) commented May 21, 2019:

This appears to have broken XLA CI, cc @ailzhang

@colesbury (Member Author):
@ezyang @ailzhang do we have an XLA contbuild?

@colesbury (Member Author):
Oh, the XLA contbuild "runs" on PRs, but it only checks out the code (it does not build or run tests)...

@dlibenzi (Contributor):

This has broken XLA in two ways.
One is a build issue, which we have tracked down and fixed:

pytorch/xla#697

But there is some more trouble in there.
When we do a variable.data.cpu() (where variable is an XLA variable), we end up with this stack trace:

#0  at::native::to_impl (self=..., options=..., non_blocking=false) at ../aten/src/ATen/native/TensorConversions.cpp:23
#1  0x00007fffe1fbe2d6 in at::native::to (self=..., options=..., non_blocking=false, copy=false) at ../aten/src/ATen/native/TensorConversions.cpp:55
#2  0x00007fffe23dfaaf in at::TypeDefault::to (this=0x5555571353f0, self=..., options=..., non_blocking=false, copy=false) at aten/src/ATen/TypeDefault.cpp:4251
#3  0x00007fffe48fb913 in torch::autograd::VariableType::to (this=0x5555571353f0, self=..., options=..., non_blocking=false, copy=false) at ../torch/csrc/autograd/generated/VariableType_2.cpp:16201
#4  0x00007fffe902dbfa in at::Tensor::to (this=0x7fff9454c490, options=..., non_blocking=false, copy=false) at ../aten/src/ATen/core/TensorMethods.h:817
#5  0x00007fffe9020074 in torch::autograd::dispatch_to (self=..., device=..., non_blocking=false, copy=false) at ../torch/csrc/autograd/generated/python_variable_methods.cpp:300
#6  0x00007fffe8f6434b in torch::autograd::THPVariable_cpu (self=0x7fff9454c480, args=0x0) at ../torch/csrc/autograd/generated/python_variable_methods.cpp:317

Essentially, variable.data is still a Variable, so the call goes through TypeDefault::to() (even if we override to(), our override is bypassed), which then creates a CPU tensor and asks it to copy_() from an XLA tensor.
That is not going to happen.
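
A minimal sketch of the shape of that path (assumed, not the actual TensorConversions.cpp source; to_cpu_sketch is a made-up name):

```cpp
#include <ATen/ATen.h>

// Sketch: to() materializes a destination tensor on the target backend
// and then defers the data movement to copy_().
at::Tensor to_cpu_sketch(const at::Tensor& self, bool non_blocking) {
  // The destination is allocated as a plain CPU tensor...
  at::Tensor dst = at::empty(self.sizes(), self.options().device(at::kCPU));
  // ...so the CPU copy_ path is asked to read from an XLA source tensor,
  // which it has no way to do once the s_copy_from hook is gone.
  dst.copy_(self, non_blocking);
  return dst;
}
```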

@colesbury (Member Author) commented May 21, 2019:

@dlibenzi I'm confused by the cpu()/to() code path you describe. The PR changes copy, but not cpu() or to(). The error in the code path you describe sounds like it happens before copy_ is called -- that to() is creating a CPU tensor instead of an XLA tensor. What am I missing?

@dlibenzi (Contributor):
I am not sure whether this PR's changes are the culprit; I have not looked at the details.
I am guessing the removal of the s_copy* paths dropped the hooks we used to make things right w.r.t. CPU type conversion.
We then end up with this crash:

https://pytorch.slack.com/files/UDK78U5JB/FJVMT2JSH/variable_data_cpy___crash_stack_trace.txt

As described above, the to_impl() API ends up creating a CPU tensor and asking it to copy_() from an XLA tensor.


    Tensor& _s_copy__cpu(Tensor& self, const Tensor& src, bool non_blocking) {
      if (src.type_id() != CPUTensorId()) {
        _s_copy_from(src, self, non_blocking);
Contributor: I believe this might be the path that broke XLA.

Contributor: src was an XLA tensor

facebook-github-bot pushed a commit that referenced this pull request May 21, 2019
Summary:
Fixes #20755

(Was broken in #20685)

cc vadimkantorov
Pull Request resolved: #20759

Differential Revision: D15433712

Pulled By: colesbury

fbshipit-source-id: 29f612f7d4d7b73158d6f5dc1e46fd2f8fb09a2f
colesbury added a commit to colesbury/pytorch that referenced this pull request Jun 17, 2019
PR pytorch#20685 incorrectly only enabled P2P access for non-contiguous copies.
This can make cudaMemcpy slow for inter-gpu copies, especially on ROCm
devices.
facebook-github-bot pushed a commit that referenced this pull request Jun 18, 2019
Summary:
PR #20685 incorrectly only enabled P2P access for non-contiguous copies.
This can make cudaMemcpy slow for inter-gpu copies, especially on ROCm
devices.  I didn't notice a difference on CUDA 10, but ngimel says it's
important for CUDA too.
Pull Request resolved: #21872

Differential Revision: D15863965

Pulled By: colesbury

fbshipit-source-id: 0a858f3c338fa2a5d05949d7f65fc05a70a9dfe1
zdevito pushed a commit to zdevito/ATen that referenced this pull request Jun 18, 2019
Summary:
PR pytorch/pytorch#20685 incorrectly only enabled P2P access for non-contiguous copies.
This can make cudaMemcpy slow for inter-gpu copies, especially on ROCm
devices.  I didn't notice a difference on CUDA 10, but ngimel says it's
important for CUDA too.
Pull Request resolved: pytorch/pytorch#21872

Differential Revision: D15863965

Pulled By: colesbury

fbshipit-source-id: 0a858f3c338fa2a5d05949d7f65fc05a70a9dfe1
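
A minimal sketch of what the follow-up change amounts to (assumed; maybe_enable_p2p is a made-up name, and the real code differs in error handling and caching):

```cpp
#include <cuda_runtime.h>

// Sketch: enable peer access for every inter-GPU copy, contiguous or not,
// so that cudaMemcpyAsync can take a direct P2P path.
void maybe_enable_p2p(int dst_device, int src_device) {
  if (dst_device == src_device) return;
  int can_access = 0;
  cudaDeviceCanAccessPeer(&can_access, dst_device, src_device);
  if (can_access) {
    cudaSetDevice(dst_device);
    // Returns cudaErrorPeerAccessAlreadyEnabled if access was enabled
    // earlier; a real implementation would check for and clear that error
    // rather than ignore the return value.
    cudaDeviceEnablePeerAccess(src_device, /*flags=*/0);
  }
}
```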
