Refactor CUDA copy and general copy dispatch #20685

Conversation
Copy.cu goes from 308 to 190 lines of code. In general it uses the same copy strategy: cudaMemcpyAsync, a pointwise kernel, or a copy using temporary buffers. The pointwise kernel has slightly improved performance when broadcasting due to faster index calculation. This deletes "s_copy_", "_s_copy_from", and "_copy_same_type_". The only entry point now is "copy_". A mini-benchmark is here: https://gist.github.com/colesbury/706de1d4e8260afe046020988410b992 Before: https://gist.github.com/colesbury/ab454b6fe3791bff420d7bcf8c041f18 After: https://gist.github.com/colesbury/9024d242b56ab09a9ec985fa6d1620bc Results were measured on a 2.2 GHz Broadwell; no turbo; one thread; compiled with GCC 7.3.0. (Results are slower than typical usage due to turbo being off.) The only significant difference is in the CUDA [1024] -> [1024, 1024] broadcasting copy, which is ~25% faster. I don't expect a noticeable difference in real programs. CPU copy overhead is a tiny bit (~200 ns) faster, but I don't expect anyone to notice that.
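For illustration, a minimal sketch of the three-way strategy described above, in CUDA C++. The names `copy_sketch` and `pointwise_copy` are hypothetical, not the actual Copy.cu code; the real implementation also handles dtype conversion, cross-device copies, and arbitrary strides.

```cpp
// Hypothetical sketch of the copy dispatch: contiguous same-type copies go
// through cudaMemcpyAsync; strided/broadcast copies go through a pointwise
// kernel; everything else would be staged through a temporary buffer.
#include <cuda_runtime.h>
#include <cstddef>

__global__ void pointwise_copy(float* dst, const float* src, std::size_t n) {
  std::size_t i = blockIdx.x * static_cast<std::size_t>(blockDim.x) + threadIdx.x;
  if (i < n) dst[i] = src[i];  // the real kernel also maps strided/broadcast indices
}

void copy_sketch(float* dst, const float* src, std::size_t n,
                 bool both_contiguous, cudaStream_t stream) {
  if (both_contiguous) {
    // Fast path: one flat device-to-device memcpy on the current stream.
    cudaMemcpyAsync(dst, src, n * sizeof(float), cudaMemcpyDeviceToDevice, stream);
  } else {
    // Strided or broadcast layout: launch an element-wise kernel instead.
    const int threads = 256;
    const int blocks = static_cast<int>((n + threads - 1) / threads);
    pointwise_copy<<<blocks, threads, 0, stream>>>(dst, src, n);
  }
  // A third path (not shown) copies via a temporary contiguous buffer when
  // neither of the above applies, e.g. when a dtype conversion is needed.
}
```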
soumith left a comment:

lgtm
facebook-github-bot left a comment:

@colesbury has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Pull Request resolved: pytorch/pytorch#20685 Differential Revision: D15414819 Pulled By: colesbury fbshipit-source-id: d3c6e04a5020470e3bef15b1fc09503cae5df440
@colesbury merged this pull request in 320c385.

This appears to have broken XLA CI, cc @ailzhang

Oh, the XLA contbuild "runs" on PRs, but it only checks out the code (it does not build or run tests)...

This has broken XLA, in two ways. But there is some more trouble in there. Essentially,

@dlibenzi I'm confused by the cpu()/to() code path you describe. The PR changes copy, but not cpu() or to(). The error in the code path you describe sounds like it happens before copy_ is called -- that to() is creating a CPU tensor instead of an XLA tensor. What am I missing?

I am not sure whether this PR's changes are the culprit. I have not looked at the details. As described above, the
```cpp
Tensor& _s_copy__cpu(Tensor& self, const Tensor& src, bool non_blocking) {
  if (src.type_id() != CPUTensorId()) {
    _s_copy_from(src, self, non_blocking);
```
I believe this might be the path that broke XLA.
Was self an XLA tensor? Or did aten_xla_type hack it to work by copying into a CPU tensor?
src was an XLA tensor
PR pytorch#20685 incorrectly only enabled P2P access for non-contiguous copies. This can make cudaMemcpy slow for inter-GPU copies, especially on ROCm devices.
Summary: PR #20685 incorrectly only enabled P2P access for non-contiguous copies. This can make cudaMemcpy slow for inter-GPU copies, especially on ROCm devices. I didn't notice a difference on CUDA 10, but ngimel says it's important for CUDA too. Pull Request resolved: #21872 Differential Revision: D15863965 Pulled By: colesbury fbshipit-source-id: 0a858f3c338fa2a5d05949d7f65fc05a70a9dfe1
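For illustration, a hedged sketch of what "enable P2P access for all inter-GPU copies, contiguous ones included" can look like with the CUDA runtime API. The helper names below are hypothetical and are not the follow-up PR's actual code.

```cpp
// Enable peer access before every inter-GPU copy so cudaMemcpy does not
// fall back to a slow path (e.g. staging through host memory).
#include <cuda_runtime.h>
#include <cstddef>

void maybe_enable_p2p(int dst_device, int src_device) {
  if (dst_device == src_device) return;
  int can_access = 0;
  cudaDeviceCanAccessPeer(&can_access, dst_device, src_device);
  if (can_access) {
    // Simplification: does not restore the previously current device.
    cudaSetDevice(dst_device);
    // Returns cudaErrorPeerAccessAlreadyEnabled if already enabled; harmless.
    cudaDeviceEnablePeerAccess(src_device, 0);
  }
}

void copy_between_gpus(void* dst, int dst_device,
                       const void* src, int src_device,
                       std::size_t nbytes, cudaStream_t stream) {
  // Enable P2P in both directions before the copy, regardless of contiguity.
  maybe_enable_p2p(dst_device, src_device);
  maybe_enable_p2p(src_device, dst_device);
  cudaMemcpyPeerAsync(dst, dst_device, src, src_device, nbytes, stream);
}
```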