
Conversation

@colesbury
Member

PR #20685 incorrectly enabled P2P access only for non-contiguous copies.
This can make cudaMemcpy slow for inter-GPU copies, especially on ROCm
devices. I didn't notice a difference on CUDA 10, but @ngimel says it's
important for CUDA too.
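
For context, here is a minimal sketch of what enabling P2P before a contiguous inter-GPU copy looks like at the CUDA runtime level. This is illustrative only, not the PyTorch code touched by this PR; the device IDs and buffer size are assumed:

```cpp
// Illustrative sketch, not PyTorch's implementation: enable P2P access
// between two assumed devices before a contiguous inter-GPU copy, so
// cudaMemcpyPeer can take the direct peer path instead of staging
// through host memory.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int src_dev = 0, dst_dev = 1;  // assumed device IDs
  int can_access = 0;
  cudaDeviceCanAccessPeer(&can_access, dst_dev, src_dev);
  if (can_access) {
    cudaSetDevice(dst_dev);
    // Enable once per (device, peer) pair; the flags argument must be 0.
    cudaError_t err = cudaDeviceEnablePeerAccess(src_dev, 0);
    if (err != cudaSuccess && err != cudaErrorPeerAccessAlreadyEnabled) {
      fprintf(stderr, "enable peer access: %s\n", cudaGetErrorString(err));
    }
  }

  const size_t nbytes = (1 << 20) * sizeof(float);  // assumed buffer size
  float *src = nullptr, *dst = nullptr;
  cudaSetDevice(src_dev);
  cudaMalloc(&src, nbytes);
  cudaSetDevice(dst_dev);
  cudaMalloc(&dst, nbytes);

  // Contiguous inter-GPU copy: takes the direct P2P path when peer
  // access is enabled; otherwise it may bounce through host memory.
  cudaMemcpyPeer(dst, dst_dev, src, src_dev, nbytes);
  cudaDeviceSynchronize();

  cudaFree(dst);
  cudaSetDevice(src_dev);
  cudaFree(src);
  return 0;
}
```

HIP mirrors these runtime calls (e.g. hipDeviceEnablePeerAccess, hipMemcpyPeer), which is consistent with the regression being especially visible on ROCm devices.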

@pytorchbot added the module: cuda and module: operators labels on Jun 17, 2019
@colesbury requested a review from soumith on Jun 17, 2019 21:36
Contributor

@facebook-github-bot left a comment


@colesbury has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@colesbury
Member Author

I've run the ImageNet ResNet-18 example (without data loading) on four Vega 20 cards. Perf is ~0.113 ms per batch of 256, vs. ~0.175 ms/batch before this PR. (Before 320c385, perf was ~0.118 ms/batch.)

@facebook-github-bot
Contributor

@colesbury merged this pull request in cc4498a.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jun 18, 2019
Summary:
PR pytorch/pytorch#20685 incorrectly enabled P2P access only for non-contiguous copies.
This can make cudaMemcpy slow for inter-GPU copies, especially on ROCm
devices. I didn't notice a difference on CUDA 10, but ngimel says it's
important for CUDA too.
Pull Request resolved: pytorch/pytorch#21872

Differential Revision: D15863965

Pulled By: colesbury

fbshipit-source-id: 0a858f3c338fa2a5d05949d7f65fc05a70a9dfe1
