
Conversation

@madsbk (Contributor) commented Jun 20, 2019

Re-implementation of the `embedding_dense_backward_cuda()` and `embedding_bag_backward_cuda_sum_avg()` functions.

Performance

Running a [Mortgage Workflow](https://github.com/EvenOldridge/MortgageWorkflowA) with a block size of 100K on a DGX-2 (single GPU), we see a ~2.8x speedup:

```
Original version:    370,168 examples/s
Optimized version: 1,034,228 examples/s
```

The original version is bottlenecked by `EmbeddingBag_accGradParametersKernel_sum_avg`, which takes 70% of the CUDA execution time. In the optimized version, the replacement kernel takes only 17% of the time.

Greater Numerical Stability

An added benefit is greater numerical stability. Instead of doing a flat sum where a single variable accumulates the weights, this code uses two steps: each GPU thread computes a sub-result covering at most `NROWS_PER_THREAD` rows before the final result is accumulated.
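To illustrate the two-step scheme, here is a minimal, self-contained CUDA sketch (illustrative names and a deliberately trivial final reduction, not the actual PyTorch kernel): each thread folds at most `NROWS_PER_THREAD` values into one partial sum, and a second kernel reduces the partials.

```
// Minimal sketch of two-step accumulation; names are illustrative, not PyTorch's.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int NROWS_PER_THREAD = 10;

// Step 1: each thread sums at most NROWS_PER_THREAD values into one partial.
__global__ void partial_sums(const float* vals, int n,
                             float* partials, int n_partials) {
  int p = blockIdx.x * blockDim.x + threadIdx.x;
  if (p >= n_partials) return;
  float acc = 0.0f;
  int begin = p * NROWS_PER_THREAD;
  int end = min(begin + NROWS_PER_THREAD, n);
  for (int i = begin; i < end; ++i) acc += vals[i];
  partials[p] = acc;  // sub-result per thread
}

// Step 2: accumulate the partials (single thread here, for clarity only).
__global__ void final_sum(const float* partials, int n_partials, float* out) {
  float acc = 0.0f;
  for (int i = 0; i < n_partials; ++i) acc += partials[i];
  *out = acc;
}

int main() {
  const int n = 1 << 20;
  const int n_partials = (n + NROWS_PER_THREAD - 1) / NROWS_PER_THREAD;
  float *vals, *partials, *out;
  cudaMallocManaged(&vals, n * sizeof(float));
  cudaMallocManaged(&partials, n_partials * sizeof(float));
  cudaMallocManaged(&out, sizeof(float));
  for (int i = 0; i < n; ++i) vals[i] = 1e-4f;

  partial_sums<<<(n_partials + 255) / 256, 256>>>(vals, n, partials, n_partials);
  final_sum<<<1, 1>>>(partials, n_partials, out);
  cudaDeviceSynchronize();

  // Short per-thread chains lose less precision than one long flat sum.
  printf("sum = %f (exact: %f)\n", *out, n * 1e-4f);
  cudaFree(vals); cudaFree(partials); cudaFree(out);
  return 0;
}
```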

@pytorchbot added labels: module: cuda, module: nn, module: operators — Jun 20, 2019
@madsbk force-pushed the embedding_kernel branch from 91d9ffa to 30a21f0 on June 20, 2019 14:08
@madsbk marked this pull request as ready for review on June 20, 2019 14:09
@mrshenli (Contributor) commented:

@pytorchbot retest this please

@facebook-github-bot (Contributor) left a comment

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mrshenli (Contributor) left a comment

The numbers look awesome! Given that this adds extra complexity to speed up cases with a lot of duplicated indices, will it be slower if there is no duplication at all?

@mrshenli (Contributor) commented:

@madsbk, is there any existing test that checks whether the results are correct, or do we need to add some? It's quite non-trivial (thanks for implementing it!!) and I am not confident that a review alone is sufficient to guarantee correctness.

@soumith added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) — Jun 25, 2019
@madsbk (Contributor, Author) commented Jun 25, 2019

@mrshenli, `test_nn.py` has some tests of both bag and non-bag embeddings; I have been running `py.test test/test_nn.py -k Embedding` while developing. However, there is no test of the `padding_idx` argument in `test_nn.py`; it might be tested somewhere else?

@madsbk force-pushed the embedding_kernel branch from f8ade45 to 873092c on June 25, 2019 10:00
@madsbk (Contributor, Author) commented Jun 25, 2019

@mrshenli, this PR introduces a minor overhead when there is no duplication at all, but I do not think it is worth keeping the old code around for that special case. I suspect that the cost of checking for the no-duplicates case would be similar to the current overhead. Handling it specially would, however, reduce memory usage, since it avoids the temporary tensor `grad_weight_per_segment`.

@facebook-github-bot (Contributor) left a comment

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mrshenli (Contributor) left a comment

Thanks for contributing!

BTW, I saw there is a `test_embedding_padding_idx` in `test_nn.py` for non-bag embeddings.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jun 25, 2019

Summary: (same as the PR description above)

Pull Request resolved: pytorch/pytorch#22016

Differential Revision: D15944339

Pulled By: mrshenli

fbshipit-source-id: 398d5f48826a017fc4b31c24c3f8b56d01830bf0
@facebook-github-bot (Contributor) commented:

@mrshenli merged this pull request in 94e83da.

@ngimel (Collaborator) left a comment

For fp16, this truncates intermediate gradients per segment to fp16, which can hurt accuracy and lead to overflows/underflows. Also, this launches many more kernels than the original implementation, which will likely hurt performance in CPU-bound cases. Finally, device_vector with the default allocator should never be used.
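The fp16 concern was addressed in the follow-up PR #22401 (referenced below) by keeping the temporary in float32. A hedged sketch of that accumulate-in-fp32 pattern, with illustrative names and data, not the actual kernel:

```
// Illustrative only, not PyTorch's kernel: accumulate fp16 terms in an fp32
// register, converting back to fp16 once at the end. A flat fp16 sum of 0.5s
// would stall near 2048, where 0.5 drops below the fp16 representable spacing.
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

__global__ void sum_half_in_float(const __half* grad, int n, __half* out) {
  float acc = 0.0f;                    // fp32 accumulator
  for (int i = 0; i < n; ++i)
    acc += __half2float(grad[i]);      // upcast each fp16 term
  *out = __float2half(acc);            // downcast once, at the end
}

int main() {
  const int n = 40000;
  __half *grad, *out;
  cudaMallocManaged(&grad, n * sizeof(__half));
  cudaMallocManaged(&out, sizeof(__half));
  for (int i = 0; i < n; ++i) grad[i] = __float2half(0.5f);
  sum_half_in_float<<<1, 1>>>(grad, n, out);
  cudaDeviceSynchronize();
  printf("sum = %.1f (exact: %.1f)\n", __half2float(*out), 0.5 * n);
  cudaFree(grad); cudaFree(out);
  return 0;
}
```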

```
    partials_per_segment_offset.begin());

// The total number of partial-segments is the sum of `partials_per_segment`
// (the last exclusive-scan offset plus the last count).
const int num_of_partial_segments = partials_per_segment[num_of_segments-1] +
    partials_per_segment_offset[num_of_segments-1];
```
@ngimel (Collaborator) commented on this hunk:

Direct accesses to device_vector elements result in a cudaMemcpy each, synchronizing no less. They should be used very sparingly, if there's literally no way around it.
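For context, here is a self-contained toy reconstruction of the scan-plus-tail pattern from the hunk above (the data and values are made up). Each host-side `operator[]` read below is exactly the kind of synchronizing device-to-host copy this comment warns about:

```
// Toy reconstruction of the scan-plus-tail pattern; data and names are
// illustrative. Each host-side operator[] read on a device_vector triggers
// a synchronizing device-to-host copy, which is why such reads should be rare.
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/scan.h>

int main() {
  const int num_of_segments = 4;
  thrust::host_vector<int> h_counts(num_of_segments);
  h_counts[0] = 3; h_counts[1] = 1; h_counts[2] = 4; h_counts[3] = 2;

  thrust::device_vector<int> partials_per_segment = h_counts;
  thrust::device_vector<int> partials_per_segment_offset(num_of_segments);

  // Exclusive scan: offsets become {0, 3, 4, 8}.
  thrust::exclusive_scan(partials_per_segment.begin(),
                         partials_per_segment.end(),
                         partials_per_segment_offset.begin());

  // Total = last count + last exclusive-scan offset = 2 + 8 = 10.
  const int num_of_partial_segments =
      partials_per_segment[num_of_segments - 1] +
      partials_per_segment_offset[num_of_segments - 1];

  printf("num_of_partial_segments = %d\n", num_of_partial_segments);
  return 0;
}
```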

facebook-github-bot pushed a commit that referenced this pull request Jul 2, 2019

Summary:
Address the issue raised in #22377.

PR #22016 introduced a temporary tensor `grad_weight_per_segment` with the same dtype as the end result, which can be a problem when using `float16`. With this PR, a `float32` temporary tensor is used when the input is `float16`.

ngimel, can I get you to review? I think I have fixed the issues you pointed out.
Pull Request resolved: #22401

Differential Revision: D16077319

Pulled By: mrshenli

fbshipit-source-id: 7cfad7f40b4d41a244052baa2982ab51bbbd7309
zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 2, 2019 (same commit message as above; pytorch/pytorch#22401, Differential Revision: D16077319)

xzhu1900 pushed a commit to xzhu1900/pytorch that referenced this pull request Jul 5, 2019 (same commit message as above; pytorch#22401, Differential Revision: D16077319)