Optimization of the Embedding and Embedding-Bag CUDA Kernel #22016
Conversation
@pytorchbot retest this please
facebook-github-bot
left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
mrshenli
left a comment
The numbers look awesome! Given that this adds extra complexity to speed up cases with a lot of duplicated indices, will it be slower if there is no duplication at all?
@madsbk is there any existing test that checks if the results are correct, or do we need to add some? It's quite non-trivial (thanks for implementing it!!) and I am not confident that a review is sufficient to guarantee the correctness.
Since the non-bag version now also uses the optimized version, this is not needed anymore.
@mrshenli, this PR introduces a minor overhead if there is no duplication at all, but I do not think it is worth keeping the old code around for that special case.
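The duplicated-index case discussed in this thread can be sketched on the host in NumPy: sort the indices so duplicates become contiguous segments, then reduce each segment once instead of issuing one atomic add per duplicate. All names and sizes below are illustrative, not the kernel's actual code:

```python
import numpy as np

# Toy stand-ins (made up for illustration): 6 gradient rows scattered
# into a 3-row embedding table, with heavy index duplication.
indices = np.array([2, 0, 2, 2, 1, 0])
grad_out = np.arange(6, dtype=np.float32).reshape(6, 1) + 1.0

# Sort by index so duplicates become contiguous "segments" ...
order = np.argsort(indices, kind="stable")
sorted_idx = indices[order]
# ... then find where each segment starts ...
seg_starts = np.flatnonzero(np.r_[True, sorted_idx[1:] != sorted_idx[:-1]])
# ... and reduce each whole segment in one pass instead of one
# scattered add per duplicated index.
seg_sums = np.add.reduceat(grad_out[order], seg_starts, axis=0)

grad_weight = np.zeros((3, 1), dtype=np.float32)
grad_weight[sorted_idx[seg_starts]] = seg_sums
```

With no duplication every segment has length one, so the sort and segmentation add work without saving any reductions, which is the minor overhead being discussed.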
facebook-github-bot
left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
mrshenli
left a comment
Thanks for contributing!
BTW, I saw there is a test_embedding_padding_idx in test_nn.py for non-bag embeddings.
Summary: Re-implementation of the `embedding_dense_backward_cuda()` and `embedding_bag_backward_cuda_sum_avg()` functions.

#### Performance

Running a [Mortgage Workflow](https://github.com/EvenOldridge/MortgageWorkflowA) with a block size of 100K on a DGX-2 (single GPU), we see a ~2.8x speedup:

```
Original version:   370,168 examples/s
Optimized version: 1,034,228 examples/s
```

The original version is bounded by `EmbeddingBag_accGradParametersKernel_sum_avg`, which takes 70% of the CUDA execution time. In the optimized version, the optimized kernel takes only 17% of the time.

#### Greater Numerical Stability

An added benefit is greater numerical stability. Instead of a flat sum where a single variable accumulates all the weights, this code uses two steps: each GPU thread computes a sub-result covering at most `NROWS_PER_THREAD` rows before the final result is accumulated.

Pull Request resolved: pytorch/pytorch#22016
Differential Revision: D15944339
Pulled By: mrshenli
fbshipit-source-id: 398d5f48826a017fc4b31c24c3f8b56d01830bf0
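The two-step accumulation described in the summary can be sketched in plain NumPy; the chunk size and data here are made up for illustration and are not the kernel's actual constants:

```python
import numpy as np

NROWS_PER_THREAD = 4  # illustrative chunk size, not the kernel's constant

def two_step_sum(rows):
    # Step 1: each "thread" sums at most NROWS_PER_THREAD rows into a partial.
    partials = [rows[i:i + NROWS_PER_THREAD].sum()
                for i in range(0, len(rows), NROWS_PER_THREAD)]
    # Step 2: the partial results are accumulated into the final sum.
    return float(np.sum(partials))

rows = np.full(10, 0.1)
```

Shorter per-accumulator chains keep intermediate sums closer in magnitude to the addends, which is where the stability benefit comes from.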
ngimel
left a comment
For fp16, this truncates intermediate gradients per segment to fp16, which can hurt accuracy and lead to overflows/underflows. Also, this launches many more kernels than the original implementation, which will likely hurt performance in CPU-bound cases. Finally, `device_vector` with the default allocator should never be used.
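The fp16 concern can be reproduced on the CPU with NumPy: a `float16` accumulator stalls once the running sum's spacing exceeds the addend. This is a pure illustration, unrelated to the actual kernel code:

```python
import numpy as np

# float16 accumulator: past 2048 the fp16 spacing is 2.0, so adding 1.0
# no longer changes the running sum and the result silently stalls.
acc16 = np.float16(0.0)
for _ in range(4096):
    acc16 = np.float16(acc16 + np.float16(1.0))

# float32 accumulator: exact for this range.
acc32 = np.float32(0.0)
for _ in range(4096):
    acc32 += np.float32(1.0)

print(acc16, acc32)  # 2048.0 4096.0
```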
Review comment on this hunk:

```
    partials_per_segment_offset.begin());

// The total number of partial-segments is the sum of `partials_per_segment_offset`
const int num_of_partial_segments = partials_per_segment[num_of_segments-1] +
```
Direct accesses to `device_vector` result in a `cudaMemcpy`, synchronizing no less. They should be used very sparingly, if there's literally no way around it.
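For intuition, the counting that the quoted hunk performs can be sketched on the host; the segment sizes and `NROWS_PER_THREAD` value below are made up for illustration:

```python
import numpy as np

# Illustrative sizes: four segments of duplicated indices.
segment_sizes = np.array([5, 1, 7, 3])
NROWS_PER_THREAD = 2  # made-up value for illustration

# Ceil-divide: how many partial segments each segment splits into.
partials_per_segment = -(-segment_sizes // NROWS_PER_THREAD)        # [3, 1, 4, 2]

# Exclusive prefix sum over the counts (the scan in the hunk).
partials_per_segment_offset = np.concatenate(
    ([0], np.cumsum(partials_per_segment)[:-1]))                    # [0, 3, 4, 8]

# Total partial segments = last offset + last count,
# matching the quoted `num_of_partial_segments` expression.
num_of_partial_segments = int(partials_per_segment[-1]
                              + partials_per_segment_offset[-1])    # 10
```

Reading the last offset and count on the host is exactly the kind of per-element `device_vector` access the review flags: on the GPU each read implies a synchronizing device-to-host copy.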
Summary: Address the issue raised in #22377. PR #22016 introduced a temporary tensor of weights, `grad_weight_per_segment`, with the same dtype as the end result, which can be a problem when using `float16`. With this PR, a `float32` temporary tensor is used when the input is `float16`. ngimel, can I get you to review? I think I have fixed the issues you pointed out.

Pull Request resolved: #22401
Differential Revision: D16077319
Pulled By: mrshenli
fbshipit-source-id: 7cfad7f40b4d41a244052baa2982ab51bbbd7309
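The shape of the fix can be mimicked in NumPy: keep the temporary accumulation in `float32` and cast down to `float16` only once at the end. This is illustrative only; the real change lives inside the CUDA kernels:

```python
import numpy as np

# Stand-in for per-segment fp16 gradient pieces (made-up data).
parts = np.full(4096, 1.0, dtype=np.float16)

acc = np.float32(0.0)        # float32 temporary, as in the fix
for p in parts:
    acc += np.float32(p)     # widen each piece before accumulating
grad = np.float16(acc)       # cast down only once, at the end

print(grad)  # 4096.0 -- a float16 accumulator would have stalled at 2048.0
```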