Optimization of the Embedding and Embedding-Bag CUDA Kernel #22016
Conversation
@pytorchbot retest this please
facebook-github-bot
left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
mrshenli
left a comment
The numbers look awesome! Given that this adds extra complexity to speed up cases with a lot of duplicated indices, will it be slower if there is no duplication at all?
@madsbk is there any existing test that checks if the results are correct, or do we need to add some? It's quite non-trivial (thanks for implementing it!!) and I am not confident that a review is sufficient to guarantee the correctness.
Since the non-bag version now also uses the optimized version, this is not needed anymore.
@mrshenli, this PR introduces a minor overhead if there is no duplication at all, but I do not think it is worth keeping the old code around for that special case.
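The duplicated-index case discussed in this thread can be sketched on the host in NumPy: sort the indices so duplicates become contiguous segments, then reduce each segment once instead of issuing one atomic add per duplicate. All names and sizes below are illustrative, not the kernel's actual code:

```python
import numpy as np

# Toy stand-ins (made up for illustration): 6 gradient rows scattered
# into a 3-row embedding table, with heavy index duplication.
indices = np.array([2, 0, 2, 2, 1, 0])
grad_out = np.arange(6, dtype=np.float32).reshape(6, 1) + 1.0

# Sort by index so duplicates become contiguous "segments" ...
order = np.argsort(indices, kind="stable")
sorted_idx = indices[order]
# ... then find where each segment starts ...
seg_starts = np.flatnonzero(np.r_[True, sorted_idx[1:] != sorted_idx[:-1]])
# ... and reduce each whole segment in one pass instead of one
# scattered add per duplicated index.
seg_sums = np.add.reduceat(grad_out[order], seg_starts, axis=0)

grad_weight = np.zeros((3, 1), dtype=np.float32)
grad_weight[sorted_idx[seg_starts]] = seg_sums
```

With no duplication every segment has length one, so the sort and segmentation add work without saving any reductions, which is the minor overhead being discussed.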
facebook-github-bot
left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
mrshenli
left a comment
Thanks for contributing!
BTW, I saw there is a test_embedding_padding_idx in test_nn.py for non-bag embeddings.
Summary: Re-implementation of the `embedding_dense_backward_cuda()` and `embedding_bag_backward_cuda_sum_avg()` functions.

#### Performance

Running a [Mortgage Workflow](https://github.com/EvenOldridge/MortgageWorkflowA) with a block size of 100K on a DGX-2 (single GPU), we see a ~2.8x speedup:

```
Original version:   370,168 examples/s
Optimized version: 1,034,228 examples/s
```

The original version is bounded by `EmbeddingBag_accGradParametersKernel_sum_avg`, which takes 70% of the CUDA execution time. In the optimized version, the optimized kernel takes only 17% of the time.

#### Greater Numerical Stability

An added benefit is greater numerical stability. Instead of a flat sum where a single variable accumulates all the weights, this code uses two steps: each GPU thread computes a sub-result covering at most `NROWS_PER_THREAD` rows before the final result is accumulated.

Pull Request resolved: pytorch/pytorch#22016
Differential Revision: D15944339
Pulled By: mrshenli
fbshipit-source-id: 398d5f48826a017fc4b31c24c3f8b56d01830bf0
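The two-step accumulation described in the summary can be sketched in plain NumPy; the chunk size and data here are made up for illustration and are not the kernel's actual constants:

```python
import numpy as np

NROWS_PER_THREAD = 4  # illustrative chunk size, not the kernel's constant

def two_step_sum(rows):
    # Step 1: each "thread" sums at most NROWS_PER_THREAD rows into a partial.
    partials = [rows[i:i + NROWS_PER_THREAD].sum()
                for i in range(0, len(rows), NROWS_PER_THREAD)]
    # Step 2: the partial results are accumulated into the final sum.
    return float(np.sum(partials))

rows = np.full(10, 0.1)
```

Shorter per-accumulator chains keep intermediate sums closer in magnitude to the addends, which is where the stability benefit comes from.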
ngimel
left a comment
For fp16, this truncates intermediate gradients per segment to fp16, which can hurt accuracy and lead to overflows/underflows. Also, this launches many more kernels than the original implementation, which will likely hurt performance in CPU-bound cases. Finally, `device_vector` with the default allocator should never be used.
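The fp16 concern can be reproduced on the CPU with NumPy: a `float16` accumulator stalls once the running sum's spacing exceeds the addend. This is a pure illustration, unrelated to the actual kernel code:

```python
import numpy as np

# float16 accumulator: past 2048 the fp16 spacing is 2.0, so adding 1.0
# no longer changes the running sum and the result silently stalls.
acc16 = np.float16(0.0)
for _ in range(4096):
    acc16 = np.float16(acc16 + np.float16(1.0))

# float32 accumulator: exact for this range.
acc32 = np.float32(0.0)
for _ in range(4096):
    acc32 += np.float32(1.0)

print(acc16, acc32)  # 2048.0 4096.0
```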
Review comment on this hunk:

```
    partials_per_segment_offset.begin());

// The total number of partial-segments is the sum of `partials_per_segment_offset`
const int num_of_partial_segments = partials_per_segment[num_of_segments-1] +
```
Direct accesses to `device_vector` result in a `cudaMemcpy`, synchronizing no less. They should be used very sparingly, if there's literally no way around it.
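For intuition, the counting that the quoted hunk performs can be sketched on the host; the segment sizes and `NROWS_PER_THREAD` value below are made up for illustration:

```python
import numpy as np

# Illustrative sizes: four segments of duplicated indices.
segment_sizes = np.array([5, 1, 7, 3])
NROWS_PER_THREAD = 2  # made-up value for illustration

# Ceil-divide: how many partial segments each segment splits into.
partials_per_segment = -(-segment_sizes // NROWS_PER_THREAD)        # [3, 1, 4, 2]

# Exclusive prefix sum over the counts (the scan in the hunk).
partials_per_segment_offset = np.concatenate(
    ([0], np.cumsum(partials_per_segment)[:-1]))                    # [0, 3, 4, 8]

# Total partial segments = last offset + last count,
# matching the quoted `num_of_partial_segments` expression.
num_of_partial_segments = int(partials_per_segment[-1]
                              + partials_per_segment_offset[-1])    # 10
```

Reading the last offset and count on the host is exactly the kind of per-element `device_vector` access the review flags: on the GPU each read implies a synchronizing device-to-host copy.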
Summary: Address the issue raised in #22377. PR #22016 introduced a temporary tensor of weights, `grad_weight_per_segment`, with the same dtype as the end result, which can be a problem when using `float16`. With this PR, a `float32` temporary tensor is used when the input is `float16`. ngimel, can I get you to review? I think I have fixed the issues you pointed out.

Pull Request resolved: #22401
Differential Revision: D16077319
Pulled By: mrshenli
fbshipit-source-id: 7cfad7f40b4d41a244052baa2982ab51bbbd7309
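The shape of the fix can be mimicked in NumPy: keep the temporary accumulation in `float32` and cast down to `float16` only once at the end. This is illustrative only; the real change lives inside the CUDA kernels:

```python
import numpy as np

# Stand-in for per-segment fp16 gradient pieces (made-up data).
parts = np.full(4096, 1.0, dtype=np.float16)

acc = np.float32(0.0)        # float32 temporary, as in the fix
for p in parts:
    acc += np.float32(p)     # widen each piece before accumulating
grad = np.float16(acc)       # cast down only once, at the end

print(grad)  # 4096.0 -- a float16 accumulator would have stalled at 2048.0
```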