[CUDA] BiasSoftmax Supporting New Pattern #12361
Conversation
```cpp
.Attr("axis", "apply softmax to elements for dimensions axis or higher", AttributeProto::INT, static_cast<int64_t>(1))
.Attr("is_inner_broadcast",
      "true if broadcast bias across input for dimensions broadcast_axis to axis-1, "
      "otherwise broadcast bias across input for dimensions 0 to broadcast_axis - 1",
      AttributeProto::INT)
```
"otherwise broadcast bias across input for dimensions 0 to broadcast_axis - 1" what is broadcast_axis here?
Do we need to support backward compatibility for this OP? Becasue this change does not look backward compatible?
I just used "broadcast_axis" here to explain the idea. The previous broadcast_axis attribute in the schema is actually useless: the code used it to calculate the broadcast size, but that size is just input_size / bias_size. This was a design flaw from the beginning. As for backward compatibility, I think it's OK for this op: it is created by fusion only, so it cannot appear in any existing graph, and it is CUDA-only — it has no CPU kernel, so its hash is not in the KernelDef hash list.
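The point that broadcast_axis is redundant can be shown arithmetically: the number of times the bias repeats across the input is fully determined by the two element counts. A minimal sketch in plain Python (not the actual kernel code), using the shapes from this PR's benchmark:

```python
from math import prod

# Shapes from the PR's benchmark: softmax(x[512,512,512] + y[1,512,512]).
input_shape = [512, 512, 512]
bias_shape = [1, 512, 512]

# broadcast count = input_size / bias_size; no separate broadcast_axis
# attribute is needed to derive it.
broadcast_count = prod(input_shape) // prod(bias_shape)
print(broadcast_count)  # 512
```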
The current BiasSoftmax fusion requires that the broadcast dimensions of the bias input be in the middle, i.e., the input shape is [x, y, z] and the bias shape is [x, 1, z] (where x, y, z can each stand for multiple dimensions). In the MoE model we found that the input shape is [x, y, z] while the bias shape is [1, y, z], which cannot be handled currently. This PR adds support for this case.
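The two layouts can be distinguished purely from the shapes. A sketch in plain Python (`bias_broadcast_kind` is a hypothetical helper for illustration; the real check lives in the C++ fusion code):

```python
def bias_broadcast_kind(input_shape, bias_shape):
    """Classify how a bias broadcasts against the softmax input.

    Returns "inner" for the old pattern (broadcast dims in the middle,
    e.g. input [x, y, z] + bias [x, 1, z]) and "outer" for the new
    pattern added by this PR (leading dims broadcast, e.g.
    input [x, y, z] + bias [1, y, z]). Equal shapes fall out as a
    degenerate "inner" case.
    """
    assert len(input_shape) == len(bias_shape)
    pairs = list(zip(input_shape, bias_shape))
    # Strip dims that match from the front...
    lead = 0
    while lead < len(pairs) and pairs[lead][0] == pairs[lead][1]:
        lead += 1
    # ...and from the back; what remains must be all-1 bias dims.
    trail = len(pairs)
    while trail > lead and pairs[trail - 1][0] == pairs[trail - 1][1]:
        trail -= 1
    middle = pairs[lead:trail]
    if all(b == 1 for _, b in middle):
        return "inner" if lead > 0 else "outer"
    raise ValueError("unsupported bias shape for BiasSoftmax fusion")

print(bias_broadcast_kind([512, 512, 512], [512, 1, 512]))  # inner
print(bias_broadcast_kind([512, 512, 512], [1, 512, 512]))  # outer
```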
The PR also refactors the BiasSoftmax code.
For performance, the changes have no impact on the old pattern. For the new pattern, testing softmax(x[512,512,512] + y[1,512,512]) on a V100, the profiling results below show the fused version is 1.8x faster:

(profiling screenshots: before / after)
