
[CUDA] BiasSoftmax Supporting New Pattern#12361

Merged
Lafi7e merged 7 commits into master from weicwang/bias_softmax on Aug 4, 2022

Conversation

Lafi7e (Contributor) commented on Jul 28, 2022

The current BiasSoftmax fusion requires that the broadcast dimension of the bias input is in the middle, i.e., the input shape is [x, y, z] and the bias shape is [x, 1, z] (here x, y, z can each stand for multiple dimensions of the same sizes). In the MoE model we found a case where the input shape is [x, y, z] and the bias shape is [1, y, z], which cannot be handled currently. This PR adds support for that case.
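The two patterns can be sketched with numpy broadcasting (a minimal illustration; the shapes below are example values, not taken from the PR):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 16)).astype(np.float32)

# Previously supported (inner-broadcast) pattern: bias shape [x, 1, z],
# i.e. the broadcast dimension is in the middle.
bias_inner = rng.standard_normal((4, 1, 16)).astype(np.float32)
out_inner = softmax(x + bias_inner)

# New (outer-broadcast) pattern from the MoE model: bias shape [1, y, z],
# i.e. the bias is broadcast across the leading dimension.
bias_outer = rng.standard_normal((1, 8, 16)).astype(np.float32)
out_outer = softmax(x + bias_outer)
```

In both cases the fused kernel computes the same result as the unfused add followed by softmax; only the stride pattern used to read the bias differs.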

The PR also refactored the BiasSoftmax code:

  • Change the kernel function signature to get rid of the Tensor class, so that it can be called from other kernels as a sub-part;
  • Remove the duplicated files for ROCm and use hipify to handle them.

For perf, the changes have no impact on the old pattern. For the new pattern, testing softmax(x[512,512,512] + y[1,512,512]) on a V100, the profiling results below show the fused version is 1.8x faster:
before: (profiling screenshot)

after: (profiling screenshot)

Lafi7e added the training label (issues related to ONNX Runtime training; typically submitted using template) on Jul 28, 2022
.Attr("axis", "apply softmax to elements for dimensions axis or higher", AttributeProto::INT, static_cast<int64_t>(1))
.Attr("is_inner_broadcast", "true if broadcast bias across input for dimensions broadcast_axis to axis-1, "
"otherwise broadcast bias across input for dimensions 0 to broadcast_axis - 1",
AttributeProto::INT)
askhade (Contributor) commented on Aug 3, 2022:

"otherwise broadcast bias across input for dimensions 0 to broadcast_axis - 1" what is broadcast_axis here?

Do we need to support backward compatibility for this op? Because this change does not look backward compatible.

Lafi7e (Contributor, Author) commented:

I just used "broadcast_axis" here to explain the idea. The previous broadcast_axis attribute in the schema is actually useless: the code used it to calculate the broadcast size, but that size is just input_size / bias_size. This was a design flaw from the beginning. As for backward compatibility, I think it's OK for this op: it is created by fusion only, so it cannot appear in any existing graph, and it is CUDA-only; it has no CPU kernel, so the KernelDef hash list doesn't contain its hash.
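The point that the broadcast size falls out of the shapes alone can be shown with a tiny sketch (the shapes are the ones from the perf test above; the variable names are illustrative, not from the PR code):

```python
import math

# Input and outer-broadcast bias shapes from the perf example:
# softmax(x[512, 512, 512] + y[1, 512, 512]).
input_shape = (512, 512, 512)
bias_shape = (1, 512, 512)

input_size = math.prod(input_shape)
bias_size = math.prod(bias_shape)

# How many times the bias tile repeats over the input: no separate
# broadcast_axis attribute is needed to derive this.
broadcast_count = input_size // bias_size
```

Since input_size is always a whole multiple of bias_size for a valid broadcast, integer division is exact here.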

@Lafi7e Lafi7e merged commit 37995a7 into master Aug 4, 2022
@Lafi7e Lafi7e deleted the weicwang/bias_softmax branch August 4, 2022 22:59