[CUDA] BiasSoftmax Supporting New Pattern #12361
Conversation
```cpp
.Attr("axis", "apply softmax to elements for dimensions axis or higher", AttributeProto::INT, static_cast<int64_t>(1))
.Attr("is_inner_broadcast",
      "true if broadcast bias across input for dimensions broadcast_axis to axis-1, "
      "otherwise broadcast bias across input for dimensions 0 to broadcast_axis - 1",
      AttributeProto::INT)
```
"otherwise broadcast bias across input for dimensions 0 to broadcast_axis - 1" what is broadcast_axis here?
Do we need to support backward compatibility for this OP? Becasue this change does not look backward compatible?
I just used "broadcast_axis" here to explain the idea. The previous broadcast_axis attribute in the schema is actually useless: the code used it to calculate the broadcast size, but that size is just input_size / bias_size. This was a design flaw from the beginning. As for backward compatibility, I think it's OK for this op: it is created by fusion only, so it cannot appear in any existing graph, and it is CUDA-only — it has no CPU kernel, so its hash is not in the KernelDef hash list.
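The point that broadcast_axis is redundant can be shown arithmetically: the number of times the bias repeats across the input is fully determined by the two element counts. A minimal sketch in plain Python (not the actual kernel code), using the shapes from this PR's benchmark:

```python
from math import prod

# Shapes from the PR's benchmark: softmax(x[512,512,512] + y[1,512,512]).
input_shape = [512, 512, 512]
bias_shape = [1, 512, 512]

# broadcast count = input_size / bias_size; no separate broadcast_axis
# attribute is needed to derive it.
broadcast_count = prod(input_shape) // prod(bias_shape)
print(broadcast_count)  # 512
```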
The current BiasSoftmax fusion requires that the broadcast dimensions of the bias input be in the middle, i.e., the input shape is [x, y, z] and the bias shape is [x, 1, z] (where x, y, z can each stand for multiple dimensions). In the MoE model we found that the input shape is [x, y, z] while the bias shape is [1, y, z], which cannot be handled currently. This PR adds support for this case.
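The two layouts can be distinguished purely from the shapes. A sketch in plain Python (`bias_broadcast_kind` is a hypothetical helper for illustration; the real check lives in the C++ fusion code):

```python
def bias_broadcast_kind(input_shape, bias_shape):
    """Classify how a bias broadcasts against the softmax input.

    Returns "inner" for the old pattern (broadcast dims in the middle,
    e.g. input [x, y, z] + bias [x, 1, z]) and "outer" for the new
    pattern added by this PR (leading dims broadcast, e.g.
    input [x, y, z] + bias [1, y, z]). Equal shapes fall out as a
    degenerate "inner" case.
    """
    assert len(input_shape) == len(bias_shape)
    pairs = list(zip(input_shape, bias_shape))
    # Strip dims that match from the front...
    lead = 0
    while lead < len(pairs) and pairs[lead][0] == pairs[lead][1]:
        lead += 1
    # ...and from the back; what remains must be all-1 bias dims.
    trail = len(pairs)
    while trail > lead and pairs[trail - 1][0] == pairs[trail - 1][1]:
        trail -= 1
    middle = pairs[lead:trail]
    if all(b == 1 for _, b in middle):
        return "inner" if lead > 0 else "outer"
    raise ValueError("unsupported bias shape for BiasSoftmax fusion")

print(bias_broadcast_kind([512, 512, 512], [512, 1, 512]))  # inner
print(bias_broadcast_kind([512, 512, 512], [1, 512, 512]))  # outer
```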
The PR also refactors the BiasSoftmax code.
For performance, the changes have no impact on the old pattern. For the new pattern, testing softmax(x[512,512,512] + y[1,512,512]) on a V100, the profiling results below show the fused version is 1.8x faster:

(profiling screenshots: before / after)
