Using vectorized loads (float2) for fp16 to improve performance#11390
hariharans29 merged 6 commits into microsoft:master
Conversation
There are a few warnings from cpplint. Please format the code (see .clang-format in the root directory). One quick way is to rename fast_gelu_impl.cu to fast_gelu_impl.cc, then use Visual Studio to format the file, then undo the rename.
```cpp
const float2* bias_cast = reinterpret_cast<const float2*>(bias);
float2* output_cast = reinterpret_cast<float2*>(output);
// ...
const half2 two2 = __floats2half2_rn(two, two);
```
Curious: Why not just [__float2half2_rn](https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH____HALF__MISC.html#group__CUDA__MATH____HALF__MISC_1ge40813c17ab4b0779764e2e5e3014019), since both halves are going to be populated with the same value?
Yes, thanks for pointing this out.
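For readers following this thread, a minimal sketch of the difference between the two intrinsics (both are real CUDA math API functions from `cuda_fp16.h`; the kernel and values here are only illustrative):

```cuda
#include <cuda_fp16.h>

__global__ void ConvertExample(half2* out) {
  const float two = 2.0f;
  // __floats2half2_rn packs two independent floats into one half2.
  half2 a = __floats2half2_rn(two, two);
  // __float2half2_rn broadcasts a single float into both halves of a half2,
  // which is simpler when both lanes hold the same constant.
  half2 b = __float2half2_rn(two);
  out[0] = a;  // both lanes: 2.0 in half precision
  out[1] = b;  // identical result to a
}
```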
```cpp
template <unsigned TPB>
__global__ void FastGeluKernel4Bias(const half2 a, const half2 b, const half2 c, int input_length, int bias_length,
                                    const half* input, const half* bias, half* output) {
  const half2 two2 = __float2half2_rn(two);
```
Hari mentioned that the following code could be moved inside the if block:

```cpp
const half2 two2 = __float2half2_rn(two);
const half2 one2 = __float2half2_rn(one);
const float2* input_cast = reinterpret_cast<const float2*>(input);
const float2* bias_cast = reinterpret_cast<const float2*>(bias);
float2* output_cast = reinterpret_cast<float2*>(output);
```

That could save computation in some cases. A similar change can be made in the other functions.
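A sketch of that suggestion, assuming the kernel guards its vectorized path with a bounds check on the thread's element index (the exact guard condition and constants here are assumptions, not the actual onnxruntime code):

```cuda
#include <cuda_fp16.h>

template <unsigned TPB>
__global__ void FastGeluKernel4Bias(const half2 a, const half2 b, const half2 c,
                                    int input_length, int bias_length,
                                    const half* input, const half* bias, half* output) {
  const int idx = blockIdx.x * TPB + threadIdx.x;
  if (idx < input_length / 2) {  // guard condition is an assumption
    // Constant setup and pointer casts moved inside the guarded block,
    // so threads that fail the bounds check skip this work entirely.
    const half2 two2 = __float2half2_rn(2.0f);
    const half2 one2 = __float2half2_rn(1.0f);
    const float2* input_cast = reinterpret_cast<const float2*>(input);
    const float2* bias_cast = reinterpret_cast<const float2*>(bias);
    float2* output_cast = reinterpret_cast<float2*>(output);
    // ... FastGelu computation using the vectorized loads ...
  }
}
```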
The perf numbers in the following table were collected on MI200.
/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline

/azp run Windows GPU TensorRT CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, onnxruntime-python-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

@tianleiwu @weixingzhang: Any more comments?

Azure Pipelines successfully started running 6 pipeline(s).

Azure Pipelines successfully started running 10 pipeline(s).
Description
Optimized LaunchFastGeluKernel for fp16 on AMD GPUs.
Motivation and Context
A performance gap was observed in the FastGelu kernels when running the microbenchmark on MI200 vs. A100. With the optimized kernels, we see a significant performance improvement compared to A100.
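The core idea of the change can be sketched as follows. This is a simplified, hypothetical kernel, not the actual LaunchFastGeluKernel: each thread issues one 8-byte `float2` load covering four `half` values, processes them as two `half2` lanes, and issues one 8-byte store. The tanh-approximation constants follow the standard FastGelu formula; the bias-add path is omitted for brevity.

```cuda
#include <cuda_fp16.h>

// Simplified sketch of vectorized fp16 loads (assumes n is a multiple of 4
// and the input/output pointers are 8-byte aligned).
__global__ void FastGeluFp16x4(const half* input, half* output, int n) {
  const int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n / 4) {
    const float2* input_cast = reinterpret_cast<const float2*>(input);
    float2* output_cast = reinterpret_cast<float2*>(output);

    float2 v = input_cast[idx];               // one vectorized 8-byte load
    half2* h = reinterpret_cast<half2*>(&v);  // view as two half2 lanes
    for (int i = 0; i < 2; ++i) {
      // FastGelu tanh approximation, two elements at a time in fp32.
      float2 x = __half22float2(h[i]);
      x.x = 0.5f * x.x * (1.0f + tanhf(0.7978845608f * (x.x + 0.044715f * x.x * x.x * x.x)));
      x.y = 0.5f * x.y * (1.0f + tanhf(0.7978845608f * (x.y + 0.044715f * x.y * x.y * x.y)));
      h[i] = __float22half2_rn(x);
    }
    output_cast[idx] = v;                     // one vectorized 8-byte store
  }
}
```

Compared with scalar `half` accesses, the wider transactions reduce the number of memory instructions per element, which is where the fp16 bandwidth win on MI200 comes from.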