Optimize FastGelu with float2 and float4 vectorized kernels on ROCm #11491
zhangyaobit merged 21 commits into microsoft:master
We might be able to use the approach in onnxruntime/onnxruntime/core/providers/cuda/nn/dropout_impl.cu (lines 93 to 117 in 6a7d3de).

@tianleiwu, could you please help review this PR? Thanks.
Are there any problems with using aligned_vector here? It looks like we should use it, as Peixuan previously recommended. I believe that with aligned_vector we don't need to maintain two copies of mostly similar code for float2 and float4; a single templated kernel can be instantiated twice. Example code can be found here: onnxruntime/core/providers/cuda/math/softmax_blockwise_impl.cuh:270
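To illustrate the suggestion above, here is a minimal host-side C++ sketch of the single-template idea. It is not the code from this PR or from softmax_blockwise_impl.cuh. `AlignedVector`, `FastGeluScalar`, `FastGeluVectorized`, and `FastGelu` are hypothetical names, the outer loop stands in for GPU threads, and the formula is the common tanh approximation of GELU, which is assumed here rather than copied from the kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for an aligned_vector helper: N values of T,
// aligned so that one instance can be moved with a single wide load/store.
template <typename T, int N>
struct alignas(sizeof(T) * N) AlignedVector {
  T val[N];
};

// Common tanh approximation of GELU (an assumption, not taken from the PR).
inline float FastGeluScalar(float x) {
  const float kAlpha = 0.7978845608f;  // sqrt(2 / pi)
  const float kBeta = 0.044715f;
  return 0.5f * x * (1.0f + std::tanh(kAlpha * (x + kBeta * x * x * x)));
}

// One template body covers every width. Each outer iteration plays the role
// of one GPU thread that loads N elements, computes, and stores N elements.
template <int N>
void FastGeluVectorized(const float* input, float* output, std::size_t count) {
  assert(count % N == 0);  // the caller picks N so that this holds
  using Vec = AlignedVector<float, N>;
  for (std::size_t i = 0; i < count; i += N) {
    Vec in, out;
    std::memcpy(&in, input + i, sizeof(Vec));  // models one vector load
    for (int k = 0; k < N; ++k) {
      out.val[k] = FastGeluScalar(in.val[k]);
    }
    std::memcpy(output + i, &out, sizeof(Vec));  // models one vector store
  }
}

// Picks the widest vector size that divides the element count and
// returns the width that was used.
inline int FastGelu(const float* input, float* output, std::size_t count) {
  if (count % 4 == 0) {
    FastGeluVectorized<4>(input, output, count);  // float4-like path
    return 4;
  }
  if (count % 2 == 0) {
    FastGeluVectorized<2>(input, output, count);  // float2-like path
    return 2;
  }
  FastGeluVectorized<1>(input, output, count);  // scalar fallback
  return 1;
}
```

In a real CUDA or HIP kernel the `memcpy` calls would typically become aligned vector loads and stores through a pointer to the aligned type, and the outer loop would be replaced by the thread index. The point of the sketch is only that the float2 and float4 variants can share one template body.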

/azp run Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline, Linux Nuphar CI Pipeline, Linux OpenVINO CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows CPU CI Pipeline, Windows GPU CI Pipeline

Azure Pipelines successfully started running 10 pipeline(s).

/azp run Windows GPU TensorRT CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, onnxruntime-python-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline, orttraining-ortmodule-distributed

Azure Pipelines successfully started running 6 pipeline(s).

The failed tests seem to be related to the opset 17 changes. Could you please merge your code with master again and see whether the failed tests go away?

Commenter does not have sufficient privileges for PR 11491 in repo microsoft/onnxruntime
Description:
Optimized FastGeluKernel on ROCm with float2 and float4 vectorized kernels.
This PR is related to the earlier PR #11390.
Motivation and Context