-
Notifications
You must be signed in to change notification settings - Fork 3.6k
Reduce Python and Nuget GPU package size #26002
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
The size of python wheel for Linux is 304 MB with "80-real;86-real;90a-real;90a-virtual" CUDA architecture enabled, which is still slightly over 300 MB size limit. SM80 (A100), SM86 (A10), SM90 (H100) seems to be the main GPUs our customers are using, and we can't remove them from ORT support list. |
|
Discussed offline that we might also want to reduce some heaviest cuda kernels, i.e. beam_search_topk. if (k <= 8) {
TopKLauncher(8)
} else {
ORT_THROW("K>8 is not supported for beam search");
}For Linux, Update: We don't need to modify beam_search_topk as removing FPA_INTB_GEMM can give us space back. |
### Description The package size limit for PyPI and Nuget are: - python package size under 300MB - Nuget package size under 250MB To meet the size limit, this PR firstly removes some old GPU arch support in CMAKE_CUDA_ARCHITECTURE. Secondly, it removes the FPA_INTB_GEMM support in Linux Python wheel. #### Python wheel | OS | cmake_cuda_architecture | CUDA kernel removal |Package size | Under 300MB| |---------|--------------------------------------------------------|-|-------------|---| | Linux | 60-real;70-real;75-real;80-real;86-real;90a-real;90a-virtual | |341 MB |No (original)| | Linux | 70-real;75-real;80-real;86-real;90a-real;90a-virtual | | 329 MB |No| | Linux | 75-real;80-real;86-real;90a-real;90a-virtual | |319 MB |No| | Linux | 80-real;86-real;90a-real;90a-virtual | |304 MB |No| | Linux | 60-real;70-real;75-real;80-real;86-real;90a-real;90a-virtual. | FPA_INTB_GEMM|287 MB |Yes| | Windows | 52-real;61-real;75-real;86-real;89-real;90a-virtual | | 272 MB |Yes (original)| #### Nuget | OS | cmake_cuda_architecture | CUDA kernel removal |Package size |Under 250MB| |---------|--------------------------------------------------------|---|--------------|---| | Linux | 60-real;70-real;75-real;80-real;90a-real;90a-virtual | |276 MB |No (original)| | Linux | 75-real;80-real;90a-real;90a-virtual | |253 MB |No| | Linux | 60-real;70-real;75-real;80-real;90a-real;90a-virtual |FPA_INTB_GEMM| 230 MB |Yes| | Windows | 52-real;61-real;75-real;86-real;89-real;90a-virtual || 264 MB |No (original)| | Windows | 61-real;75-real;86-real;89-real;90a-virtual || 254 MB |No| | Windows | 75-real;86-real;89-real;90a-virtual || 242 MB |Yes| ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->
|
This PR has been cherry-picked into the |
Users with RTX 5090 GPUs are experiencing runtime errors when using onnxruntime-gpu: ``` [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Slice node. Name:'Slice_34' Status Message: CUDA error cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device ``` This occurs because RTX 5090 uses CUDA compute architecture 12.0 (SM 12.0). The incompatibility of `onnxruntime-gpu` 1.23 was built with `90a-virtual`. The `90a` architecture is a specialized, non-forward-compatible version of the Hopper architecture, making it incompatible with future GPU generations like Blackwell. This change will revert `90a-virtual` back to `90-virtual` as used in 1.22. This shall bring back the compatibility in Blackwell GPU. The FPA_INTB_GEMM is disabled by default. It need some extra work to make it compatible with 90-virtual and no 90a-real use case. Related: #26002 #26226 #26181
Users with RTX 5090 GPUs are experiencing runtime errors when using onnxruntime-gpu: ``` [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Slice node. Name:'Slice_34' Status Message: CUDA error cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device ``` This occurs because RTX 5090 uses CUDA compute architecture 12.0 (SM 12.0). The incompatibility of `onnxruntime-gpu` 1.23 was built with `90a-virtual`. The `90a` architecture is a specialized, non-forward-compatible version of the Hopper architecture, making it incompatible with future GPU generations like Blackwell. This change will revert `90a-virtual` back to `90-virtual` as used in 1.22. This shall bring back the compatibility in Blackwell GPU. The FPA_INTB_GEMM is disabled by default. It need some extra work to make it compatible with 90-virtual and no 90a-real use case. Related: #26002 #26226 #26181
Users with RTX 5090 GPUs are experiencing runtime errors when using onnxruntime-gpu: ``` [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Slice node. Name:'Slice_34' Status Message: CUDA error cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device ``` This occurs because RTX 5090 uses CUDA compute architecture 12.0 (SM 12.0). The incompatibility of `onnxruntime-gpu` 1.23 was built with `90a-virtual`. The `90a` architecture is a specialized, non-forward-compatible version of the Hopper architecture, making it incompatible with future GPU generations like Blackwell. This change will revert `90a-virtual` back to `90-virtual` as used in 1.22. This shall bring back the compatibility in Blackwell GPU. The FPA_INTB_GEMM is disabled by default. It need some extra work to make it compatible with 90-virtual and no 90a-real use case. Related: #26002 #26226 #26181
Users with RTX 5090 GPUs are experiencing runtime errors when using onnxruntime-gpu: ``` [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Slice node. Name:'Slice_34' Status Message: CUDA error cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device ``` This occurs because RTX 5090 uses CUDA compute architecture 12.0 (SM 12.0). The incompatibility of `onnxruntime-gpu` 1.23 was built with `90a-virtual`. The `90a` architecture is a specialized, non-forward-compatible version of the Hopper architecture, making it incompatible with future GPU generations like Blackwell. This change will revert `90a-virtual` back to `90-virtual` as used in 1.22. This shall bring back the compatibility in Blackwell GPU. The FPA_INTB_GEMM is disabled by default. It need some extra work to make it compatible with 90-virtual and no 90a-real use case. Related: #26002 #26226 #26181
### Description The package size limit for PyPI and Nuget are: - python package size under 300MB - Nuget package size under 250MB To meet the size limit, this PR firstly removes some old GPU arch support in CMAKE_CUDA_ARCHITECTURE. Secondly, it removes the FPA_INTB_GEMM support in Linux Python wheel. #### Python wheel | OS | cmake_cuda_architecture | CUDA kernel removal |Package size | Under 300MB| |---------|--------------------------------------------------------|-|-------------|---| | Linux | 60-real;70-real;75-real;80-real;86-real;90a-real;90a-virtual | |341 MB |No (original)| | Linux | 70-real;75-real;80-real;86-real;90a-real;90a-virtual | | 329 MB |No| | Linux | 75-real;80-real;86-real;90a-real;90a-virtual | |319 MB |No| | Linux | 80-real;86-real;90a-real;90a-virtual | |304 MB |No| | Linux | 60-real;70-real;75-real;80-real;86-real;90a-real;90a-virtual. | FPA_INTB_GEMM|287 MB |Yes| | Windows | 52-real;61-real;75-real;86-real;89-real;90a-virtual | | 272 MB |Yes (original)| #### Nuget | OS | cmake_cuda_architecture | CUDA kernel removal |Package size |Under 250MB| |---------|--------------------------------------------------------|---|--------------|---| | Linux | 60-real;70-real;75-real;80-real;90a-real;90a-virtual | |276 MB |No (original)| | Linux | 75-real;80-real;90a-real;90a-virtual | |253 MB |No| | Linux | 60-real;70-real;75-real;80-real;90a-real;90a-virtual |FPA_INTB_GEMM| 230 MB |Yes| | Windows | 52-real;61-real;75-real;86-real;89-real;90a-virtual || 264 MB |No (original)| | Windows | 61-real;75-real;86-real;89-real;90a-virtual || 254 MB |No| | Windows | 75-real;86-real;89-real;90a-virtual || 242 MB |Yes| ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. -->
…osoft#26230) Users with RTX 5090 GPUs are experiencing runtime errors when using onnxruntime-gpu: ``` [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Slice node. Name:'Slice_34' Status Message: CUDA error cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device ``` This occurs because RTX 5090 uses CUDA compute architecture 12.0 (SM 12.0). The incompatibility of `onnxruntime-gpu` 1.23 was built with `90a-virtual`. The `90a` architecture is a specialized, non-forward-compatible version of the Hopper architecture, making it incompatible with future GPU generations like Blackwell. This change will revert `90a-virtual` back to `90-virtual` as used in 1.22. This shall bring back the compatibility in Blackwell GPU. The FPA_INTB_GEMM is disabled by default. It need some extra work to make it compatible with 90-virtual and no 90a-real use case. Related: microsoft#26002 microsoft#26226 microsoft#26181
Description
The package size limit for PyPI and Nuget are:
To meet the size limit,
this PR firstly removes some old GPU arch support in CMAKE_CUDA_ARCHITECTURE.
Secondly, it removes the FPA_INTB_GEMM support in Linux Python wheel.
Python wheel
Nuget
Motivation and Context