Reduce Python and Nuget GPU package size #26002

chilo-ms · 2025-09-09T23:22:28Z

Description

The package size limits for PyPI and NuGet are:

- Python package size under 300 MB
- NuGet package size under 250 MB

To meet these limits, this PR first removes support for some older GPU architectures from CMAKE_CUDA_ARCHITECTURES.
Second, it removes FPA_INTB_GEMM support from the Linux Python wheel.

Python wheel

| OS | CMAKE_CUDA_ARCHITECTURES | CUDA kernel removal | Package size | Under 300 MB |
|---|---|---|---|---|
| Linux | 60-real;70-real;75-real;80-real;86-real;90a-real;90a-virtual | | 341 MB | No (original) |
| Linux | 70-real;75-real;80-real;86-real;90a-real;90a-virtual | | 329 MB | No |
| Linux | 75-real;80-real;86-real;90a-real;90a-virtual | | 319 MB | No |
| Linux | 80-real;86-real;90a-real;90a-virtual | | 304 MB | No |
| Linux | 60-real;70-real;75-real;80-real;86-real;90a-real;90a-virtual | FPA_INTB_GEMM | 287 MB | Yes |
| Windows | 52-real;61-real;75-real;86-real;89-real;90a-virtual | | 272 MB | Yes (original) |
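
The savings implied by the Linux rows above can be worked out with a little arithmetic. The following is a minimal sketch: the sizes are copied from the table, while `LIMIT_MB`, `linux_wheel_sizes`, and `savings` are illustrative names and not part of the build scripts.

```python
LIMIT_MB = 300  # PyPI size limit quoted in this PR

# (oldest architecture still in the list, wheel size in MB), from the table.
linux_wheel_sizes = [
    ("60-real", 341),  # original list
    ("70-real", 329),  # 60-real removed
    ("75-real", 319),  # 70-real removed
    ("80-real", 304),  # 75-real removed
]


def savings(rows):
    """Return (removed_arch, saved_mb) for each consecutive pair of rows."""
    return [
        (rows[i][0], rows[i][1] - rows[i + 1][1])
        for i in range(len(rows) - 1)
    ]


per_arch = savings(linux_wheel_sizes)
for arch, saved in per_arch:
    print(f"dropping {arch} saves {saved} MB")

# Removing FPA_INTB_GEMM while keeping the original list gives 287 MB.
fpa_saving = 341 - 287
print(f"dropping FPA_INTB_GEMM saves {fpa_saving} MB; "
      f"under limit: {287 < LIMIT_MB}")
```

By these numbers, dropping three architecture generations recovers 37 MB in total and still leaves the wheel at 304 MB, whereas removing FPA_INTB_GEMM alone recovers 54 MB.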

Nuget

| OS | CMAKE_CUDA_ARCHITECTURES | CUDA kernel removal | Package size | Under 250 MB |
|---|---|---|---|---|
| Linux | 60-real;70-real;75-real;80-real;90a-real;90a-virtual | | 276 MB | No (original) |
| Linux | 75-real;80-real;90a-real;90a-virtual | | 253 MB | No |
| Linux | 60-real;70-real;75-real;80-real;90a-real;90a-virtual | FPA_INTB_GEMM | 230 MB | Yes |
| Windows | 52-real;61-real;75-real;86-real;89-real;90a-virtual | | 264 MB | No (original) |
| Windows | 61-real;75-real;86-real;89-real;90a-virtual | | 254 MB | No |
| Windows | 75-real;86-real;89-real;90a-virtual | | 242 MB | Yes |

Motivation and Context

tools/ci_build/github/linux/build_cuda_c_api_package.sh

tools/ci_build/github/linux/build_tensorrt_c_api_package.sh

tools/ci_build/github/linux/build_linux_python_package.sh

chilo-ms · 2025-09-11T17:50:03Z

The Linux Python wheel is 304 MB with the "80-real;86-real;90a-real;90a-virtual" CUDA architectures enabled, which is still slightly over the 300 MB size limit.
https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=936299&view=artifacts&pathAsName=false&type=publishedArtifacts

SM80 (A100), SM86 (A10), and SM90 (H100) seem to be the main GPUs our customers use, so we can't remove them from the ORT support list.
Another option is to sacrifice performance: remove 86-real or 90-real (SASS) and add the corresponding virtual architecture (PTX) for compatibility.
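
The trade-off between real (SASS) and virtual (PTX) entries is easier to see when the list is split by kind. The following is a minimal sketch, assuming CMake's documented suffix convention for CMAKE_CUDA_ARCHITECTURES (`-real` emits SASS only, `-virtual` emits PTX only, and no suffix emits both). `split_arch_list` is a hypothetical helper, not something in the ORT build scripts.

```python
def split_arch_list(arch_list):
    """Split a CMAKE_CUDA_ARCHITECTURES string into SASS and PTX targets."""
    sass, ptx = [], []
    for entry in arch_list.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        if entry.endswith("-real"):
            sass.append(entry[: -len("-real")])
        elif entry.endswith("-virtual"):
            ptx.append(entry[: -len("-virtual")])
        else:
            # No suffix: CMake generates both SASS and PTX for this arch.
            sass.append(entry)
            ptx.append(entry)
    return sass, ptx


sass, ptx = split_arch_list("80-real;86-real;90a-real;90a-virtual")
print("SASS:", sass)
print("PTX: ", ptx)
```

Each SASS entry adds precompiled machine code to the binary, which is where most of the size goes. A PTX entry is compiled at load time instead, which costs startup time and possibly performance.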

chilo-ms · 2025-09-15T23:08:26Z

We discussed offline that we might also want to trim some of the heaviest CUDA kernels, e.g. beam_search_topk.
That would mean keeping only the k <= 8 branch:

  if (k <= 8) {
    TopKLauncher(8)
  } else {
    ORT_THROW("K>8 is not supported for beam search");
  }

For Linux, the CUDA EP library is now 448 MB and the wheel is ~302 MB.

Update: We don't need to modify beam_search_topk, because removing FPA_INTB_GEMM gives us the space back.
https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=942363&view=artifacts&pathAsName=false&type=publishedArtifacts

snnn · 2025-09-19T19:23:47Z

This PR has been cherry-picked into the rel-1.23.0 branch in PR #26087. Removing the release:1.23.0 label.

From the description of follow-up PR #26230 ([CUDA] replace 90a-virtual by 90-virtual for forward compatible):

Users with RTX 5090 GPUs are experiencing runtime errors when using onnxruntime-gpu:

```
[ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Slice node. Name:'Slice_34' Status Message: CUDA error cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device
```

This occurs because the RTX 5090 uses CUDA compute architecture 12.0 (SM 12.0), while `onnxruntime-gpu` 1.23 was built with `90a-virtual`. The `90a` architecture is a specialized, non-forward-compatible version of the Hopper architecture, making it incompatible with future GPU generations like Blackwell.

This change reverts `90a-virtual` back to `90-virtual`, as used in 1.22, which should restore compatibility with Blackwell GPUs.

FPA_INTB_GEMM is disabled by default. It needs some extra work to be compatible with a build that uses 90-virtual and has no 90a-real.

Related: #26002 #26226 #26181
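
The failure mode can be reasoned about with a simplified model of CUDA's compatibility rules. This is a sketch under stated assumptions, not NVIDIA's or ORT's actual loader logic. It assumes that SASS for sm_XY runs on devices with the same major version and an equal or higher minor version, that PTX for compute_XY can be JIT-compiled on any device of capability XY or higher, and that architecture-specific targets (the `a` suffix) run only on exactly that capability. `parse_arch` and `has_kernel_image` are hypothetical names, and the SASS list is illustrative rather than the exact shipped one.

```python
def parse_arch(arch):
    """Turn a string such as '90a' into (major, minor, arch_specific)."""
    specific = arch.endswith("a")
    digits = arch[:-1] if specific else arch
    return int(digits[:-1]), int(digits[-1]), specific


def has_kernel_image(sass, ptx, device):
    """Simplified check: can a device (major, minor) run any built target?"""
    dev_major, dev_minor = device
    for arch in sass:
        major, minor, specific = parse_arch(arch)
        if specific:
            # Arch-specific SASS: assumed to need an exact capability match.
            if (major, minor) == device:
                return True
        elif major == dev_major and minor <= dev_minor:
            return True
    for arch in ptx:
        major, minor, specific = parse_arch(arch)
        if specific:
            # Arch-specific PTX: assumed not to be forward compatible.
            if (major, minor) == device:
                return True
        elif (major, minor) <= device:
            return True
    return False


RTX_5090 = (12, 0)  # SM 12.0, as stated in the description above
sass = ["75", "80", "86", "90a"]  # illustrative list only

print(has_kernel_image(sass, ["90a"], RTX_5090))  # built with 90a-virtual
print(has_kernel_image(sass, ["90"], RTX_5090))   # built with 90-virtual
```

Under these assumptions, the build with `90a-virtual` offers no usable image to an SM 12.0 device, while the build with `90-virtual` can fall back to JIT-compiling the PTX.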

chilo-ms added 2 commits September 9, 2025 16:18

remove the oldest CUDA arch

b17b0a2

update

86f73d2

snnn requested a review from tianleiwu September 10, 2025 15:50

revmoe older arch in Linux python whl, Linux nuget and windows nuget

4ae3551

tianleiwu reviewed Sep 10, 2025

tools/ci_build/github/linux/build_cuda_c_api_package.sh (outdated, resolved)

tianleiwu reviewed Sep 10, 2025

tools/ci_build/github/linux/build_tensorrt_c_api_package.sh (outdated, resolved)

tianleiwu reviewed Sep 10, 2025

tools/ci_build/github/linux/build_linux_python_package.sh (outdated, resolved)

chilo-ms added 2 commits September 10, 2025 11:17

remove old GPU arch for windows nuget

c32103d

remove old GPU arch for python wheel

69ddc16

jywu-msft added the release:1.23.0 label Sep 15, 2025

reduce number of k support in beam search

64f7967

chilo-ms added 3 commits September 16, 2025 13:01

revert beam_search_topk

bc3f01c

remove FPA_INTB_GEMM for Linux python wheel

9b226b1

revert beam_search_topk

b28e74b

chilo-ms changed the title ~~Remove old CUDA arch in CMAKE_CUDA_ARCHITECTURES to reduce package size~~ Reduce Python and Nuget GPU package size Sep 16, 2025

tianleiwu previously approved these changes Sep 17, 2025

Add back support for SM75 on Linux and disable FPA_INTB_GEMM

5aeb874

chilo-ms dismissed tianleiwu’s stale review via 5aeb874 September 17, 2025 18:51

snnn approved these changes Sep 18, 2025

tianleiwu approved these changes Sep 18, 2025

chilo-ms merged commit fd35afb into main Sep 18, 2025
105 of 115 checks passed

chilo-ms deleted the chi/remove_cuda_arch branch September 18, 2025 21:01

snnn pushed a commit that referenced this pull request Sep 19, 2025

Cherry-pick: Reduce Python and Nuget GPU package size (#26002) (#26087)

2a034d5

Reduce Python and Nuget GPU package size (#26002) [CUDA] Add build flag onnxruntime_USE_FPA_INTB_GEMM (#25802)

snnn removed the release:1.23.0 label Sep 19, 2025

snnn mentioned this pull request Oct 1, 2025

no kernel image is available for execution on the device [rtx 5090 laptop, wan2.2 animate, DWPreprocessor, onnxruntime-gpu] #26181

Closed

tianleiwu mentioned this pull request Oct 3, 2025

[CUDA] replace 90a-virtual by 90-virtual for forward compatible #26230

Merged

XXXXRT666 mentioned this pull request Oct 12, 2025

Reduce ONNX Runtime GPU wheel size using fatbin compression #26282

Closed
