Update cmake_cuda_architecture to control package size #23671

Merged: yf711 merged 31 commits into main from yifanl/cmake_arch on Feb 21, 2025
Conversation

@yf711 (Contributor) commented Feb 12, 2025

Description

Action item:

  • ~~Add LTO support when CUDA 12.8 & Relocatable Device Code (RDC)/separate_compilation are enabled, to reduce potential perf regression~~ LTO needs further testing

  • Reduce nuget/whl package size by selecting devices & their CUDA binary/PTX assembly during the ORT build;

    • make sure ORT nuget package < 250 MB, python wheel < 300 MB

    • Suggest creating an internal repo to publish pre-built packages with Blackwell sm100/120 SASS and sm120 PTX, such as [onnxruntime-blackwell](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/onnxruntime-blackwell), since the package size would be much larger than the nuget/pypi repo limits

  • Considering the most popular datacenter/consumer GPUs, here's the cuda_arch list for linux/windows:

    • With this change, perf on next release ORT is optimal on Linux with Tesla P100 (sm60), V100 (sm70), T4 (sm75), A100 (sm80), A10 (sm86, py whl), H100 (sm90); on Windows with GTX 980 (sm52), GTX 1080 (sm61), RTX 2080 (sm75), RTX 3090 (sm86), RTX 4090 (sm89). Other newer architecture GPUs are compatible.
  | OS | cmake_cuda_architecture | package size |
  | ------------- | ------------------------------------------ | ------------ |
  | Linux nupkg | 60-real;70-real;75-real;80-real;90 | 215 MB |
  | Linux whl | 60-real;70-real;75-real;80-real;86-real;90 | 268 MB |
  | Windows nupkg | 52-real;61-real;75-real;86-real;89-real;90-virtual | 197 MB |
  | Windows whl | 52-real;61-real;75-real;86-real;89-real;90-virtual | 204 MB |
  • [TODO] Validate on Windows CUDA CI pipeline with cu128
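For readers unfamiliar with the `-real`/`-virtual` suffixes used in the lists above: in CMake's `CMAKE_CUDA_ARCHITECTURES`, `<arch>-real` embeds SASS (device binary) only, `<arch>-virtual` embeds PTX only, and a bare `<arch>` embeds both. A minimal illustrative parser (a hypothetical helper, not part of ORT or CMake) makes the semantics concrete:

```python
# Illustrative sketch (not ORT code): interpret a CMAKE_CUDA_ARCHITECTURES
# string. CMake semantics: "<arch>-real" -> SASS only, "<arch>-virtual" ->
# PTX only, bare "<arch>" -> both SASS and PTX.
def parse_cuda_archs(archs: str):
    out = []
    for entry in archs.split(";"):
        if entry.endswith("-real"):
            out.append((int(entry[:-5]), "sass"))      # device binary only
        elif entry.endswith("-virtual"):
            out.append((int(entry[:-8]), "ptx"))       # PTX only, JIT-compiled on newer GPUs
        else:
            out.append((int(entry), "sass+ptx"))       # both
    return out

# The Linux nupkg list from the table above: SASS for sm60..sm80, and
# SASS+PTX for sm90 so that newer architectures can JIT the PTX.
print(parse_cuda_archs("60-real;70-real;75-real;80-real;90"))
```

This is why "other newer architecture GPUs are compatible": the trailing bare or `-virtual` entry ships PTX that the driver can JIT-compile for architectures released after the build.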

Motivation and Context

Addresses topics discussed in #23562 and #23309

Stats

| libonnxruntime_providers_cuda lib size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linux | 446 MB | 241 MB | 362 MB | 482 MB | N/A | 422 MB | 301 MB | |
| Windows | 417 MB | 224 MB | 338 MB | 450 MB | 279 MB | N/A | | 292 MB |

| nupkg size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linux | 287 MB | TBD | 224 MB | 299 MB | | | 197 MB | N/A |
| Windows | 264 MB | TBD | 205 MB | 274 MB | | | N/A | 188 MB |

| whl size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linux | 294 MB | 154 MB | TBD | TBD | N/A | 278 MB | 203 MB | N/A |
| Windows | 271 MB | 142 MB | TBD | 280 MB | 184 MB | N/A | N/A | 194 MB |

Reference

https://developer.nvidia.com/cuda-gpus
[Improving GPU Application Performance with NVIDIA CUDA 11.2 Device Link Time Optimization](https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/)
[PTX Compatibility](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#ptx-compatibility)
[Application Compatibility on the NVIDIA Ada GPU Architecture](https://docs.nvidia.com/cuda/ada-compatibility-guide/#application-compatibility-on-the-nvidia-ada-gpu-architecture)
[Software Migration Guide for NVIDIA Blackwell RTX GPUs: A Guide to CUDA 12.8, PyTorch, TensorRT, and Llama.cpp](https://forums.developer.nvidia.com/t/software-migration-guide-for-nvidia-blackwell-rtx-gpus-a-guide-to-cuda-12-8-pytorch-tensorrt-and-llama-cpp/321330)

Track some failed/unfinished experiments to control package size:

  1. Building ORT with `CUDNN_FRONTEND_SKIP_JSON_LIB=ON` doesn't help much with package size;
  2. ORT packaging uses 7z to pack the package, which can only use zip's deflate compression. In that format, setting the compression ratio to ultra (`-mx=9`) doesn't help much to control size (7z's native LZMA compression is much better, but zip archives with LZMA are not supported by nuget/pypi);
  3. Simply replacing `sm_xx` with `lto_xx` would increase the CUDA EP library size by ~50% (perf not tested yet). This needs further validation.
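Point 2 can be reproduced with Python's `zipfile` module, which supports both deflate and LZMA as zip compression methods. A quick sketch (the payload and file names are made up for illustration) compares the two on the same data:

```python
import io
import zipfile

# Compare zip deflate (the only method nuget/pypi feeds accept) against
# LZMA inside a zip container, on the same payload. The payload here is
# synthetic; real CUDA kernel binaries compress less dramatically.
payload = b"onnxruntime-cuda-kernel-" * 4096  # ~96 KB of repetitive bytes

def zipped_size(method: int) -> int:
    """Size of a zip archive holding `payload` under the given method."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=method) as zf:
        zf.writestr("lib.bin", payload)
    return len(buf.getvalue())

print("deflate:", zipped_size(zipfile.ZIP_DEFLATED), "bytes")
print("lzma:   ", zipped_size(zipfile.ZIP_LZMA), "bytes")
```

On typical inputs LZMA yields noticeably smaller archives than deflate, which is why 7z's native format would help with the size budget if the package feeds accepted it.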

@tianleiwu (Contributor) commented Feb 13, 2025

How about we cover most data center GPUs for Linux, and most consumer GPUs for Windows, in the nuget package:
Linux: 75-real;80-real;90-real;90-virtual
Windows: 61-real;75-real;86-real;120-virtual

The python package can be larger, so we can add more architectures there.

@yf711 yf711 marked this pull request as ready for review February 19, 2025 01:50
tianleiwu
tianleiwu previously approved these changes Feb 20, 2025
@yf711 yf711 changed the title from "Control package size and add LTO support" to "Update cmake_cuda_architecture to control package size" Feb 21, 2025
@yf711 yf711 merged commit 1b0a2ba into main Feb 21, 2025
99 of 104 checks passed
@yf711 yf711 deleted the yifanl/cmake_arch branch February 21, 2025 18:18
guschmue pushed a commit that referenced this pull request Mar 6, 2025
ashrit-ms pushed a commit that referenced this pull request Mar 17, 2025