Update cmake_cuda_architecture to control package size #23671

Merged: yf711 merged 31 commits into main from yifanl/cmake_arch on Feb 21, 2025
Conversation

@yf711 (Contributor) commented Feb 12, 2025

Description

Action item:

  • ~~Add LTO support when CUDA 12.8 & Relocatable Device Code (RDC)/separate_compilation are enabled, to reduce potential perf regression~~ LTO needs further testing

  • Reduce nuget/whl package size by selecting devices & their CUDA binary/PTX assembly during the ORT build;

    • make sure ORT nuget package < 250 MB, python wheel < 300 MB

    • Suggest creating an internal repo to publish pre-built packages with Blackwell sm100/120 SASS and sm120 PTX, such as [onnxruntime-blackwell](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/onnxruntime-blackwell), since the package size would be much larger than the nuget/pypi repo limits

  • Considering the most popular datacenter/consumer GPUs, here's the cuda_arch list for linux/windows:

    • With this change, perf on next release ORT is optimal on Linux with Tesla P100 (sm60), V100 (sm70), T4 (sm75), A100 (sm80), A10 (sm86, py whl), H100 (sm90); on Windows with GTX 980 (sm52), GTX 1080 (sm61), RTX 2080 (sm75), RTX 3090 (sm86), RTX 4090 (sm89). Other newer architecture GPUs are compatible.
  | OS | cmake_cuda_architecture | package size |
  | ------------- | ------------------------------------------ | ------------ |
  | Linux nupkg | 60-real;70-real;75-real;80-real;90 | 215 MB |
  | Linux whl | 60-real;70-real;75-real;80-real;86-real;90 | 268 MB |
  | Windows nupkg | 52-real;61-real;75-real;86-real;89-real;90-virtual | 197 MB |
  | Windows whl | 52-real;61-real;75-real;86-real;89-real;90-virtual | 204 MB |
  • [TODO] Validate on Windows CUDA CI pipeline with cu128
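For readers unfamiliar with the `-real`/`-virtual` suffixes used in the lists above: in CMake's `CMAKE_CUDA_ARCHITECTURES`, `<arch>-real` embeds SASS (device binary) only, `<arch>-virtual` embeds PTX only, and a bare `<arch>` embeds both. A minimal illustrative parser (a hypothetical helper, not part of ORT or CMake) makes the semantics concrete:

```python
# Illustrative sketch (not ORT code): interpret a CMAKE_CUDA_ARCHITECTURES
# string. CMake semantics: "<arch>-real" -> SASS only, "<arch>-virtual" ->
# PTX only, bare "<arch>" -> both SASS and PTX.
def parse_cuda_archs(archs: str):
    out = []
    for entry in archs.split(";"):
        if entry.endswith("-real"):
            out.append((int(entry[:-5]), "sass"))      # device binary only
        elif entry.endswith("-virtual"):
            out.append((int(entry[:-8]), "ptx"))       # PTX only, JIT-compiled on newer GPUs
        else:
            out.append((int(entry), "sass+ptx"))       # both
    return out

# The Linux nupkg list from the table above: SASS for sm60..sm80, and
# SASS+PTX for sm90 so that newer architectures can JIT the PTX.
print(parse_cuda_archs("60-real;70-real;75-real;80-real;90"))
```

This is why "other newer architecture GPUs are compatible": the trailing bare or `-virtual` entry ships PTX that the driver can JIT-compile for architectures released after the build.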

Motivation and Context

Addresses topics discussed in #23562 and #23309

Stats

| libonnxruntime_providers_cuda lib size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linux | 446 MB | 241 MB | 362 MB | 482 MB | N/A | 422 MB | 301 MB | |
| Windows | 417 MB | 224 MB | 338 MB | 450 MB | 279 MB | N/A | | 292 MB |

| nupkg size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linux | 287 MB | TBD | 224 MB | 299 MB | | | 197 MB | N/A |
| Windows | 264 MB | TBD | 205 MB | 274 MB | | | N/A | 188 MB |

| whl size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linux | 294 MB | 154 MB | TBD | TBD | N/A | 278 MB | 203 MB | N/A |
| Windows | 271 MB | 142 MB | TBD | 280 MB | 184 MB | N/A | N/A | 194 MB |

Reference

https://developer.nvidia.com/cuda-gpus
[Improving GPU Application Performance with NVIDIA CUDA 11.2 Device Link Time Optimization](https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/)
[PTX Compatibility](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#ptx-compatibility)
[Application Compatibility on the NVIDIA Ada GPU Architecture](https://docs.nvidia.com/cuda/ada-compatibility-guide/#application-compatibility-on-the-nvidia-ada-gpu-architecture)
[Software Migration Guide for NVIDIA Blackwell RTX GPUs: A Guide to CUDA 12.8, PyTorch, TensorRT, and Llama.cpp](https://forums.developer.nvidia.com/t/software-migration-guide-for-nvidia-blackwell-rtx-gpus-a-guide-to-cuda-12-8-pytorch-tensorrt-and-llama-cpp/321330)

Track some failed/unfinished experiments to control package size:

  1. Building ORT with `CUDNN_FRONTEND_SKIP_JSON_LIB=ON` doesn't help much with package size;
  2. ORT packaging uses 7z to pack the package, which can only use zip's deflate compression. In that format, setting the compression ratio to ultra (`-mx=9`) doesn't help much to control size (7z's native LZMA compression is much better, but zip archives with LZMA are not supported by nuget/pypi);
  3. Simply replacing `sm_xx` with `lto_xx` would increase the CUDA EP library size by ~50% (perf not tested yet). This needs further validation.
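Point 2 can be reproduced with Python's `zipfile` module, which supports both deflate and LZMA as zip compression methods. A quick sketch (the payload and file names are made up for illustration) compares the two on the same data:

```python
import io
import zipfile

# Compare zip deflate (the only method nuget/pypi feeds accept) against
# LZMA inside a zip container, on the same payload. The payload here is
# synthetic; real CUDA kernel binaries compress less dramatically.
payload = b"onnxruntime-cuda-kernel-" * 4096  # ~96 KB of repetitive bytes

def zipped_size(method: int) -> int:
    """Size of a zip archive holding `payload` under the given method."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=method) as zf:
        zf.writestr("lib.bin", payload)
    return len(buf.getvalue())

print("deflate:", zipped_size(zipfile.ZIP_DEFLATED), "bytes")
print("lzma:   ", zipped_size(zipfile.ZIP_LZMA), "bytes")
```

On typical inputs LZMA yields noticeably smaller archives than deflate, which is why 7z's native format would help with the size budget if the package feeds accepted it.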

@tianleiwu (Contributor) commented Feb 13, 2025

How about we cover most data center GPUs for Linux, and most consumer GPUs for Windows, in the nuget package:
Linux: 75-real;80-real;90-real;90-virtual
Windows: 61-real;75-real;86-real;120-virtual

The python package can be larger, so we can add more architectures there.

@yf711 yf711 marked this pull request as ready for review February 19, 2025 01:50
tianleiwu
tianleiwu previously approved these changes Feb 20, 2025
@yf711 yf711 changed the title from "Control package size and add LTO support" to "Update cmake_cuda_architecture to control package size" Feb 21, 2025
@yf711 yf711 merged commit 1b0a2ba into main Feb 21, 2025
99 of 104 checks passed
@yf711 yf711 deleted the yifanl/cmake_arch branch February 21, 2025 18:18
guschmue pushed a commit that referenced this pull request Mar 6, 2025
ashrit-ms pushed a commit that referenced this pull request Mar 17, 2025