Update cmake_cuda_architecture to control package size #23671
Merged
Conversation
Contributor
How about we cover most data center GPUs for Linux, and most consumer GPUs for Windows, in the NuGet package? The Python package can be larger, so we can add more architectures there.
tianleiwu
reviewed
Feb 19, 2025
tools/ci_build/github/azure-pipelines/stages/py-gpu-packaging-stage.yml
tianleiwu
reviewed
Feb 20, 2025
tianleiwu
previously approved these changes
Feb 20, 2025
tianleiwu
reviewed
Feb 20, 2025
tianleiwu
reviewed
Feb 20, 2025
tianleiwu
reviewed
Feb 20, 2025
tianleiwu
approved these changes
Feb 21, 2025
guschmue
pushed a commit
that referenced
this pull request
Mar 6, 2025
### Description

Action items:

* ~~Add LTO support when CUDA 12.8 & Relocatable Device Code (RDC)/separate compilation are enabled, to reduce a potential perf regression.~~ LTO needs further testing.
* Reduce NuGet/wheel package size by selecting target devices and their CUDA binary (SASS)/PTX assembly during the ORT build:
  * Keep the ORT NuGet package under 250 MB and the Python wheel under 300 MB.
* Suggest creating an internal repo to publish a pre-built package with Blackwell sm100/sm120 SASS and sm120 PTX, e.g. [onnxruntime-blackwell](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/onnxruntime-blackwell), since that package would be much larger than the NuGet/PyPI size limits.
* Considering the most popular datacenter/consumer GPUs, here is the cuda_arch list for Linux/Windows. With this change, perf of the next ORT release is optimal on Linux with Tesla P100 (sm60), V100 (sm70), T4 (sm75), A100 (sm80), A10 (sm86, py whl), and H100 (sm90); and on Windows with GTX 980 (sm52), GTX 1080 (sm61), RTX 2080 (sm75), RTX 3090 (sm86), and RTX 4090 (sm89). GPUs of newer architectures remain compatible through PTX.

  | OS | cmake_cuda_architecture | package size |
  | ------------- | -------------------------------------------------- | ------------ |
  | Linux nupkg   | 60-real;70-real;75-real;80-real;90                 | 215 MB |
  | Linux whl     | 60-real;70-real;75-real;80-real;86-real;90         | 268 MB |
  | Windows nupkg | 52-real;61-real;75-real;86-real;89-real;90-virtual | 197 MB |
  | Windows whl   | 52-real;61-real;75-real;86-real;89-real;90-virtual | 204 MB |

* [TODO] Validate on the Windows CUDA CI pipeline with cu128.

### Motivation and Context

Addresses the topics discussed in #23562 and #23309.

#### Stats

| libonnxruntime_providers_cuda lib size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| -------------------------------------- | ------------- | -------------------------- | --------------------- | ------------------------------------------------ | ------------------ | ------------- | ------------------ | -------------------------- |
| Linux                                  | 446 MB        | 241 MB                     | 362 MB                | 482 MB                                           | N/A                | 422 MB        | 301 MB             |                            |
| Windows                                | 417 MB        | 224 MB                     | 338 MB                | 450 MB                                           | 279 MB             | N/A           |                    | 292 MB                     |

| nupkg size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| ---------- | ------------- | -------------------------- | --------------------- | ------------------------------------------------ | ------------------ | ------------- | ------------------ | -------------------------- |
| Linux      | 287 MB        | TBD                        | 224 MB                | 299 MB                                           |                    |               | 197 MB             | N/A                        |
| Windows    | 264 MB        | TBD                        | 205 MB                | 274 MB                                           |                    |               | N/A                | 188 MB                     |

| whl size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| -------- | ------------- | -------------------------- | --------------------- | ------------------------------------------------ | ------------------ | ------------- | ------------------ | -------------------------- |
| Linux    | 294 MB        | 154 MB                     | TBD                   | TBD                                              | N/A                | 278 MB        | 203 MB             | N/A                        |
| Windows  | 271 MB        | 142 MB                     | TBD                   | 280 MB                                           | 184 MB             | N/A           | N/A                | 194 MB                     |

### Reference

* https://developer.nvidia.com/cuda-gpus
* [Improving GPU Application Performance with NVIDIA CUDA 11.2 Device Link Time Optimization](https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/)
* [PTX Compatibility](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#ptx-compatibility)
* [Application Compatibility on the NVIDIA Ada GPU Architecture](https://docs.nvidia.com/cuda/ada-compatibility-guide/#application-compatibility-on-the-nvidia-ada-gpu-architecture)
* [Software Migration Guide for NVIDIA Blackwell RTX GPUs: A Guide to CUDA 12.8, PyTorch, TensorRT, and Llama.cpp](https://forums.developer.nvidia.com/t/software-migration-guide-for-nvidia-blackwell-rtx-gpus-a-guide-to-cuda-12-8-pytorch-tensorrt-and-llama-cpp/321330)

### Track some failed/unfinished experiments to control package size:

1. Building ORT with `CUDNN_FRONTEND_SKIP_JSON_LIB=ON` does not help much with package size.
2. ORT packaging uses 7z to pack the package, which for these formats can only use zip's deflate compression. With deflate, raising the compression ratio to ultra (`-mx=9`) does not help much (7z's LZMA compression is much better, but LZMA-compressed archives are not supported by NuGet/PyPI).
3. Simply replacing `sm_xx` with `lto_xx` would increase the CUDA EP library size by ~50% (perf not tested yet). This needs further validation.
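The `-real`/`-virtual` suffixes in the `cmake_cuda_architecture` strings above follow CMake's `CMAKE_CUDA_ARCHITECTURES` convention: a bare number emits both SASS and PTX for that architecture, `-real` emits SASS only, and `-virtual` emits PTX only. A minimal sketch of how such a string maps onto nvcc `-gencode` flags (the `gencode_flags` helper is hypothetical, for illustration only; it is not part of the ORT build scripts):

```python
def gencode_flags(arch_list: str) -> list[str]:
    """Map a CMAKE_CUDA_ARCHITECTURES-style string to nvcc -gencode flags.

    Per CMake's convention: "NN-real" -> SASS only, "NN-virtual" -> PTX only,
    bare "NN" -> both SASS and PTX for that architecture.
    """
    flags = []
    for entry in arch_list.split(";"):
        if entry.endswith("-real"):
            n = entry[: -len("-real")]
            flags.append(f"-gencode=arch=compute_{n},code=sm_{n}")  # SASS only
        elif entry.endswith("-virtual"):
            n = entry[: -len("-virtual")]
            flags.append(f"-gencode=arch=compute_{n},code=compute_{n}")  # PTX only
        else:
            # Bare number: embed both the cubin (SASS) and the PTX for JIT.
            flags.append(f"-gencode=arch=compute_{entry},code=[compute_{entry},sm_{entry}]")
    return flags

# The Windows nupkg list from the table above: SASS for sm52..sm89, PTX-only for sm90.
print(gencode_flags("52-real;61-real;75-real;86-real;89-real;90-virtual"))
```

This is why `90-virtual` keeps newer GPUs working at the cost of a JIT step: only PTX is shipped for sm90 and above, while the listed consumer architectures get native SASS.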
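The deflate-vs-LZMA gap behind experiment 2 can be illustrated with Python's standard library. The payload below is a toy stand-in for the provider binaries (actual ratios on SASS/PTX sections will differ), but the ordering of the two codecs holds:

```python
import lzma
import zlib

# Toy repetitive payload standing in for package contents (illustrative only).
payload = b"onnxruntime providers_cuda kernel blob " * 32768  # ~1.2 MiB

deflated = zlib.compress(payload, level=9)          # zip's deflate at max ratio (-mx=9 analogue)
lzma_compressed = lzma.compress(payload, preset=9)  # 7z's LZMA at max ratio

print(f"raw: {len(payload)}, deflate -9: {len(deflated)}, lzma -9: {len(lzma_compressed)}")
```

Both .nupkg and wheel files are zip archives, which is why the packaging step is stuck with deflate; the LZMA number is shown only to quantify what that restriction costs.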
ashrit-ms
pushed a commit
that referenced
this pull request
Mar 17, 2025