
Enable Relocatable Device Code (RDC) to build ORT with cuda 12.8 #23562

Merged: yf711 merged 6 commits into main from yifanl/cu128_build on Feb 13, 2025

Conversation


@yf711 yf711 commented Feb 3, 2025

Description

When building ORT on Windows with CUDA 12.8, there were compile errors, and the log prompted: `To resolve this issue, either use "-rdc=true", or explicitly set "-static-global-template-stub=false" (but see nvcc documentation about downsides of turning it off)`

This PR
* enables `-rdc=true` ([Relocatable Device Code (RDC)](https://forums.developer.nvidia.com/t/the-cost-of-relocatable-device-code-rdc-true/47665))
* enables [CUDA_SEPARABLE_COMPILATION](https://cmake.org/cmake/help/latest/prop_tgt/CUDA_SEPARABLE_COMPILATION.html) to support separate compilation of device code
* skips the 4505 compiler check, since enabling RDC triggers an internal-linkage check that turns warning C4505 into an error:

```
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\include\crt/host_runtime.h(274): error C2220: the following warning is treated as an error [C:\Users\yifanl\Downloads\0202-new-cmake-config\Release\onnxruntime_providers_cuda.vcxproj]
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\include\crt/host_runtime.h(274): warning C4505: '__cudaUnregisterBinaryUtil': unreferenced function with internal linkage has been removed
```
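The fix can be sketched in CMake roughly as below. This is a minimal illustration, not ORT's actual cmake files; `cuda_provider_target` is a placeholder name:

```cmake
# Enable relocatable device code / separate compilation (nvcc -rdc=true).
set_target_properties(cuda_provider_target PROPERTIES
  CUDA_SEPARABLE_COMPILATION ON)

# With -rdc=true, CUDA 12.8's crt/host_runtime.h leaves
# __cudaUnregisterBinaryUtil unreferenced, so MSVC's C4505 warning fires
# and /WX promotes it to an error; exempt this one warning.
if(MSVC)
  target_compile_options(cuda_provider_target PRIVATE
    $<$<COMPILE_LANGUAGE:CUDA>:-Xcompiler=/wd4505>)
endif()
```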

Motivation and Context

@yf711 changed the title from "enable rdc and skip error" to "Enable Relocatable Device Code (RDC) to build ORT with cuda 12.8" Feb 3, 2025
@yf711 yf711 marked this pull request as ready for review February 3, 2025 07:29
@yf711 yf711 requested a review from snnn February 3, 2025 07:29

snnn commented Feb 3, 2025

I still see errors like:

```
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\include\cuda/std/detail/libcxx/include/cmath(1032): error #221-D: floating-point value does not fit in required floating-point type [D:\onnxruntime\b\Debug\onnxruntime_providers_cuda.vcxproj]
      if (__r >= ::nextafter(static_cast<_RealT>(_MaxVal), ((float)(1e+300))))
                                                               ^
```


yf711 commented Feb 4, 2025

> I still see errors like:
>
> ```
> C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\include\cuda/std/detail/libcxx/include/cmath(1032): error #221-D: floating-point value does not fit in required floating-point type [D:\onnxruntime\b\Debug\onnxruntime_providers_cuda.vcxproj]
>       if (__r >= ::nextafter(static_cast<_RealT>(_MaxVal), ((float)(1e+300))))
>                                                                ^
> ```

I can't repro this issue in my env (sm75 GPU), but it seems stricter diagnostics in the CUDA 12.8 header files cause this error.
I suppressed error 221. Please verify whether this helps on your side.
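One way to suppress this diagnostic in CMake, assuming the flag is passed straight to nvcc (`cuda_provider_target` is a placeholder name, not ORT's actual target):

```cmake
# Silence nvcc diagnostic #221 ("floating-point value does not fit in
# required floating-point type") raised by CUDA 12.8's libcu++ headers.
target_compile_options(cuda_provider_target PRIVATE
  $<$<COMPILE_LANGUAGE:CUDA>:--diag-suppress=221>)
```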


@snnn snnn left a comment


Thanks. I tried it. It's good.

@tianleiwu

@yf711, could you run some benchmarks to see how much performance impact separate compilation has (since device code is no longer fully optimized across compilation units)?

If we look at the graph in https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/, the impact is actually very large:


yf711 commented Feb 5, 2025

> @yf711, Could you run some benchmark to see how much performance impact using separate compilation (thus no longer fully optimized)?
>
> If we look at the graph in the https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/. The impact is actually very large:

I just ran benchmarks on the EP Perf CI with a series of ONNX Model Zoo models, and there seems to be no significant perf regression compared to the main branch. Some models are slightly faster/slower than on main, but their latency diff is within 5%.

The EP Perf CI runs on Ubuntu with a Python env. I also tested on a Windows desktop with/without this PR on a few models (ResNet50, Faster R-CNN) via onnxruntime_perf_test and saw similar results.

@snnn

snnn commented Feb 5, 2025


tianleiwu commented Feb 5, 2025

> I just ran benchmark on EP Perf CI with series of onnx zoo models and it seems the perf has no significant regression compared to the main branch. Some models are slight faster/slower than main branch, but their latency diff is within 5%.

Please make sure you test the perf of the CUDA EP instead of the TRT EP.
The TRT EP is linked with the pre-compiled TRT library, so it is less impacted by this option.


yf711 commented Feb 5, 2025

> I just ran benchmark on EP Perf CI with series of onnx zoo models and it seems the perf has no significant regression compared to the main branch. Some models are slight faster/slower than main branch, but their latency diff is within 5%.

> Please make sure you test the perf of CUDA EP instead of TRT EP. TRT EP is linked with pre-compiled TRT library so it is less impacted by this option.

Thanks for the comment. I just ran a perf comparison on a Windows desktop (T1000 GPU, sm75) across the main branch, the current PR, and the PR with LTO:

| `.\onnxruntime_perf_test.exe -e cuda -r 1000` | Main (cu126) | cu128 + separate compilation | cu128 + separate compilation with LTO |
| --- | --- | --- | --- |
| faster_rcnn_R_50_FPN_1x | 24.8741 ms | 24.7539 ms | 25.3358 ms |
| resnet50-v2-7 | 8.83048 ms | 8.82946 ms | 8.86747 ms |

(Cells show average inference time cost.)

So far, I haven't seen a perf regression on the CUDA EP, but I will find more models to test. Feel free to try this PR and let me know if you see a perf regression.

On the other hand, I am still exploring LTO, which might need broader config changes.
If there's no significant perf change in this PR, I will merge it and adopt LTO in another PR.
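As a quick sanity check on the figures above, the relative latency change of each configuration versus main can be computed directly (a small standalone sketch; the numbers are copied from the table):

```python
# Average inference latencies (ms) from the comparison above:
# (main cu126, cu128 + separate compilation, cu128 + separate compilation with LTO)
latencies = {
    "faster_rcnn_R_50_FPN_1x": (24.8741, 24.7539, 25.3358),
    "resnet50-v2-7": (8.83048, 8.82946, 8.86747),
}

def pct_diff(baseline, candidate):
    """Relative latency change versus baseline, in percent (negative = faster)."""
    return (candidate - baseline) / baseline * 100.0

for model, (main, sep, sep_lto) in latencies.items():
    print(f"{model}: separate compilation {pct_diff(main, sep):+.2f}%, "
          f"with LTO {pct_diff(main, sep_lto):+.2f}%")
```

Every configuration stays well inside the 5% band quoted above, with the LTO build of faster_rcnn showing the largest change (about +2%).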

@tianleiwu

I did some tests using a bert-large model on H100 and Ubuntu; latency at batch size 16 and sequence length 256 increased by 1.2% after this change, so it has some negative impact on performance.

BTW, building the wheel only (no tests) does not need this change on Linux. Shall we limit the scope (e.g. Windows only, tests only)?


snnn commented Feb 11, 2025

Isn't the 1.2% change just variance? I don't know much about CUDA, but our CPU build's performance typically varies more than that: if you run the same benchmark again and again, the numbers vary.


tianleiwu commented Feb 11, 2025

> The 1.2% change isn't a variance? I don't know much about CUDA. But, our CPU build's performance typically varies larger than that. I mean, if you run the same benchmark again and again, the number varies.

I ran it 3 times (10570 samples per benchmark). The average latency (in ms) of the baseline (main branch): 2.495, 2.502, 2.504; of this branch: 2.545, 2.530, 2.537. There is some variance, but the trend is the same.
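Computed from the quoted runs, the regression is small but consistent: every run of this branch is slower than every baseline run, so it is unlikely to be noise (a quick check on the numbers above; the mean-over-means slowdown comes out near 1.5%):

```python
# Average latencies (ms) over three runs, 10570 samples per benchmark.
baseline = [2.495, 2.502, 2.504]  # main branch
branch = [2.545, 2.530, 2.537]    # this PR

mean = lambda xs: sum(xs) / len(xs)
regression_pct = (mean(branch) / mean(baseline) - 1.0) * 100.0

# The slowest baseline run is still faster than the fastest run of this
# branch, so the slowdown is a consistent trend rather than run-to-run noise.
consistent = min(branch) > max(baseline)
print(f"mean regression: {regression_pct:.2f}%, consistent: {consistent}")
```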


yf711 commented Feb 12, 2025

Will merge this PR to unblock local ORT builds on CUDA 12.8. Here's the next PR to continue the work.

@yf711 yf711 merged commit c95d828 into main Feb 13, 2025
91 of 106 checks passed
@yf711 yf711 deleted the yifanl/cu128_build branch February 13, 2025 07:02
yf711 added a commit that referenced this pull request Feb 21, 2025
### Description
Action items:
* ~~Add LTO support when CUDA 12.8 & Relocatable Device Code (RDC)/separate compilation are enabled, to reduce potential perf regression~~ (LTO needs further testing)

* Reduce nuget/whl package size by selecting target devices & their CUDA binary (SASS)/PTX assembly during the ORT build;
  * make sure the ORT nuget package stays < 250 MB and the Python wheel < 300 MB

* Suggest creating an internal repo to publish a pre-built package with Blackwell sm100/120 SASS and sm120 PTX, e.g. [onnxruntime-blackwell](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/onnxruntime-blackwell), since that package size will be much larger than the nuget/pypi size limits

* Considering the most popular datacenter/consumer GPUs, here's the cuda_arch list for Linux/Windows:
  * With this change, perf of the next ORT release is optimal on Linux with Tesla P100 (sm60), V100 (sm70), T4 (sm75), A100 (sm80), A10 (sm86, py whl), and H100 (sm90); and on Windows with GTX 980 (sm52), GTX 1080 (sm61), RTX 2080 (sm75), RTX 3090 (sm86), and RTX 4090 (sm89). GPUs with newer architectures are compatible.
  
| OS | cmake_cuda_architecture | package size |
| --- | --- | --- |
| Linux nupkg | 60-real;70-real;75-real;80-real;90 | 215 MB |
| Linux whl | 60-real;70-real;75-real;80-real;86-real;90 | 268 MB |
| Windows nupkg | 52-real;61-real;75-real;86-real;89-real;90-virtual | 197 MB |
| Windows whl | 52-real;61-real;75-real;86-real;89-real;90-virtual | 204 MB |

* [TODO] Validate on the Windows CUDA CI pipeline with cu128
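The architecture lists above map onto CMake's `CMAKE_CUDA_ARCHITECTURES`, where a `-real` suffix emits SASS only, `-virtual` emits PTX only, and a bare entry emits both SASS and forward-compatible PTX for that arch. A sketch using the Linux nupkg list (illustrative, not ORT's actual build scripts):

```cmake
# SASS for sm60/70/75/80, plus SASS and forward-compatible PTX for sm90.
set(CMAKE_CUDA_ARCHITECTURES "60-real;70-real;75-real;80-real;90")
```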

### Motivation and Context
Addresses topics discussed in
#23562 and
#23309

#### Stats

| libonnxruntime_providers_cuda lib size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linux | 446 MB | 241 MB | 362 MB | 482 MB | N/A | 422 MB | 301 MB | |
| Windows | 417 MB | 224 MB | 338 MB | 450 MB | 279 MB | N/A | | 292 MB |

| nupkg size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linux | 287 MB | TBD | 224 MB | 299 MB | | | 197 MB | N/A |
| Windows | 264 MB | TBD | 205 MB | 274 MB | | | N/A | 188 MB |

| whl size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linux | 294 MB | 154 MB | TBD | TBD | N/A | 278 MB | 203 MB | N/A |
| Windows | 271 MB | 142 MB | TBD | 280 MB | 184 MB | N/A | N/A | 194 MB |

### Reference
https://developer.nvidia.com/cuda-gpus
[Improving GPU Application Performance with NVIDIA CUDA 11.2 Device Link
Time
Optimization](https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/)
[PTX
Compatibility](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#ptx-compatibility)
[Application Compatibility on the NVIDIA Ada GPU
Architecture](https://docs.nvidia.com/cuda/ada-compatibility-guide/#application-compatibility-on-the-nvidia-ada-gpu-architecture)
[Software Migration Guide for NVIDIA Blackwell RTX GPUs: A Guide to CUDA
12.8, PyTorch, TensorRT, and
Llama.cpp](https://forums.developer.nvidia.com/t/software-migration-guide-for-nvidia-blackwell-rtx-gpus-a-guide-to-cuda-12-8-pytorch-tensorrt-and-llama-cpp/321330)

### Tracking some failed/unfinished experiments to control package size:
1. Building ORT with `CUDNN_FRONTEND_SKIP_JSON_LIB=ON` doesn't help much with package size;
2. ORT packaging uses 7z to pack the package, which can only use zip's deflate compression. In that format, setting the compression ratio to ultra (`-mx=9`) doesn't help much to control size (7z's LZMA compression is much better, but not supported by nuget/pypi);
3. Simply replacing `sm_xx` with `lto_xx` would increase the CUDA EP library size by ~50% (haven't tested perf yet). This needs further validation.
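For reference, the `lto_xx` experiment in item 3 corresponds to nvcc's device link-time optimization: compiling each arch to LTO intermediate code instead of SASS and optimizing at device link time with `-dlto`. A hedged sketch (placeholder target name; not the configuration ORT ships):

```cmake
# Emit LTO intermediate code for sm75 instead of SASS, then run device
# LTO at link time (requires CUDA_SEPARABLE_COMPILATION on the target).
target_compile_options(cuda_provider_target PRIVATE
  $<$<COMPILE_LANGUAGE:CUDA>:-gencode=arch=compute_75,code=lto_75>
  $<$<COMPILE_LANGUAGE:CUDA>:-dlto>)
target_link_options(cuda_provider_target PRIVATE
  $<DEVICE_LINK:-dlto>)
```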
guschmue pushed a commit that referenced this pull request Mar 6, 2025
guschmue pushed a commit that referenced this pull request Mar 6, 2025
ashrit-ms pushed a commit that referenced this pull request Mar 17, 2025
ashrit-ms pushed a commit that referenced this pull request Mar 17, 2025
rvinluan-sidefx pushed a commit to sideeffects/onnxruntime that referenced this pull request Jun 13, 2025