Enable Relocatable Device Code (RDC) to build ORT with CUDA 12.8 #23562
### Conversation
I still see errors like:
I can't repro this issue in my env (sm75 GPU), but it seems the stricter diagnostics in the CUDA 12.8 header files cause this error.
snnn left a comment:
Thanks. I tried it. It's good.
@yf711, could you run some benchmarks to see how much performance impact separate compilation has (since the device code is no longer fully optimized)? Looking at the graphs in https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/, the impact can be very large:
I just ran benchmarks on the EP Perf CI with a series of ONNX Model Zoo models, and the perf shows no significant regression compared to the main branch. Some models are slightly faster/slower than on the main branch, but the latency diff is within 5%. The EP Perf CI runs on Ubuntu in a Python env. I also tested on a Windows desktop with/without this PR on a few models (ResNet50, FRCNN) via ort_perf_test and saw similar results.
I tried your branch in our Windows CUDA CI pipeline, but there were some errors:
Please make sure you test the perf of the CUDA EP instead of the TRT EP.
Thanks for the comment. I just ran some perf comparisons on a Windows desktop (T1000 GPU, sm75) across the main branch, the current PR, and the PR with LTO:
So far I haven't seen a perf regression on the CUDA EP, but I will find more models to test. Feel free to try this PR and let me know if you see a perf regression. On the other hand, I am still exploring LTO, which might need a broader config change.
I did some tests using a bert-large model on H100 and Ubuntu, and latency at batch size 16 and sequence length 256 increased by 1.2% after this change, so it has some negative impact on performance. BTW, building the wheel only (no tests) does not need this change on Linux. Shall we limit the scope (Windows only, tests only, etc.)?
Isn't the 1.2% change just variance? I don't know much about CUDA, but our CPU build's performance typically varies more than that. I mean, if you run the same benchmark again and again, the numbers vary.
I ran it 3 times (10570 samples per benchmark). The average latency (in ms) of the baseline (main branch): 2.495, 2.502, 2.504; latency of this branch: 2.545, 2.530, 2.537. There is some variance, but the trend is the same.
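For reference, the gap between the averages quoted above can be checked with a quick calculation; this is only a sketch of the arithmetic, not a statistical test:

```python
# Latency figures quoted above (ms), three runs per branch.
baseline = [2.495, 2.502, 2.504]   # main branch
candidate = [2.545, 2.530, 2.537]  # this PR

mean_base = sum(baseline) / len(baseline)
mean_cand = sum(candidate) / len(candidate)
regression_pct = (mean_cand - mean_base) / mean_base * 100

print(f"baseline mean:  {mean_base:.3f} ms")
print(f"candidate mean: {mean_cand:.3f} ms")
print(f"regression:     {regression_pct:.2f}%")  # ~1.5% on these runs
```

The run-to-run spread within each branch (~0.01 ms) is small relative to the shift between branches, which is why the trend reads as consistent rather than as noise.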
Will merge this PR to unblock local ORT builds on CUDA 12.8. Here's the next PR to continue the work:
### Description

Action items:

* ~~Add LTO support when CUDA 12.8 & Relocatable Device Code (RDC)/separate compilation are enabled, to reduce potential perf regression~~ LTO needs further testing
* Reduce nuget/whl package size by selecting target devices and their CUDA binary (SASS)/PTX assembly during the ORT build:
  * Make sure the ORT nuget package stays < 250 MB and the Python wheel < 300 MB
  * Suggest creating an internal feed, like [onnxruntime-blackwell](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/onnxruntime-blackwell), to publish a pre-built package with Blackwell sm100/120 SASS and sm120 PTX, since that package will be much larger than the nuget/pypi size limits
  * Considering the most popular datacenter/consumer GPUs, here's the cuda_arch list for Linux/Windows:

    | OS | cmake_cuda_architecture | package size |
    | ------------- | ------------------------------------------ | ------------ |
    | Linux nupkg | 60-real;70-real;75-real;80-real;90 | 215 MB |
    | Linux whl | 60-real;70-real;75-real;80-real;86-real;90 | 268 MB |
    | Windows nupkg | 52-real;61-real;75-real;86-real;89-real;90-virtual | 197 MB |
    | Windows whl | 52-real;61-real;75-real;86-real;89-real;90-virtual | 204 MB |

  * With this change, perf of the next ORT release is optimal on Linux with Tesla P100 (sm60), V100 (sm70), T4 (sm75), A100 (sm80), A10 (sm86, py whl), and H100 (sm90); and on Windows with GTX 980 (sm52), GTX 1080 (sm61), RTX 2080 (sm75), RTX 3090 (sm86), and RTX 4090 (sm89). GPUs with other, newer architectures remain compatible.
* [TODO] Validate on the Windows CUDA CI pipeline with cu128

### Motivation and Context

Address the topics discussed in #23562 and #23309.

#### Stats

| libonnxruntime_providers_cuda lib size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| -------------------------------------- | ------------- | -------------------------- | --------------------- | ------------------------------------------------ | ------------------ | ------------- | ------------------ | -------------------------- |
| Linux | 446 MB | 241 MB | 362 MB | 482 MB | N/A | 422 MB | 301 MB | |
| Windows | 417 MB | 224 MB | 338 MB | 450 MB | 279 MB | N/A | | 292 MB |

| nupkg size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| ---------- | ------------- | -------------------------- | --------------------- | ------------------------------------------------ | ------------------ | ------------- | ------------------ | -------------------------- |
| Linux | 287 MB | TBD | 224 MB | 299 MB | | | 197 MB | N/A |
| Windows | 264 MB | TBD | 205 MB | 274 MB | | | N/A | 188 MB |

| whl size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| -------- | ------------- | -------------------------- | --------------------- | ------------------------------------------------ | ------------------ | ------------- | ------------------ | -------------------------- |
| Linux | 294 MB | 154 MB | TBD | TBD | N/A | 278 MB | 203 MB | N/A |
| Windows | 271 MB | 142 MB | TBD | 280 MB | 184 MB | N/A | N/A | 194 MB |

### Reference

* https://developer.nvidia.com/cuda-gpus
* [Improving GPU Application Performance with NVIDIA CUDA 11.2 Device Link Time Optimization](https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/)
* [PTX Compatibility](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#ptx-compatibility)
* [Application Compatibility on the NVIDIA Ada GPU Architecture](https://docs.nvidia.com/cuda/ada-compatibility-guide/#application-compatibility-on-the-nvidia-ada-gpu-architecture)
* [Software Migration Guide for NVIDIA Blackwell RTX GPUs: A Guide to CUDA 12.8, PyTorch, TensorRT, and Llama.cpp](https://forums.developer.nvidia.com/t/software-migration-guide-for-nvidia-blackwell-rtx-gpus-a-guide-to-cuda-12-8-pytorch-tensorrt-and-llama-cpp/321330)

### Failed/unfinished experiments to control package size

1. Building ORT with `CUDNN_FRONTEND_SKIP_JSON_LIB=ON` doesn't help much with package size.
2. ORT packaging uses 7z to pack the package, which in this format can only use zip's deflate compression; setting the compression ratio to ultra (`-mx=9`) doesn't help much either. (7z's LZMA compression is much better, but isn't supported by nuget/pypi.)
3. Simply replacing `sm_xx` with `lto_xx` would increase the CUDA EP library size by ~50% (perf not tested yet). This needs further validation.
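The architecture lists in the tables above are plain `CMAKE_CUDA_ARCHITECTURES` values, so a local build can reproduce (or trim) a package's arch selection. A sketch, assuming ORT's `build.sh` and its `--cmake_extra_defines` pass-through (the exact flag set for a full package build will differ):

```shell
# Build the CUDA EP with Turing/Ampere SASS plus Hopper SASS+PTX,
# mirroring the "75-real;80-real;90" style entries in the tables above.
./build.sh --config Release --use_cuda --build_wheel --parallel \
  --cmake_extra_defines "CMAKE_CUDA_ARCHITECTURES=75-real;80-real;90"
```

`NN-real` emits only SASS for that arch; a bare `NN` (or `NN-virtual`) also embeds PTX, which keeps newer GPUs compatible via JIT at the cost of package size.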
### Description

When building ORT on Windows with CUDA 12.8, there were compile errors, and the log prompted:

`To resolve this issue, either use "-rdc=true", or explicitly set "-static-global-template-stub=false" (but see nvcc documentation about downsides of turning it off)`

This PR:

* enables `-rdc=true` ([Relocatable Device Code (RDC)](https://forums.developer.nvidia.com/t/the-cost-of-relocatable-device-code-rdc-true/47665))
* enables [CUDA_SEPARABLE_COMPILATION](https://cmake.org/cmake/help/latest/prop_tgt/CUDA_SEPARABLE_COMPILATION.html) to support separate compilation of device code
* skips the C4505 compiler check, as enabling RDC triggers checks on internal linkage and produces a C4505 warning that is treated as an error:

```
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\include\crt/host_runtime.h(274): error C2220: the following warning is treated as an error [C:\Users\yifanl\Downloads\0202-new-cmake-config\Release\onnxruntime_providers_cuda.vcxproj]
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8\include\crt/host_runtime.h(274): warning C4505: '__cudaUnregisterBinaryUtil': unreferenced function with internal linkage has been removed
```
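The CMake mechanics behind the bullets above can be sketched as follows; the target name and exact option plumbing here are assumptions for illustration, not the PR's actual diff:

```cmake
# Turning on CUDA_SEPARABLE_COMPILATION makes CMake pass -rdc=true
# (relocatable device code) to nvcc and add a device-link step.
set_target_properties(onnxruntime_providers_cuda PROPERTIES
  CUDA_SEPARABLE_COMPILATION ON)

if (MSVC)
  # With RDC on, CUDA 12.8's crt/host_runtime.h trips MSVC warning C4505
  # ("unreferenced function with internal linkage has been removed"),
  # which /WX promotes to an error, so suppress it for CUDA sources.
  target_compile_options(onnxruntime_providers_cuda PRIVATE
    "$<$<COMPILE_LANGUAGE:CUDA>:SHELL:-Xcompiler /wd4505>")
endif()
```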