
Conversation

@tinglvv (Collaborator) commented Aug 19, 2025

@tinglvv tinglvv requested review from a team and jeffdaily as code owners August 19, 2025 22:41
@pytorch-bot (bot) commented Aug 19, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161013

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure, 1 Unrelated Failure

As of commit 767c0b6 with merge base 4774208:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Aug 19, 2025
@tinglvv tinglvv mentioned this pull request Aug 19, 2025
15 tasks
@atalman atalman added ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ci-no-td Do not run TD on this PR keep-going Don't stop on first failure, keep running tests until the end labels Aug 20, 2025
@janeyx99 janeyx99 requested a review from atalman August 20, 2025 21:43
@janeyx99 janeyx99 added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Aug 20, 2025
needs: get-label-type
with:
  runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
  cuda-arch-list: 7.5
@nWEIdia (Collaborator) commented Aug 22, 2025

Similar to how you fixed the cuda 13.0 vs cuda 13 issue: it would be slightly better to use the string form '7.5' here for future upgrade purposes. If we later update 7.5 to 10.0, the unquoted value is prone to being truncated to 10, which may cause "sm10 not recognized". Using '7.5' makes a future upgrade to an "X.0" value safer by preventing truncation of the ".0". So let's make it a string.

@atalman (Contributor) commented Aug 22, 2025

@tinglvv looks like the current issue is:

/var/lib/jenkins/workspace/c10/cuda/driver_api.cpp: In function ‘void* c10::cuda::{anonymous}::get_symbol(const char*, int)’:
/var/lib/jenkins/workspace/c10/cuda/driver_api.cpp:65:40: error: ‘cudaError_t cudaGetDriverEntryPoint(const char*, void**, long long unsigned int, cudaDriverEntryPointQueryResult*)’ is deprecated [-Werror=deprecated-declarations]
   65 |   if (auto st = cudaGetDriverEntryPoint(name, &out, cudaEnableDefault, &qres);
      |                 ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/local/cuda/include/channel_descriptor.h:61,
                 from /usr/local/cuda/include/cuda_runtime.h:94,
                 from /var/lib/jenkins/workspace/c10/cuda/CUDAMiscFunctions.h:6,
                 from /var/lib/jenkins/workspace/c10/cuda/CUDAException.h:5,
                 from /var/lib/jenkins/workspace/c10/cuda/driver_api.cpp:2:
/usr/local/cuda/include/cuda_runtime_api.h:13101:57: note: declared here
13101 | extern __CUDA_DEPRECATED __host__ cudaError_t CUDARTAPI cudaGetDriverEntryPoint(const char *symbol, void **funcPtr, unsigned long long flags, enum cudaDriverEntryPointQueryResult *driverStatus = NULL);
      |                                                         ^~~~~~~~~~~~~~~~~~~~~~~
cc1plus: all warnings being treated as errors

@Aidyn-A (Collaborator) commented Aug 22, 2025

Indeed this API is deprecated in CUDA 13:

This API is deprecated and cudaGetDriverEntryPointByVersion (with a hardcoded cudaVersion) should be used instead.

All we can do is patch it:

diff --git a/c10/cuda/driver_api.cpp b/c10/cuda/driver_api.cpp
index f936b02ec9a..4b135bcce65 100644
--- a/c10/cuda/driver_api.cpp
+++ b/c10/cuda/driver_api.cpp
@@ -62,10 +62,13 @@ void* get_symbol(const char* name, int version) {
 #endif
 
   // This fallback to the old API to try getting the symbol again.
+  // As of CUDA 13, this API is deprecated.
+#if defined(CUDA_VERSION) && (CUDA_VERSION < 13000)
   if (auto st = cudaGetDriverEntryPoint(name, &out, cudaEnableDefault, &qres);
       st == cudaSuccess && qres == cudaDriverEntryPointSuccess && out) {
     return out;
   }
+#endif
 
   // If the symbol cannot be resolved, report and return nullptr;
   // the caller is responsible for checking the pointer.

@atalman atalman requested review from eqy and syed-ahmed as code owners August 22, 2025 13:48
@atalman (Contributor) commented Aug 22, 2025

@tinglvv and @Aidyn-A, looks like the next issue is:

/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/platform/platform.h:599:33: error: ‘long4’ is deprecated: use long4_16a or long4_32a [-Werror=deprecated-declarations]
  599 | struct alignment_of<long4> {
      |                                 ^    
In file included from /usr/local/cuda-13.0/targets/x86_64-linux/include/driver_types.h:61,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/builtin_types.h:59,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/crt/host_runtime.h:58,
                 from /usr/lib/gcc/x86_64-linux-gnu/11/include/stddef.h:213:
/usr/local/cuda-13.0/targets/x86_64-linux/include/vector_types.h:530:98: note: declared here
  530 | typedef __device_builtin__ struct long4 __VECTOR_TYPE_DEPRECATED__("use long4_16a or long4_32a") long4;
      |                                                                                                  ^~~~~
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/platform/platform.h:603:33: error: ‘ulong4’ is deprecated: use ulong4_16a or ulong4_32a [-Werror=deprecated-declarations]
  603 | struct alignment_of<ulong4> {
      |                                 ^     
In file included from /usr/local/cuda-13.0/targets/x86_64-linux/include/driver_types.h:61,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/builtin_types.h:59,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/crt/host_runtime.h:58,
                 from /usr/lib/gcc/x86_64-linux-gnu/11/include/stddef.h:213:
/usr/local/cuda-13.0/targets/x86_64-linux/include/vector_types.h:531:101: note: declared here
  531 | typedef __device_builtin__ struct ulong4 __VECTOR_TYPE_DEPRECATED__("use ulong4_16a or ulong4_32a") ulong4;
      |                                                                                                     ^~~~~~
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/platform/platform.h:619:33: error: ‘longlong4’ is deprecated: use longlong4_16a or longlong4_32a [-Werror=deprecated-declarations]
  619 | struct alignment_of<longlong4> {
      |                                 ^        
In file included from /usr/local/cuda-13.0/targets/x86_64-linux/include/driver_types.h:61,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/builtin_types.h:59,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/crt/host_runtime.h:58,
                 from /usr/lib/gcc/x86_64-linux-gnu/11/include/stddef.h:213:
/usr/local/cuda-13.0/targets/x86_64-linux/include/vector_types.h:548:110: note: declared here
  548 | typedef __device_builtin__ struct longlong4 __VECTOR_TYPE_DEPRECATED__("use longlong4_16a or longlong4_32a") longlong4;
      |                                                                                                              ^~~~~~~~~
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/platform/platform.h:623:33: error: ‘ulonglong4’ is deprecated: use ulonglong4_16a or ulonglong4_32a [-Werror=deprecated-declarations]
  623 | struct alignment_of<ulonglong4> {
      |                                 ^         
In file included from /usr/local/cuda-13.0/targets/x86_64-linux/include/driver_types.h:61,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/builtin_types.h:59,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/crt/host_runtime.h:58,
                 from /usr/lib/gcc/x86_64-linux-gnu/11/include/stddef.h:213:
/usr/local/cuda-13.0/targets/x86_64-linux/include/vector_types.h:549:113: note: declared here
  549 | typedef __device_builtin__ struct ulonglong4 __VECTOR_TYPE_DEPRECATED__("use ulonglong4_16a or ulonglong4_32a") ulonglong4;
      |                                                                                                                 ^~~~~~~~~~
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/platform/platform.h:627:33: error: ‘double4’ is deprecated: use double4_16a or double4_32a [-Werror=deprecated-declarations]
  627 | struct alignment_of<double4> {
      |                                 ^      
In file included from /usr/local/cuda-13.0/targets/x86_64-linux/include/driver_types.h:61,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/builtin_types.h:59,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/crt/host_runtime.h:58,
                 from /usr/lib/gcc/x86_64-linux-gnu/11/include/stddef.h:213:
/usr/local/cuda-13.0/targets/x86_64-linux/include/vector_types.h:559:104: note: declared here
  559 | typedef __device_builtin__ struct double4 __VECTOR_TYPE_DEPRECATED__("use double4_16a or double4_32a") double4;
      |                                                      

@tinglvv (Collaborator, Author) commented Aug 22, 2025

From @Aidyn-A
"This is something we cannot patch easily, as the warning originates in CUTLASS header. They did not replace the deprecated type yet https://github.com/NVIDIA/cutlass/blob/11cad1f67b36879934ea75383d9323296b6dd45b/include/cutlass/platform/platform.h#L626-L629"

Adding the COMPILE_FLAGS -Wno-deprecated-declarations to unblock the build. @ptrblck also suggested guarding the deprecated API so it is not compiled with CUDA 13+.

@tinglvv (Collaborator, Author) commented Aug 22, 2025

Interestingly, the normal CD binary build does not hit this deprecation error - https://github.com/pytorch/pytorch/actions/runs/17147872787/job/48647552615
Its build log reads:

2025-08-22T06:38:41.8979667Z --   CMake version         : 4.1.0
2025-08-22T06:38:41.8980289Z --   CMake command         : /opt/_internal/cpython-3.10.18/lib/python3.10/site-packages/cmake/data/bin/cmake
2025-08-22T06:38:41.8980930Z --   System                : Linux
2025-08-22T06:38:41.8981360Z --   C++ compiler          : /opt/rh/gcc-toolset-13/root/usr/bin/c++
2025-08-22T06:38:41.8981821Z --   C++ compiler id       : GNU
2025-08-22T06:38:41.8982169Z --   C++ compiler version  : 13.3.1
2025-08-22T06:38:41.8982512Z --   Using ccache if found : ON
2025-08-22T06:38:41.8982897Z --   Found ccache          : CCACHE_PROGRAM-NOTFOUND
2025-08-22T06:38:41.8986831Z --   CXX flags             :  -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -DC10_NODEPRECATED -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-dangling-reference -Wno-error=dangling-reference -Wno-stringop-overflow

In the CI testing build, -Werror promotes all warnings, including deprecations, to errors. Should we use the same settings as the binary build?

2025-08-22T14:26:12.7750481Z --   CMake version         : 4.0.0
2025-08-22T14:26:12.7751272Z --   CMake command         : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/cmake/data/bin/cmake
2025-08-22T14:26:12.7752112Z --   System                : Linux
2025-08-22T14:26:12.7752629Z --   C++ compiler          : /opt/cache/bin/c++
2025-08-22T14:26:12.7753156Z --   C++ compiler id       : GNU
2025-08-22T14:26:12.7753612Z --   C++ compiler version  : 11.4.0
2025-08-22T14:26:12.7754079Z --   Using ccache if found : ON
2025-08-22T14:26:12.7754543Z --   Found ccache          : CCACHE_PROGRAM-NOTFOUND
2025-08-22T14:26:12.7759895Z --   CXX flags             :  -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -DC10_NODEPRECATED -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -faligned-new -Werror -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow

@tinglvv tinglvv moved this to In Progress in PyTorch + CUDA Aug 22, 2025
@atalman (Contributor) commented Aug 25, 2025

Looks like the same errors are still present:

/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/platform/platform.h:599:33: error: ‘long4’ is deprecated: use long4_16a or long4_32a [-Werror=deprecated-declarations]
  599 | struct alignment_of<long4> {
      |                                 ^    
In file included from /usr/local/cuda-13.0/targets/x86_64-linux/include/driver_types.h:61,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/builtin_types.h:59,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/crt/host_runtime.h:58,
                 from /usr/lib/gcc/x86_64-linux-gnu/11/include/stddef.h:213:
/usr/local/cuda-13.0/targets/x86_64-linux/include/vector_types.h:530:98: note: declared here
  530 | typedef __device_builtin__ struct long4 __VECTOR_TYPE_DEPRECATED__("use long4_16a or long4_32a") long4;
      |                                                                                                  ^~~~~
[…the same deprecated vector-type errors repeat for ulong4, longlong4, ulonglong4, and double4, as in the log above…]
cc1plus: all warnings being treated as errors
sccache: Compiler killed by signal 1

@tinglvv tinglvv force-pushed the cu13-periodic-test branch from 8a63390 to 54845c4 Compare August 26, 2025 07:38
@tinglvv (Collaborator, Author) commented Aug 26, 2025

Pushed a fix to whitelist the files that include third_party/cutlass/include/cutlass/platform/platform.h.

However, the build might still fail, since I also see an error with NVSHMEM on sm_75. I believe the "update NVSHMEM to 3.3.24" PR (#161321) needs to be merged first to resolve it.

[5643/8076] Linking CXX shared library CMakeFiles/torch_nvshmem.dir/cmake_device_link.o
FAILED: CMakeFiles/torch_nvshmem.dir/cmake_device_link.o 
/opt/cache/lib/nvcc -forward-unknown-to-host-compiler -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_75,code=sm_75 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -Xfatbin -compress-all -Xcompiler -Werror -Xcompiler -Wno-error=sign-compare  -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -DC10_NODEPRECATED -O3 -DNDEBUG  -Xcompiler=-fPIC -Wno-deprecated-gpu-targets -shared -dlink caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/cuda/utils.cpp.o 

@atalman (Contributor) commented Aug 26, 2025

@pytorchmergebot rebase -b main

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased cu13-periodic-test onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout cu13-periodic-test && git pull --rebase)

@atalman (Contributor) commented Aug 26, 2025

Hi @tinglvv and @Aidyn-A, looks like the same issue is still present in https://github.com/pytorch/pytorch/actions/runs/17240342345/job/48915542390?pr=161013

I believe all .cu files in flash_attn/src (for example flash_attn/src/flash_bwd_hdim128_fp16_sm80.cu) need to be included as well.

@lakshayg (Collaborator) commented Aug 26, 2025

@tinglvv Since we can't control the third party submodules, I think it is fair to consider them SYSTEM headers. See SYSTEM keyword in https://cmake.org/cmake/help/latest/command/target_include_directories.html. That might help in suppressing the warnings from these headers.

I would have tested it myself but I can't seem to reproduce the warning you are seeing...

diff --git a/aten/src/ATen/CMakeLists.txt b/aten/src/ATen/CMakeLists.txt
index d8787154a21..bf8f262537b 100644
--- a/aten/src/ATen/CMakeLists.txt
+++ b/aten/src/ATen/CMakeLists.txt
@@ -216,7 +216,7 @@ file(GLOB mem_eff_attention_cuda_cpp "native/transformers/cuda/mem_eff_attention
 if(USE_CUDA AND (USE_FLASH_ATTENTION OR USE_MEM_EFF_ATTENTION))
   add_library(flash_attention OBJECT EXCLUDE_FROM_ALL ${flash_attention_cuda_kernels_cu} ${flash_attention_cuda_cpp})

-  target_include_directories(flash_attention PUBLIC
+  target_include_directories(flash_attention SYSTEM PUBLIC
     ${PROJECT_SOURCE_DIR}/third_party/flash-attention/csrc
     ${PROJECT_SOURCE_DIR}/third_party/flash-attention/include
     ${PROJECT_SOURCE_DIR}/third_party/cutlass/include
diff --git a/caffe2/CMakeLists.txt b/caffe2/CMakeLists.txt
index 3b7e9852a5d..f7a8e2d893a 100644
--- a/caffe2/CMakeLists.txt
+++ b/caffe2/CMakeLists.txt
@@ -1062,7 +1062,7 @@ elseif(USE_CUDA)
         UNFUSE_FMA                      # Addressing issue #121558
       )
     target_sources(torch_cuda PRIVATE $<TARGET_OBJECTS:flash_attention>)
-    target_include_directories(torch_cuda PUBLIC
+    target_include_directories(torch_cuda SYSTEM PUBLIC
       $<BUILD_INTERFACE:${PROJECT_SOURCE_DIR}/third_party/flash-attention/csrc>
       $<BUILD_INTERFACE:${PROJECT_SOURCE_DIR}/third_party/flash-attention/include>
       $<BUILD_INTERFACE:${PROJECT_SOURCE_DIR}/third_party/cutlass/include>

@pytorchmergebot (Collaborator)

Successfully rebased cu13-periodic-test onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout cu13-periodic-test && git pull --rebase)

@atalman (Contributor) commented Aug 29, 2025

This is an existing failure:
periodic / linux-jammy-cuda12.8-py3.10-gcc9-debug / test (default, 5, 7, lf.linux.4xlarge.nvidia.gpu, oncall:debug-build) (gh)
export/test_serialize 1/1 failed!

@atalman (Contributor) commented Aug 29, 2025

This one as well:
periodic / linux-jammy-cuda12.4-py3.10-gcc11 / test (legacy_nvidia_driver, 1, 5, lf.linux.4xlarge.nvidia.gpu) (gh)
dynamo/test_repros.py::ReproTests::test_dataclass_in_module

@atalman (Contributor) commented Aug 29, 2025

@pytorchmergebot merge -f "all looks good"

@tinglvv (Collaborator, Author) commented Aug 29, 2025

Errors are not related to this change:

Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/export/test_serialize.py", line 73, in <module>
    class TestSerialize(TestCase):
  File "/var/lib/jenkins/workspace/test/export/test_serialize.py", line 597, in TestSerialize
    not torch.cuda.is_available() or not has_triton(), "requires cuda and triton"
NameError: name 'has_triton' is not defined

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

@pytorchmergebot (Collaborator)

Merge failed

Reason: PR #161013 has not been reviewed yet

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@atalman (Contributor) commented Aug 29, 2025

@pytorchmergebot merge -f "all looks good"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

@github-project-automation github-project-automation bot moved this from In Progress to Done in PyTorch + CUDA Aug 29, 2025
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
@atalman atalman removed this from PyTorch + CUDA Sep 26, 2025
pytorchmergebot pushed a commit that referenced this pull request Oct 10, 2025
I think this is just a copy paste error?

NS: Introduced by #161013

Not sure where it got copied from though, the other set of no gpu tests for the other cuda version already have cpu runners
Pull Request resolved: #165183
Approved by: https://github.com/malfet
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025

Labels

ci-no-td Do not run TD on this PR ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR keep-going Don't stop on first failure, keep running tests until the end Merged open source topic: not user facing topic category triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
