
Conversation

@tinglvv (Collaborator) commented Aug 19, 2025

@tinglvv tinglvv requested review from a team and jeffdaily as code owners August 19, 2025 22:41
@pytorch-bot (bot) commented Aug 19, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161013

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure, 1 Unrelated Failure

As of commit 767c0b6 with merge base 4774208:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Aug 19, 2025
@tinglvv tinglvv mentioned this pull request Aug 19, 2025
15 tasks
@atalman atalman added ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ci-no-td Do not run TD on this PR keep-going Don't stop on first failure, keep running tests until the end labels Aug 20, 2025
@janeyx99 janeyx99 requested a review from atalman August 20, 2025 21:43
@janeyx99 janeyx99 added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Aug 20, 2025
needs: get-label-type
with:
  runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
  cuda-arch-list: 7.5
@nWEIdia (Collaborator) commented Aug 22, 2025

Similar to how you fixed the cuda 13.0 vs cuda 13 issue: it would be slightly better to use the string form '7.5' here for future upgrade purposes. If we later update 7.5 to 10.0, the unquoted value is prone to being truncated to 10, which may cause "sm10 not recognized". Using '7.5' makes a future upgrade to an "X.0" value safer by preventing truncation of the ".0". So let's make it a string.

@atalman (Contributor) commented Aug 22, 2025

@tinglvv looks like the current issue is:

/var/lib/jenkins/workspace/c10/cuda/driver_api.cpp: In function ‘void* c10::cuda::{anonymous}::get_symbol(const char*, int)’:
/var/lib/jenkins/workspace/c10/cuda/driver_api.cpp:65:40: error: ‘cudaError_t cudaGetDriverEntryPoint(const char*, void**, long long unsigned int, cudaDriverEntryPointQueryResult*)’ is deprecated [-Werror=deprecated-declarations]
   65 |   if (auto st = cudaGetDriverEntryPoint(name, &out, cudaEnableDefault, &qres);
      |                 ~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /usr/local/cuda/include/channel_descriptor.h:61,
                 from /usr/local/cuda/include/cuda_runtime.h:94,
                 from /var/lib/jenkins/workspace/c10/cuda/CUDAMiscFunctions.h:6,
                 from /var/lib/jenkins/workspace/c10/cuda/CUDAException.h:5,
                 from /var/lib/jenkins/workspace/c10/cuda/driver_api.cpp:2:
/usr/local/cuda/include/cuda_runtime_api.h:13101:57: note: declared here
13101 | extern __CUDA_DEPRECATED __host__ cudaError_t CUDARTAPI cudaGetDriverEntryPoint(const char *symbol, void **funcPtr, unsigned long long flags, enum cudaDriverEntryPointQueryResult *driverStatus = NULL);
      |                                                         ^~~~~~~~~~~~~~~~~~~~~~~
cc1plus: all warnings being treated as errors

@Aidyn-A (Collaborator) commented Aug 22, 2025

Indeed this API is deprecated in CUDA 13:

This API is deprecated and cudaGetDriverEntryPointByVersion (with a hardcoded cudaVersion) should be used instead.

All we can do is patch it:

diff --git a/c10/cuda/driver_api.cpp b/c10/cuda/driver_api.cpp
index f936b02ec9a..4b135bcce65 100644
--- a/c10/cuda/driver_api.cpp
+++ b/c10/cuda/driver_api.cpp
@@ -62,10 +62,13 @@ void* get_symbol(const char* name, int version) {
 #endif
 
   // This fallback to the old API to try getting the symbol again.
+  // As of CUDA 13, this API is deprecated.
+#if defined(CUDA_VERSION) && (CUDA_VERSION < 13000)
   if (auto st = cudaGetDriverEntryPoint(name, &out, cudaEnableDefault, &qres);
       st == cudaSuccess && qres == cudaDriverEntryPointSuccess && out) {
     return out;
   }
+#endif
 
   // If the symbol cannot be resolved, report and return nullptr;
   // the caller is responsible for checking the pointer.

@atalman atalman requested review from eqy and syed-ahmed as code owners August 22, 2025 13:48
@atalman (Contributor) commented Aug 22, 2025

@tinglvv and @Aidyn-A, looks like the next issue is:

/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/platform/platform.h:599:33: error: ‘long4’ is deprecated: use long4_16a or long4_32a [-Werror=deprecated-declarations]
  599 | struct alignment_of<long4> {
      |                                 ^    
In file included from /usr/local/cuda-13.0/targets/x86_64-linux/include/driver_types.h:61,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/builtin_types.h:59,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/crt/host_runtime.h:58,
                 from /usr/lib/gcc/x86_64-linux-gnu/11/include/stddef.h:213:
/usr/local/cuda-13.0/targets/x86_64-linux/include/vector_types.h:530:98: note: declared here
  530 | typedef __device_builtin__ struct long4 __VECTOR_TYPE_DEPRECATED__("use long4_16a or long4_32a") long4;
      |                                                                                                  ^~~~~
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/platform/platform.h:603:33: error: ‘ulong4’ is deprecated: use ulong4_16a or ulong4_32a [-Werror=deprecated-declarations]
  603 | struct alignment_of<ulong4> {
      |                                 ^     
In file included from /usr/local/cuda-13.0/targets/x86_64-linux/include/driver_types.h:61,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/builtin_types.h:59,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/crt/host_runtime.h:58,
                 from /usr/lib/gcc/x86_64-linux-gnu/11/include/stddef.h:213:
/usr/local/cuda-13.0/targets/x86_64-linux/include/vector_types.h:531:101: note: declared here
  531 | typedef __device_builtin__ struct ulong4 __VECTOR_TYPE_DEPRECATED__("use ulong4_16a or ulong4_32a") ulong4;
      |                                                                                                     ^~~~~~
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/platform/platform.h:619:33: error: ‘longlong4’ is deprecated: use longlong4_16a or longlong4_32a [-Werror=deprecated-declarations]
  619 | struct alignment_of<longlong4> {
      |                                 ^        
In file included from /usr/local/cuda-13.0/targets/x86_64-linux/include/driver_types.h:61,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/builtin_types.h:59,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/crt/host_runtime.h:58,
                 from /usr/lib/gcc/x86_64-linux-gnu/11/include/stddef.h:213:
/usr/local/cuda-13.0/targets/x86_64-linux/include/vector_types.h:548:110: note: declared here
  548 | typedef __device_builtin__ struct longlong4 __VECTOR_TYPE_DEPRECATED__("use longlong4_16a or longlong4_32a") longlong4;
      |                                                                                                              ^~~~~~~~~
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/platform/platform.h:623:33: error: ‘ulonglong4’ is deprecated: use ulonglong4_16a or ulonglong4_32a [-Werror=deprecated-declarations]
  623 | struct alignment_of<ulonglong4> {
      |                                 ^         
In file included from /usr/local/cuda-13.0/targets/x86_64-linux/include/driver_types.h:61,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/builtin_types.h:59,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/crt/host_runtime.h:58,
                 from /usr/lib/gcc/x86_64-linux-gnu/11/include/stddef.h:213:
/usr/local/cuda-13.0/targets/x86_64-linux/include/vector_types.h:549:113: note: declared here
  549 | typedef __device_builtin__ struct ulonglong4 __VECTOR_TYPE_DEPRECATED__("use ulonglong4_16a or ulonglong4_32a") ulonglong4;
      |                                                                                                                 ^~~~~~~~~~
/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/platform/platform.h:627:33: error: ‘double4’ is deprecated: use double4_16a or double4_32a [-Werror=deprecated-declarations]
  627 | struct alignment_of<double4> {
      |                                 ^      
In file included from /usr/local/cuda-13.0/targets/x86_64-linux/include/driver_types.h:61,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/builtin_types.h:59,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/crt/host_runtime.h:58,
                 from /usr/lib/gcc/x86_64-linux-gnu/11/include/stddef.h:213:
/usr/local/cuda-13.0/targets/x86_64-linux/include/vector_types.h:559:104: note: declared here
  559 | typedef __device_builtin__ struct double4 __VECTOR_TYPE_DEPRECATED__("use double4_16a or double4_32a") double4;
      |                                                      

@tinglvv (Collaborator, Author) commented Aug 22, 2025

From @Aidyn-A
"This is something we cannot patch easily, as the warning originates in CUTLASS header. They did not replace the deprecated type yet https://github.com/NVIDIA/cutlass/blob/11cad1f67b36879934ea75383d9323296b6dd45b/include/cutlass/platform/platform.h#L626-L629"

Adding the COMPILE_FLAGS -Wno-deprecated-declarations to unblock the build. @ptrblck also suggested guarding the deprecated API so it is not compiled with CUDA 13+.

@tinglvv (Collaborator, Author) commented Aug 22, 2025

Interestingly, the normal CD binary build does not hit this deprecation error - https://github.com/pytorch/pytorch/actions/runs/17147872787/job/48647552615
Its build log reads:

2025-08-22T06:38:41.8979667Z --   CMake version         : 4.1.0
2025-08-22T06:38:41.8980289Z --   CMake command         : /opt/_internal/cpython-3.10.18/lib/python3.10/site-packages/cmake/data/bin/cmake
2025-08-22T06:38:41.8980930Z --   System                : Linux
2025-08-22T06:38:41.8981360Z --   C++ compiler          : /opt/rh/gcc-toolset-13/root/usr/bin/c++
2025-08-22T06:38:41.8981821Z --   C++ compiler id       : GNU
2025-08-22T06:38:41.8982169Z --   C++ compiler version  : 13.3.1
2025-08-22T06:38:41.8982512Z --   Using ccache if found : ON
2025-08-22T06:38:41.8982897Z --   Found ccache          : CCACHE_PROGRAM-NOTFOUND
2025-08-22T06:38:41.8986831Z --   CXX flags             :  -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -DC10_NODEPRECATED -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -faligned-new -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-dangling-reference -Wno-error=dangling-reference -Wno-stringop-overflow

In the CI testing build, -Werror promotes all warnings, including deprecations, to errors. Should we use the same settings as the binary build?

2025-08-22T14:26:12.7750481Z --   CMake version         : 4.0.0
2025-08-22T14:26:12.7751272Z --   CMake command         : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/cmake/data/bin/cmake
2025-08-22T14:26:12.7752112Z --   System                : Linux
2025-08-22T14:26:12.7752629Z --   C++ compiler          : /opt/cache/bin/c++
2025-08-22T14:26:12.7753156Z --   C++ compiler id       : GNU
2025-08-22T14:26:12.7753612Z --   C++ compiler version  : 11.4.0
2025-08-22T14:26:12.7754079Z --   Using ccache if found : ON
2025-08-22T14:26:12.7754543Z --   Found ccache          : CCACHE_PROGRAM-NOTFOUND
2025-08-22T14:26:12.7759895Z --   CXX flags             :  -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -DC10_NODEPRECATED -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -faligned-new -Werror -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow

@tinglvv tinglvv moved this to In Progress in PyTorch + CUDA Aug 22, 2025
@atalman (Contributor) commented Aug 25, 2025

Looks like the same errors are still present:

/var/lib/jenkins/workspace/third_party/cutlass/include/cutlass/platform/platform.h:599:33: error: ‘long4’ is deprecated: use long4_16a or long4_32a [-Werror=deprecated-declarations]
  599 | struct alignment_of<long4> {
      |                                 ^    
In file included from /usr/local/cuda-13.0/targets/x86_64-linux/include/driver_types.h:61,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/builtin_types.h:59,
                 from /usr/local/cuda/bin/../targets/x86_64-linux/include/crt/host_runtime.h:58,
                 from /usr/lib/gcc/x86_64-linux-gnu/11/include/stddef.h:213:
/usr/local/cuda-13.0/targets/x86_64-linux/include/vector_types.h:530:98: note: declared here
  530 | typedef __device_builtin__ struct long4 __VECTOR_TYPE_DEPRECATED__("use long4_16a or long4_32a") long4;
      |                                                                                                  ^~~~~
[…the same deprecated vector-type errors repeat for ulong4, longlong4, ulonglong4, and double4, as in the log above…]
cc1plus: all warnings being treated as errors
sccache: Compiler killed by signal 1

@tinglvv tinglvv force-pushed the cu13-periodic-test branch from 8a63390 to 54845c4 Compare August 26, 2025 07:38
@tinglvv (Collaborator, Author) commented Aug 26, 2025

Pushed a fix to whitelist the files that include third_party/cutlass/include/cutlass/platform/platform.h.

However, the build might still fail, since I also see an error with NVSHMEM on sm_75. I believe the "update NVSHMEM to 3.3.24" PR (#161321) needs to be merged first to resolve it.

[5643/8076] Linking CXX shared library CMakeFiles/torch_nvshmem.dir/cmake_device_link.o
FAILED: CMakeFiles/torch_nvshmem.dir/cmake_device_link.o 
/opt/cache/lib/nvcc -forward-unknown-to-host-compiler -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_75,code=sm_75 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -Xfatbin -compress-all -Xcompiler -Werror -Xcompiler -Wno-error=sign-compare  -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -DC10_NODEPRECATED -O3 -DNDEBUG  -Xcompiler=-fPIC -Wno-deprecated-gpu-targets -shared -dlink caffe2/CMakeFiles/torch_nvshmem.dir/__/torch/csrc/distributed/c10d/cuda/utils.cpp.o 

@atalman (Contributor) commented Aug 26, 2025

@pytorchmergebot rebase -b main

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot (Collaborator)

Successfully rebased cu13-periodic-test onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout cu13-periodic-test && git pull --rebase)

@atalman (Contributor) commented Aug 26, 2025

Hi @tinglvv and @Aidyn-A, looks like the same issue is still present in https://github.com/pytorch/pytorch/actions/runs/17240342345/job/48915542390?pr=161013

I believe all .cu files in flash_attn/src (for example flash_attn/src/flash_bwd_hdim128_fp16_sm80.cu) need to be included as well.

@lakshayg (Collaborator) commented Aug 26, 2025

@tinglvv Since we can't control the third party submodules, I think it is fair to consider them SYSTEM headers. See SYSTEM keyword in https://cmake.org/cmake/help/latest/command/target_include_directories.html. That might help in suppressing the warnings from these headers.

I would have tested it myself but I can't seem to reproduce the warning you are seeing...

diff --git a/aten/src/ATen/CMakeLists.txt b/aten/src/ATen/CMakeLists.txt
index d8787154a21..bf8f262537b 100644
--- a/aten/src/ATen/CMakeLists.txt
+++ b/aten/src/ATen/CMakeLists.txt
@@ -216,7 +216,7 @@ file(GLOB mem_eff_attention_cuda_cpp "native/transformers/cuda/mem_eff_attention
 if(USE_CUDA AND (USE_FLASH_ATTENTION OR USE_MEM_EFF_ATTENTION))
   add_library(flash_attention OBJECT EXCLUDE_FROM_ALL ${flash_attention_cuda_kernels_cu} ${flash_attention_cuda_cpp})

-  target_include_directories(flash_attention PUBLIC
+  target_include_directories(flash_attention SYSTEM PUBLIC
     ${PROJECT_SOURCE_DIR}/third_party/flash-attention/csrc
     ${PROJECT_SOURCE_DIR}/third_party/flash-attention/include
     ${PROJECT_SOURCE_DIR}/third_party/cutlass/include
diff --git a/caffe2/CMakeLists.txt b/caffe2/CMakeLists.txt
index 3b7e9852a5d..f7a8e2d893a 100644
--- a/caffe2/CMakeLists.txt
+++ b/caffe2/CMakeLists.txt
@@ -1062,7 +1062,7 @@ elseif(USE_CUDA)
         UNFUSE_FMA                      # Addressing issue #121558
       )
     target_sources(torch_cuda PRIVATE $<TARGET_OBJECTS:flash_attention>)
-    target_include_directories(torch_cuda PUBLIC
+    target_include_directories(torch_cuda SYSTEM PUBLIC
       $<BUILD_INTERFACE:${PROJECT_SOURCE_DIR}/third_party/flash-attention/csrc>
       $<BUILD_INTERFACE:${PROJECT_SOURCE_DIR}/third_party/flash-attention/include>
       $<BUILD_INTERFACE:${PROJECT_SOURCE_DIR}/third_party/cutlass/include>

@pytorchmergebot (Collaborator)

Successfully rebased cu13-periodic-test onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout cu13-periodic-test && git pull --rebase)

@atalman (Contributor) commented Aug 29, 2025

This is an existing failure:
periodic / linux-jammy-cuda12.8-py3.10-gcc9-debug / test (default, 5, 7, lf.linux.4xlarge.nvidia.gpu, oncall:debug-build) (gh)
export/test_serialize 1/1 failed!

@atalman (Contributor) commented Aug 29, 2025

This one as well:
periodic / linux-jammy-cuda12.4-py3.10-gcc11 / test (legacy_nvidia_driver, 1, 5, lf.linux.4xlarge.nvidia.gpu) (gh)
dynamo/test_repros.py::ReproTests::test_dataclass_in_module

@atalman (Contributor) commented Aug 29, 2025

@pytorchmergebot merge -f "all looks good"

@tinglvv (Collaborator, Author) commented Aug 29, 2025

Errors are not related to this change:

Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/export/test_serialize.py", line 73, in <module>
    class TestSerialize(TestCase):
  File "/var/lib/jenkins/workspace/test/export/test_serialize.py", line 597, in TestSerialize
    not torch.cuda.is_available() or not has_triton(), "requires cuda and triton"
NameError: name 'has_triton' is not defined

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

@pytorchmergebot (Collaborator)

Merge failed

Reason: PR #161013 has not been reviewed yet

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@atalman (Contributor) commented Aug 29, 2025

@pytorchmergebot merge -f "all looks good"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).

@github-project-automation github-project-automation bot moved this from In Progress to Done in PyTorch + CUDA Aug 29, 2025
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
@atalman atalman removed this from PyTorch + CUDA Sep 26, 2025
pytorchmergebot pushed a commit that referenced this pull request Oct 10, 2025
I think this is just a copy paste error?

NS: Introduced by #161013

Not sure where it got copied from though, the other set of no gpu tests for the other cuda version already have cpu runners
Pull Request resolved: #165183
Approved by: https://github.com/malfet
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Oct 21, 2025

Labels

ci-no-td Do not run TD on this PR ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR keep-going Don't stop on first failure, keep running tests until the end Merged open source topic: not user facing topic category triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
