[CI] Add basic CUDA 13.0 periodic test #161013
🔗 Helpful links: see artifacts and rendered test results at hud.pytorch.org/pr/161013. Note: links to docs will display an error until the doc builds have completed.
❗ 1 active SEV: there is 1 currently active SEV; if your PR is affected, please view it below.
❌ 1 new failure, 1 unrelated failure as of commit 767c0b6 with merge base 4774208.
NEW FAILURE — the following job has failed:
FLAKY — the following job failed, but likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```yaml
needs: get-label-type
with:
  runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
  cuda-arch-list: 7.5
```
Similar to how you fixed the cuda 13.0 vs cuda 13 issue: it would be slightly better to use the string form '7.5' here for future-upgrade purposes. For example, if we later bump 7.5 to 10.0, we would be prone to writing an unquoted 10.0, which may be truncated and cause a "sm10 not recognized" error. Using '7.5' makes a future upgrade to "X.0" safer by preventing truncation of the ".0", so let's make it a string.
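The quoting concern above comes from how YAML resolves plain (unquoted) scalars: an unquoted `7.5` is a number, while `'7.5'` is a string. Below is a minimal, illustrative Python stand-in for that resolution rule (not a real YAML parser) showing where the ".0" risk enters: once a value is a number, round-tripping tooling may drop a trailing ".0".

```python
import re

def yaml_scalar_type(token: str) -> str:
    # Rough stand-in for YAML plain-scalar resolution (illustrative only):
    # quoted tokens are always strings; unquoted numeric-looking tokens
    # become numbers, which is where a trailing ".0" can later be dropped.
    if token.startswith(("'", '"')):
        return "str"
    if re.fullmatch(r"-?\d+", token):
        return "int"
    if re.fullmatch(r"-?\d+\.\d+", token):
        return "float"
    return "str"

assert yaml_scalar_type("7.5") == "float"   # unquoted: parsed as a number
assert yaml_scalar_type("'7.5'") == "str"   # quoted: stays a string
assert yaml_scalar_type("10") == "int"      # what a truncated "10.0" becomes
```

Quoting the value sidesteps the type resolution entirely, so a future bump to any "X.0" arch stays a string end to end.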
@tinglvv looks like the current issue is:
Indeed this API is deprecated in CUDA 13:
All we can do is to patch it:

```diff
diff --git a/c10/cuda/driver_api.cpp b/c10/cuda/driver_api.cpp
index f936b02ec9a..4b135bcce65 100644
--- a/c10/cuda/driver_api.cpp
+++ b/c10/cuda/driver_api.cpp
@@ -62,10 +62,13 @@ void* get_symbol(const char* name, int version) {
 #endif
   // This fallback to the old API to try getting the symbol again.
+  // As of CUDA 13, this API is deprecated.
+#if defined(CUDA_VERSION) && (CUDA_VERSION < 13000)
   if (auto st = cudaGetDriverEntryPoint(name, &out, cudaEnableDefault, &qres);
       st == cudaSuccess && qres == cudaDriverEntryPointSuccess && out) {
     return out;
   }
+#endif
   // If the symbol cannot be resolved, report and return nullptr;
   // the caller is responsible for checking the pointer.
```
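The control flow the patch produces can be summarized with a small Python sketch (names and return values are illustrative, not the real CUDA API): try the current entry-point lookup first, and only consult the legacy fallback on toolkits where it still exists, mirroring the `#if CUDA_VERSION < 13000` preprocessor guard with a runtime check.

```python
def get_symbol(name, cuda_version, new_api, old_api):
    # Illustrative sketch of the patched get_symbol() control flow:
    # 1) try the current entry-point API;
    # 2) fall back to the legacy API only when the toolkit predates its
    #    deprecation (CUDA < 13.0, i.e. version < 13000);
    # 3) otherwise return None -- the caller checks for a null result.
    sym = new_api(name)
    if sym is not None:
        return sym
    if cuda_version < 13000:  # mirrors `#if CUDA_VERSION < 13000`
        return old_api(name)
    return None

# Hypothetical resolvers: the "new" API fails, the legacy table succeeds.
legacy = {"cuMemCreate": 0xDEAD}
failing_new_api = lambda n: None

assert get_symbol("cuMemCreate", 12080, failing_new_api, legacy.get) == 0xDEAD
assert get_symbol("cuMemCreate", 13000, failing_new_api, legacy.get) is None
```

On CUDA 13+, the fallback branch is simply compiled out, so the deprecated symbol is never referenced.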
@tinglvv and @Aidyn-A looks like the next issue:
From @Aidyn-A: adding COMPILE_FLAGS -Wno-deprecated-declarations to unblock the build. @ptrblck also suggested guarding the deprecated API so it is not compiled with CUDA 13+.
Interestingly, the normal CD binary build does not hit this deprecation error: https://github.com/pytorch/pytorch/actions/runs/17147872787/job/48647552615. In the CI test build, -Werror promotes all warnings, including deprecations, to errors. Should we use the same settings as the binary build?
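The interaction between -Werror and the proposed -Wno-deprecated-declarations can be modeled with a few lines of Python. This is a simplified sketch of GCC/Clang flag semantics (later flags win; `-Wno-error=<warn>` demotes a single diagnostic back to a warning without disabling it), not a real compiler driver.

```python
def warning_is_fatal(warning: str, flags: list[str]) -> bool:
    # Simplified GCC/Clang-style semantics: flags are processed left to
    # right and later flags win. Plain -Werror promotes every warning to
    # an error; -Wno-error=<warn> demotes that one diagnostic back to a
    # warning; -Werror=<warn> promotes just that diagnostic.
    fatal = False
    for flag in flags:
        if flag == "-Werror" or flag == f"-Werror={warning}":
            fatal = True
        elif flag == f"-Wno-error={warning}":
            fatal = False
    return fatal

# CI test build: everything fatal.
assert warning_is_fatal("deprecated-declarations", ["-Werror"])
# Targeted demotion keeps the warning visible but non-fatal.
assert not warning_is_fatal(
    "deprecated-declarations",
    ["-Werror", "-Wno-error=deprecated-declarations"],
)
```

Under this model, a per-diagnostic `-Wno-error=deprecated-declarations` is a narrower unblock than dropping -Werror wholesale, since every other warning class stays fatal.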
Looks like similar errors remain:
Force-pushed 8a63390 to 54845c4.
Pushed a fix to whitelist the files that include … However, the build might still fail: I also see the NVSHMEM error on sm_75. I believe the NVSHMEM 3.3.24 update PR (#161321) needs to be merged first to resolve that error.
@pytorchmergebot rebase -b main
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here.
Successfully rebased |
Force-pushed 54845c4 to 96c09a7.
Hi @tinglvv and @Aidyn-A, looks like the same issue is still in https://github.com/pytorch/pytorch/actions/runs/17240342345/job/48915542390?pr=161013. I believe all cu files in …
@tinglvv Since we can't control the third-party submodules, I think it is fair to consider them SYSTEM headers. See the SYSTEM keyword in https://cmake.org/cmake/help/latest/command/target_include_directories.html. That might help in suppressing the warnings from these headers. I would have tested it myself, but I can't seem to reproduce the warning you are seeing...

```diff
diff --git a/aten/src/ATen/CMakeLists.txt b/aten/src/ATen/CMakeLists.txt
index d8787154a21..bf8f262537b 100644
--- a/aten/src/ATen/CMakeLists.txt
+++ b/aten/src/ATen/CMakeLists.txt
@@ -216,7 +216,7 @@ file(GLOB mem_eff_attention_cuda_cpp "native/transformers/cuda/mem_eff_attention
 if(USE_CUDA AND (USE_FLASH_ATTENTION OR USE_MEM_EFF_ATTENTION))
   add_library(flash_attention OBJECT EXCLUDE_FROM_ALL ${flash_attention_cuda_kernels_cu} ${flash_attention_cuda_cpp})
-  target_include_directories(flash_attention PUBLIC
+  target_include_directories(flash_attention SYSTEM PUBLIC
     ${PROJECT_SOURCE_DIR}/third_party/flash-attention/csrc
     ${PROJECT_SOURCE_DIR}/third_party/flash-attention/include
     ${PROJECT_SOURCE_DIR}/third_party/cutlass/include
diff --git a/caffe2/CMakeLists.txt b/caffe2/CMakeLists.txt
index 3b7e9852a5d..f7a8e2d893a 100644
--- a/caffe2/CMakeLists.txt
+++ b/caffe2/CMakeLists.txt
@@ -1062,7 +1062,7 @@ elseif(USE_CUDA)
     UNFUSE_FMA # Addressing issue #121558
   )
   target_sources(torch_cuda PRIVATE $<TARGET_OBJECTS:flash_attention>)
-  target_include_directories(torch_cuda PUBLIC
+  target_include_directories(torch_cuda SYSTEM PUBLIC
     $<BUILD_INTERFACE:${PROJECT_SOURCE_DIR}/third_party/flash-attention/csrc>
     $<BUILD_INTERFACE:${PROJECT_SOURCE_DIR}/third_party/flash-attention/include>
     $<BUILD_INTERFACE:${PROJECT_SOURCE_DIR}/third_party/cutlass/include>
```
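The reason the SYSTEM keyword helps: CMake passes SYSTEM include directories to GCC/Clang as `-isystem` rather than `-I`, and those compilers suppress warnings that originate in headers found under `-isystem` paths. A tiny Python sketch of that flag translation (paths are illustrative, not the real build graph):

```python
def include_flags(dirs, system=False):
    # Mimics how CMake emits include directories for GCC/Clang in this
    # example: plain directories become -I<dir>, SYSTEM directories become
    # -isystem <dir>, which silences warnings coming from those headers.
    prefix = "-isystem " if system else "-I"
    return [prefix + d for d in dirs]

assert include_flags(["third_party/cutlass/include"]) == \
    ["-Ithird_party/cutlass/include"]
assert include_flags(["third_party/cutlass/include"], system=True) == \
    ["-isystem third_party/cutlass/include"]
```

So flipping flash-attention/cutlass to SYSTEM should quiet the deprecation warnings emitted from those headers while leaving first-party code still subject to -Werror.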
Successfully rebased. Force-pushed 011deae to 767c0b6.
This is an existing failure:
This one as well:
@pytorchmergebot merge -f "all looks good"
The errors are not related to this change:
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: PR #161013 has not been reviewed yet.
@pytorchmergebot merge -f "all looks good"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
pytorch#159779 Pull Request resolved: pytorch#161013 Approved by: https://github.com/atalman Co-authored-by: Andrey Talman <[email protected]> Co-authored-by: Aidyn-A <[email protected]>
I think this is just a copy-paste error? NS: introduced by #161013. Not sure where it got copied from, though; the other set of no-gpu tests for the other CUDA version already has CPU runners. Pull Request resolved: #165183. Approved by: https://github.com/malfet
#159779
cc @atalman