Commit 9234f50

lw authored and facebook-github-bot committed
Make WorkNCCL use CUDAEvent::query() rather than re-implement it (#49343)
Summary:
Pull Request resolved: #49343

at::cuda::CUDAEvent is "lazy" and only creates an event when it's first recorded. Until then, at::cuda::CUDAEvent is empty. If we use at::cuda::CUDAEvent::query() this is taken into account (an empty event is always ready), but WorkNCCL extracts the raw cudaEvent_t value from at::cuda::CUDAEvent and calls cudaEventQuery manually and doesn't check this. This could cause a failure.

It's unclear if this is ever supposed to happen, but we're seeing that failure, and we want to sort it out in order to see if there's something "deeper" going on.

ghstack-source-id: 118532806

Test Plan: Unit tests

Reviewed By: SciPioneer

Differential Revision: D25537844

fbshipit-source-id: 506319f4742e1c0a02aa75ecc01112ea3be42d8f
1 parent 5a5e576 commit 9234f50

1 file changed, +1 -5 lines changed


torch/lib/c10d/ProcessGroupNCCL.cpp

Lines changed: 1 addition & 5 deletions
@@ -308,11 +308,7 @@ bool ProcessGroupNCCL::WorkNCCL::finishedGPUExecution() {
 bool ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const {
   for (size_t i = 0; i < devices_.size(); ++i) {
     // Checking the work's corresponding CUDA events' status
-    auto ret = cudaEventQuery((*cudaEvents_)[i]);
-    if (ret != cudaSuccess && ret != cudaErrorNotReady) {
-      AT_CUDA_CHECK(ret);
-    }
-    if (ret == cudaErrorNotReady) {
+    if (!(*cudaEvents_)[i].query()) {
       return false;
     }
   }
