Skip to content

Conversation

@pritamdamania87
Copy link
Contributor

Summary:
ncclCommAbort() and ncclGetAsyncError() were two APIs added in NCCL
2.4 to detect errors in NCCL communicators. These were used as part of
ProcesGroupNCCL and we also enforced that only NCCL versions 2.4+ were
supported. Although, there is still legitimate use for older NCCL versions and
hence we should still support those.

For that purpose, in this change I've ensured we disable NCCL error checking
for versions < 2.4.

Test Plan:

  1. Test with 2.4.8
  2. Test with 2.2.13
  3. unit tests.

Differential Revision: D17178988

@pytorchbot pytorchbot added oncall: distributed Add this issue/PR to distributed oncall triage queue module: nccl Problems related to nccl support labels Sep 4, 2019
Copy link
Contributor

@pietern pietern left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this impact the watchdog thread? Will it effectively be a big nop now?

LGTM otherwise.

@pritamdamania87
Copy link
Contributor Author

How does this impact the watchdog thread? Will it effectively be a big nop now?

Ensured we don't start this thread when error checking is disabled.

Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks! Let's make sure all CI pass before landing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This caused build failure:

Sep 05 20:06:57 CMakeFiles/ProcessGroupNCCLErrorsTest.dir/ProcessGroupNCCLErrorsTest.cpp.o: In function `ProcessGroupNCCLErrorsTest_testNCCLErrorsNonBlocking_Test::TestBody()':
Sep 05 20:06:57 ProcessGroupNCCLErrorsTest.cpp:(.text+0x4f1): undefined reference to `torch::cuda::nccl::version()'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These files don't link against all of libtorch. It's better to use the macros here.

@pytorchbot pytorchbot added the module: build Build system issues label Sep 10, 2019
@jithunnair-amd
Copy link
Collaborator

Is this PR just waiting on a conflict resolution at this point?

@pritamdamania87
Copy link
Contributor Author

@jithunnair-amd A few CI builds are still failing, I need to fix them.

…25634)

Summary:
Pull Request resolved: #25634

ncclCommAbort() and ncclGetAsyncError() were two APIs added in NCCL
2.4 to detect errors in NCCL communicators. These were used as part of
ProcesGroupNCCL and we also enforced that only NCCL versions 2.4+ were
supported. Although, there is still legitimate use for older NCCL versions and
hence we should still support those.

For that purpose, in this change I've ensured we disable NCCL error checking
for versions < 2.4.

Test Plan:
1) Test with 2.4.8
2) Test with 2.2.13
3) unit tests.

Differential Revision: D17178988

fbshipit-source-id: 757d4cf7c85dbfab804341b8b96fb480c4dda00b
@jithunnair-amd
Copy link
Collaborator

@pritamdamania87 Will this PR be merged soon?

@pritamdamania87
Copy link
Contributor Author

@jithunnair-amd Apologize for the delay on this, I was tied up with a few other PRs. I spent some time this week trying to fix the build issue but haven't made much progress on that front. Will give it another shot soon when I have some free cycles.

@pritamdamania87
Copy link
Contributor Author

Abandoning in favor of #27124 (which was exported using ghexport).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

module: build Build system issues module: nccl Problems related to nccl support oncall: distributed Add this issue/PR to distributed oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants