
Conversation

@kwen2501 kwen2501 commented Nov 22, 2024


pytorch-bot bot commented Nov 22, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/141374

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 94348a1 with merge base 740d1eb:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue topic: not user facing topic category labels Nov 22, 2024
A new store is used to recreate PGs upon restart. Here are the steps to achieve a new store in CI:
- Rank 0 creates a temp file and shares the file name via the old store.
- Non-zero ranks get the file name via the old store.
- All ranks create a new store backed by that file.
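A minimal sketch of these steps, with a hypothetical `share_new_store_file` util and a dict-backed stand-in store (the real code would use a c10d store such as `torch.distributed.FileStore`):

```python
import os
import tempfile

def share_new_store_file(old_store, rank, key="new_store_file"):
    # Hypothetical util name; sketches the steps above.
    if rank == 0:
        # Rank 0 creates a temp file and shares its name via the old store.
        fd, path = tempfile.mkstemp()
        os.close(fd)
        old_store.set(key, path)
    else:
        # Non-zero ranks read the file name back from the old store.
        path = old_store.get(key)
    # Every rank would now build its new store from `path`,
    # e.g. torch.distributed.FileStore(path, world_size).
    return path

class FakeStore:
    # Minimal stand-in for a c10d store, just for illustration.
    def __init__(self):
        self._kv = {}
    def set(self, key, value):
        self._kv[key] = value
    def get(self, key):
        return self._kv[key]
```

Since all ranks resolve the same path, the new store each rank constructs from it is shared.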

There was existing code for the above steps; this PR just extracts it into a util function and shares it with other restart tests.

Also moved restart tests together under `NcclErrorHandlingTest` class.

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Nov 22, 2024
ghstack-source-id: 34eb3bb
Pull Request resolved: #141374
@kwen2501 kwen2501 changed the title [c10d][CI] Add new-file util for PG restart tests [c10d][CI] Use new store for PG restart tests Nov 22, 2024
pytorchmergebot pushed a commit that referenced this pull request Nov 25, 2024
…calls (#141270)

### Motivation
`ncclCommInitRank` needs a GPU guard (documented by NCCL).

`ncclCommAbort`, `ncclCommFinalize` and `ncclCommDestroy` may also need a GPU guard (undocumented by NCCL); otherwise, an extra CUDA context may be created (or worse, a hang). Both effects have been seen before in our tests.

### Solution
This PR records a device index during `NCCLComm` object creation, so that we can add a GPU guard in the `NCCLComm` methods that call into the above NCCL APIs.

### Note
This is not a bug fix, just a safety improvement.

Pull Request resolved: #141270
Approved by: https://github.com/eqy
ghstack dependencies: #141374
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
A new Store is used to recreate PGs upon restart. The new Store is achieved by adding a prefix.
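The prefix idea in miniature (c10d provides a real `torch.distributed.PrefixStore`; the classes below are illustrative stand-ins): namespacing keys lets the restarted PGs reuse the old store without colliding with stale keys.

```python
class DictStore:
    # Minimal stand-in for a c10d store.
    def __init__(self):
        self._kv = {}
    def set(self, key, value):
        self._kv[key] = value
    def get(self, key):
        return self._kv[key]

class PrefixStore:
    # Wraps an existing store; every key is namespaced by the prefix,
    # so each restart attempt sees a logically fresh store.
    def __init__(self, prefix, store):
        self._prefix = prefix
        self._store = store
    def set(self, key, value):
        self._store.set(f"{self._prefix}/{key}", value)
    def get(self, key):
        return self._store.get(f"{self._prefix}/{key}")
```

Two `PrefixStore` wrappers over the same base store can set the same key without interfering with each other.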

Pull Request resolved: pytorch#141374
Approved by: https://github.com/fduwjj
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
…calls (pytorch#141270)

@github-actions github-actions bot deleted the gh/kwen2501/102/head branch December 26, 2024 02:04

Labels

Merged oncall: distributed Add this issue/PR to distributed oncall triage queue topic: not user facing topic category
