Implements user buffer registration using MemPool #133603
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133603. Note: links to docs will display an error until the docs builds have completed.
❗ 1 Active SEV: there is 1 currently active SEV; if your PR is affected, please review it.
✅ No failures as of commit 6a3235e with merge base a440a01. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
kwen2501 left a comment:
LGTM. Thanks for the great feature.
I left some comments re API naming, etc.
```cpp
void registerUserBuffers(at::Device device);

void deregisterUserBuffers(at::Device device);
```
kwen2501: Do you mind adding descriptions to these two public methods?
kwen2501: Re the names of the APIs: after reading the API implementation, it seems that we are registering all segments of the context mem pool. So I wonder if "UserBuffers" in the API names is from the NCCL perspective rather than the PyTorch perspective? Perhaps we can name them something like registerMemPool?
kwen2501: Re the argument to the API: any preference between passing a device or a memPool? If accepting a device, it implies that the API will get the mem pool from the current context?
syed-ahmed: Renamed these functions and added descriptions. Changed the input from device to pool.
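From the caller's perspective, the change looks roughly like the sketch below; the Python binding names are taken from the discussion further down in this thread and may not match the final API exactly.

```python
# Before: registration was keyed on a device, implicitly using the
# mem pool associated with the current context.
backend.register_user_buffers(device)

# After: the MemPool to register is passed explicitly.
backend.register_mem_pool(pool)
```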
```python
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "NVLS"
os.environ["NCCL_DEBUG_FILE"] = nccl_debug_file.name
```
kwen2501: Nit: would these three debug flags still be needed when the test runs in CI?
syed-ahmed: Yes, these debug flags dump NVLS-related printouts from NCCL into a file. That's how I test whether NVLS is actually being used or not. I don't think there is a runtime API in CUDA right now to check whether NVLS is being used.
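As a rough sketch of how such a check could look (this is not the exact test code; `nvls_was_used` is a hypothetical helper, and the path would come from the NCCL_DEBUG_FILE setting above):

```python
def nvls_was_used(nccl_debug_file_path: str) -> bool:
    # With NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=NVLS, NCCL writes
    # NVLS-related lines into the debug file; their presence indicates
    # the NVLS path was actually exercised.
    with open(nccl_debug_file_path) as f:
        return any("NVLS" in line for line in f)
```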
test/distributed/test_c10d_nccl.py (outdated):

```python
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "NVLS"
os.environ["NCCL_DEBUG_FILE"] = nccl_debug_file.name
os.environ["ALLOCATOR_PATH"] = self.createNcclAllocator()
```
kwen2501: I wonder where the ALLOCATOR_PATH env is consumed? By the torch library?
test/distributed/test_c10d_nccl.py (outdated):

```python
pg = c10d.distributed_c10d._get_default_group()
backend = pg._get_backend(torch.device(device))
allocator = torch.cuda.memory.CUDAPluggableAllocator(
    os.environ["ALLOCATOR_PATH"],
```
kwen2501: I see, the env is consumed here. Can we just pass it as self.allocator_path?
syed-ahmed: Changed. Now using a local allocator_path variable in the test (rather than self.allocator_path).
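A minimal sketch of the revised wiring, assuming the allocator .so is still built by the test's createNcclAllocator helper; the alloc/free symbol names below are placeholders, not the actual names used in the test:

```python
allocator_path = self.createNcclAllocator()  # builds the allocator .so and returns its path
allocator = torch.cuda.memory.CUDAPluggableAllocator(
    allocator_path,
    "nccl_alloc_plug",  # placeholder: alloc entry point exported by the .so
    "nccl_free_plug",   # placeholder: free entry point exported by the .so
)
```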
test/distributed/test_c10d_nccl.py (outdated):

```python
# clean up memory
pool.release()
del tensor
pool.empty_cache()
```
kwen2501: Curious: are these lines required for users?
syed-ahmed: Deleted these lines. Now only `del tensor, pool` is needed to reclaim the memory.
test/distributed/test_c10d_nccl.py (outdated):

```python
backend.register_user_buffers(device)
pg.allreduce(tensor).wait()
torch.cuda.synchronize(device=device)
backend.deregister_user_buffers(device)
```
kwen2501: See my other comment re the API name. Would something like register_mem_pool be more accurate?
kwen2501: And, is it possible to do:
```python
pool = torch.cuda.MemPool(allocator.allocator())
backend.register_mem_pool(pool)
with torch.cuda.use_mem_pool(pool):
    ...
backend.deregister_mem_pool(pool)
pool.release()
pool.empty_cache()
```
syed-ahmed: Now it is possible to do the above. Note that registration has to come after the memory has been allocated into the pool, because of how registerMemPool is written: registerMemPool finds all the blocks currently in a pool and then calls ncclCommRegister on them. So if it is called before any allocation, it registers an empty pool.
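A minimal sketch of the resulting flow with that ordering constraint applied (binding names follow the proposal above and may differ from the final API):

```python
# Allocate into the pool first, so that registration sees the pool's blocks.
pool = torch.cuda.MemPool(allocator.allocator())
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)

# Register after allocation: the backend walks the pool's blocks and
# calls ncclCommRegister on each of them.
backend.register_mem_pool(pool)
pg.allreduce(tensor).wait()
torch.cuda.synchronize(device=device)
backend.deregister_mem_pool(pool)

# Dropping the tensor and the pool reclaims the memory.
del tensor, pool
```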
```cpp
LOG(INFO) << logPrefix()
          << "Performing user buffer registration on backend device "
          << device << ", key " << key << ", i am " << this;
auto ncclComm = getNCCLComm(key, device, OpType::ALLREDUCE);
```
kwen2501: Do you mind leaving a HACK note here? For enabling NVLS, using OpType::ALLREDUCE is fine. I don't know whether we would reuse the same API to register buffers for zero-copy P2P; if we do, at least we would know there is a hack that needs changing.
syed-ahmed: The hack is not needed anymore, since getNCCLComm can now get the communicator without an OpType.
syed-ahmed: Never mind, OpType::ALLREDUCE is still needed to initialize comms. Left the note about the HACK.
@syed-ahmed, apologies for the revert; we needed it to unblock the revert of #140087. Please rebase and reland at your convenience. Thanks!
…3603)" This reverts commit 25d9be3. Reverted pytorch#133603 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](pytorch#133603 (comment)))
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This PR implements user buffer registration and demonstrates NVLink SHARP (NVLS) reductions using a combination of allocating special memory with MemPool and registering it with the NCCL buffer registration APIs. Part of pytorch#124807.

Pull Request resolved: pytorch#133603
Approved by: https://github.com/kwen2501, https://github.com/eqy
This PR implements a small UI improvement over #133603. It prepares an NCCL memory allocator in torch cpp and then pybinds it out, so that users can use it directly. UI:

```python
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
```

Pull Request resolved: #145675
Approved by: https://github.com/syed-ahmed, https://github.com/wconstab
This PR implements user buffer registration and demonstrates NVLink SHARP (NVLS) reductions using a combination of allocating special memory with MemPool and registering it with the NCCL buffer registration APIs.
Part of #124807.
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o