
Conversation

@syed-ahmed (Collaborator) commented Aug 15, 2024

This PR implements user buffer registration and demonstrates NVLink Sharp (NVLS) reductions using a combination of allocating special memory with MemPool and registering it with the NCCL buffer registration APIs.

Part of #124807.
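
Below is a minimal end-to-end sketch of the flow this PR enables, assuming the register_mem_pool / deregister_mem_pool naming discussed later in this thread; the allocator library path and its exported entry-point names are illustrative assumptions, not part of this PR:

```
import os

import torch
import torch.distributed as c10d

# Assumes the default NCCL process group is already initialized, and that
# ALLOCATOR_PATH points to a compiled allocator library whose alloc/free
# entry-point names are as given below (both are illustrative assumptions).
device = torch.device("cuda", torch.cuda.current_device())
allocator = torch.cuda.memory.CUDAPluggableAllocator(
    os.environ["ALLOCATOR_PATH"], "nccl_alloc_plug", "nccl_free_plug"
)

pg = c10d.distributed_c10d._get_default_group()
backend = pg._get_backend(device)

# Allocate into a dedicated pool backed by the pluggable allocator...
pool = torch.cuda.MemPool(allocator.allocator())
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)

# ...then register the pool with NCCL so the allreduce can use NVLS.
backend.register_mem_pool(pool)
pg.allreduce(tensor).wait()
torch.cuda.synchronize(device=device)
backend.deregister_mem_pool(pool)

del tensor, pool  # reclaims the pool's memory
```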

Stack from ghstack (oldest at bottom):

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot bot commented Aug 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133603

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 6a3235e with merge base a440a01:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

syed-ahmed added a commit that referenced this pull request Aug 16, 2024
ghstack-source-id: 3464c7f
Pull Request resolved: #133603
syed-ahmed added a commit that referenced this pull request Aug 27, 2024
ghstack-source-id: 9f253d3
Pull Request resolved: #133603
syed-ahmed added a commit that referenced this pull request Aug 29, 2024
ghstack-source-id: e615552
Pull Request resolved: #133603

@kwen2501 (Collaborator) left a comment


LGTM. Thanks for the great feature.
I left some comments re API naming, etc.

Comment on lines 693 to 696
void registerUserBuffers(at::Device device);

void deregisterUserBuffers(at::Device device);


Collaborator

Do you mind adding descriptions to these two public methods?

@kwen2501 (Collaborator) Sep 24, 2024

Re the name of the APIs:
After reading the API impl, it seems that we are registering all segments of the context mem pool. So I wonder if "UserBuffers" in the API names is from the NCCL perspective rather than the PyTorch perspective? Perhaps we can name them something like registerMemPool?

Collaborator

Re the argument to the API:
any preference between passing a device or a memPool?
If accepting a device, it implies that the API will get the mem pool from the current context?

Collaborator Author

Renamed these functions and added descriptions. Changed the input from device to pool.

Comment on lines +2636 to +2638
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "NVLS"
os.environ["NCCL_DEBUG_FILE"] = nccl_debug_file.name

Collaborator

nit: would these three debug flags be still needed when the test is in CI?

Collaborator Author

Yes, these debug flags dump NVLS-related printouts from NCCL into a file. That's how I test whether NVLS is actually being used. I don't think there is currently a runtime API in CUDA to check whether NVLS is in use.
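
For illustration, a sketch of how the debug dump can be checked in a test; the exact substring matched below is an assumption about NCCL's log format, not the literal check used in this PR:

```
import os
import tempfile

# Route NCCL's NVLS subsystem logs to a temp file; these must be set before
# the NCCL communicator is created.
nccl_debug_file = tempfile.NamedTemporaryFile()
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "NVLS"
os.environ["NCCL_DEBUG_FILE"] = nccl_debug_file.name

# ... init the process group and run the registered-buffer allreduce here ...

# After the collective, inspect the dump for NVLS-related lines.
with open(nccl_debug_file.name) as f:
    nccl_log = f.read()
assert "NVLS" in nccl_log, "expected NVLS-related output in the NCCL debug log"
```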

os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "NVLS"
os.environ["NCCL_DEBUG_FILE"] = nccl_debug_file.name
os.environ["ALLOCATOR_PATH"] = self.createNcclAllocator()

Collaborator

I wonder where the ALLOCATOR_PATH env is consumed? By the torch library?

pg = c10d.distributed_c10d._get_default_group()
backend = pg._get_backend(torch.device(device))
allocator = torch.cuda.memory.CUDAPluggableAllocator(
os.environ["ALLOCATOR_PATH"],

Collaborator

I see. The env is used here. Can we just pass it as self.allocator_path?

@syed-ahmed (Collaborator Author) Nov 14, 2024

Changed to self.allocator_path. Now using a local allocator_path variable in the test.
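
For reference, a short sketch of the resulting setup inside the test method; the allocator entry-point names are illustrative assumptions:

```
# Inside the test method: the allocator path now comes from the test itself
# rather than an environment variable (entry-point names are illustrative).
allocator_path = self.createNcclAllocator()  # existing test helper that builds the .so
allocator = torch.cuda.memory.CUDAPluggableAllocator(
    allocator_path, "nccl_alloc_plug", "nccl_free_plug"
)
pool = torch.cuda.MemPool(allocator.allocator())
```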

Comment on lines 2678 to 2681
# clean up memory
pool.release()
del tensor
pool.empty_cache()

Collaborator

Curious: are these lines required for users?

Collaborator Author

Deleted these lines. Now you only need to del tensor, pool to reclaim memory.
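
A minimal cleanup sketch, assuming the tensor was allocated inside the pool as in the snippets above:

```
# Dropping the Python references is now sufficient; the pool's memory is
# reclaimed once both the tensor and the pool go away.
del tensor, pool
```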

Comment on lines 2673 to 2676
backend.register_user_buffers(device)
pg.allreduce(tensor).wait()
torch.cuda.synchronize(device=device)
backend.deregister_user_buffers(device)

@kwen2501 (Collaborator) Sep 24, 2024

See my comment below re API name. Would something like register_mem_pool be more accurate?

Collaborator

And, is it possible to do:

pool = torch.cuda.MemPool(allocator.allocator())
backend.register_mem_pool(pool)

with torch.cuda.use_mem_pool(pool):
    ...

backend.deregister_mem_pool(pool)
pool.release()
pool.empty_cache()

Collaborator Author

Now it is possible to do the above. Note that registration has to come after the memory has been allocated into the pool, because of how registerMemPool is written. registerMemPool finds all the blocks currently in a pool and then calls ncclCommRegister on them. So if it's called before any allocation, that means it's registering an empty pool.
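
For illustration, a minimal ordering sketch based on the explanation above, reusing the pool, backend, pg, allocator, and device objects from the earlier snippets (register only after allocations have landed in the pool):

```
pool = torch.cuda.MemPool(allocator.allocator())

# Registering here would be pointless: the pool has no blocks yet, so there is
# nothing for ncclCommRegister to pick up.
# backend.register_mem_pool(pool)

with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)

# Register once the allocation has landed in the pool, then run the collective.
backend.register_mem_pool(pool)
pg.allreduce(tensor).wait()
torch.cuda.synchronize(device=device)
backend.deregister_mem_pool(pool)
```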

LOG(INFO) << logPrefix()
<< "Performing user buffer registration on backend device "
<< device << ", key " << key << ", i am " << this;
auto ncclComm = getNCCLComm(key, device, OpType::ALLREDUCE);

Collaborator

Do you mind leaving a HACK note here?
For enabling NVLS, using OpType::ALLREDUCE is fine. I don't know if we would reuse the same API to register buffers for zero-copy P2P. If we do, at least we know there is a hack that needs to change.

Collaborator Author

Hack is not needed anymore since getNCCLComm can now get the communicator without OpType.

Collaborator Author

Never mind, OpType::ALLREDUCE is still needed to initialize comms. Left a note about the HACK.

syed-ahmed added a commit that referenced this pull request Sep 25, 2024
ghstack-source-id: 886382a
Pull Request resolved: #133603

@izaitsevfb (Contributor)

@syed-ahmed, apologies for the revert; we needed it to unblock the revert of #140087.

Please rebase and reland at your convenience. Thanks!

syed-ahmed added a commit that referenced this pull request Nov 20, 2024
ghstack-source-id: 8aab22f
Pull Request resolved: #133603

@syed-ahmed (Collaborator Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Ryo-not-rio pushed a commit to Ryo-not-rio/pytorch that referenced this pull request Dec 2, 2024
This PR implements user buffer registration and demonstrates NVLink Sharp (NVLS) reductions using a combination of allocating special memory with MemPool and registering it with the NCCL buffer registration APIs.

Part of pytorch#124807.

Pull Request resolved: pytorch#133603
Approved by: https://github.com/kwen2501, https://github.com/eqy
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
This PR implements user buffer registration and demonstrates NVLink Sharp (NVLS) reductions using a combination of allocating special memory with MemPool and registering it with the NCCL buffer registration APIs.

Part of pytorch#124807.

Pull Request resolved: pytorch#133603
Approved by: https://github.com/kwen2501, https://github.com/eqy
The github-actions bot deleted the gh/syed-ahmed/5/head branch on December 21, 2024 02:06.
kwen2501 added a commit that referenced this pull request Jan 28, 2025
This PR implements a small UI improvement over #133603.

It prepares an NCCL memory allocator in torch cpp and then pybinds it out, so that users can use it directly.

UI:
```
pool = torch.cuda.MemPool(dist.nccl_mem_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
```

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

pytorchmergebot pushed a commit that referenced this pull request Jan 29, 2025
This PR implements a small UI improvement over #133603.

It prepares an NCCL memory allocator in torch cpp and then pybinds it out, so that users can use it directly.

UI:
```
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
```

Pull Request resolved: #145675
Approved by: https://github.com/syed-ahmed, https://github.com/wconstab
mori360 pushed a commit to mori360/pytorch that referenced this pull request Feb 6, 2025
This PR implements a small UI improvement over pytorch#133603.

It prepares an NCCL memory allocator in torch cpp and then pybinds it out, so that users can use it directly.

UI:
```
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
```

Pull Request resolved: pytorch#145675
Approved by: https://github.com/syed-ahmed, https://github.com/wconstab

Labels

ci-no-td (Do not run TD on this PR), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, oncall: distributed (Add this issue/PR to distributed oncall triage queue), open source, release notes: distributed (c10d) (release notes category), Reverted


7 participants