
Conversation

@syed-ahmed (Collaborator) commented Aug 15, 2024

This PR implements user buffer registration and demonstrates NVLink Sharp (NVLS) reductions using a combination of allocating special memory with MemPool and registering it with the NCCL buffer registration APIs.

Part of #124807.
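
Below is a minimal end-to-end sketch of the flow this PR enables, assuming the register_mem_pool / deregister_mem_pool naming discussed later in this thread; the allocator library path and its exported entry-point names are illustrative assumptions, not part of this PR:

```
import os

import torch
import torch.distributed as c10d

# Assumes the default NCCL process group is already initialized, and that
# ALLOCATOR_PATH points to a compiled allocator library whose alloc/free
# entry-point names are as given below (both are illustrative assumptions).
device = torch.device("cuda", torch.cuda.current_device())
allocator = torch.cuda.memory.CUDAPluggableAllocator(
    os.environ["ALLOCATOR_PATH"], "nccl_alloc_plug", "nccl_free_plug"
)

pg = c10d.distributed_c10d._get_default_group()
backend = pg._get_backend(device)

# Allocate into a dedicated pool backed by the pluggable allocator...
pool = torch.cuda.MemPool(allocator.allocator())
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)

# ...then register the pool with NCCL so the allreduce can use NVLS.
backend.register_mem_pool(pool)
pg.allreduce(tensor).wait()
torch.cuda.synchronize(device=device)
backend.deregister_mem_pool(pool)

del tensor, pool  # reclaims the pool's memory
```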

Stack from ghstack (oldest at bottom):

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot bot commented Aug 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133603

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 6a3235e with merge base a440a01:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

syed-ahmed added a commit that referenced this pull request Aug 16, 2024
ghstack-source-id: 3464c7f
Pull Request resolved: #133603
syed-ahmed added a commit that referenced this pull request Aug 27, 2024
ghstack-source-id: 9f253d3
Pull Request resolved: #133603
syed-ahmed added a commit that referenced this pull request Aug 29, 2024
ghstack-source-id: e615552
Pull Request resolved: #133603

@kwen2501 (Collaborator) left a comment


LGTM. Thanks for the great feature.
I left some comments re API naming, etc.

Comment on lines 693 to 696
void registerUserBuffers(at::Device device);

void deregisterUserBuffers(at::Device device);


Collaborator

Do you mind adding descriptions to these two public methods?

@kwen2501 (Collaborator) Sep 24, 2024

Re the name of the APIs:
After reading the API impl, it seems that we are registering all segments of the context mem pool. So I wonder if "UserBuffers" in the API names is from the NCCL perspective rather than the PyTorch perspective? Perhaps we can name them something like registerMemPool?

Collaborator

Re the argument to the API:
any preference between passing a device or a memPool?
If accepting a device, it implies that the API will get the mem pool from the current context?

Collaborator Author

Renamed these functions and added descriptions. Changed the input from device to pool.

Comment on lines +2636 to +2638
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "NVLS"
os.environ["NCCL_DEBUG_FILE"] = nccl_debug_file.name

Collaborator

nit: would these three debug flags be still needed when the test is in CI?

Collaborator Author

Yes, these debug flags dump NVLS-related printouts from NCCL into a file. That's how I test whether NVLS is actually being used. I don't think there is currently a runtime API in CUDA to check whether NVLS is in use.
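
For illustration, a sketch of how the debug dump can be checked in a test; the exact substring matched below is an assumption about NCCL's log format, not the literal check used in this PR:

```
import os
import tempfile

# Route NCCL's NVLS subsystem logs to a temp file; these must be set before
# the NCCL communicator is created.
nccl_debug_file = tempfile.NamedTemporaryFile()
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "NVLS"
os.environ["NCCL_DEBUG_FILE"] = nccl_debug_file.name

# ... init the process group and run the registered-buffer allreduce here ...

# After the collective, inspect the dump for NVLS-related lines.
with open(nccl_debug_file.name) as f:
    nccl_log = f.read()
assert "NVLS" in nccl_log, "expected NVLS-related output in the NCCL debug log"
```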

os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "NVLS"
os.environ["NCCL_DEBUG_FILE"] = nccl_debug_file.name
os.environ["ALLOCATOR_PATH"] = self.createNcclAllocator()

Collaborator

I wonder where the ALLOCATOR_PATH env is consumed? By the torch library?

pg = c10d.distributed_c10d._get_default_group()
backend = pg._get_backend(torch.device(device))
allocator = torch.cuda.memory.CUDAPluggableAllocator(
os.environ["ALLOCATOR_PATH"],

Collaborator

I see. The env is used here. Can we just pass it as self.allocator_path?

@syed-ahmed (Collaborator Author) Nov 14, 2024

Changed to self.allocator_path. Now using a local allocator_path variable in the test.
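
For reference, a short sketch of the resulting setup inside the test method; the allocator entry-point names are illustrative assumptions:

```
# Inside the test method: the allocator path now comes from the test itself
# rather than an environment variable (entry-point names are illustrative).
allocator_path = self.createNcclAllocator()  # existing test helper that builds the .so
allocator = torch.cuda.memory.CUDAPluggableAllocator(
    allocator_path, "nccl_alloc_plug", "nccl_free_plug"
)
pool = torch.cuda.MemPool(allocator.allocator())
```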

Comment on lines 2678 to 2681
# clean up memory
pool.release()
del tensor
pool.empty_cache()

Collaborator

Curious: are these lines required for users?

Collaborator Author

Deleted these lines. Now you only need to del tensor, pool to reclaim memory.
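
A minimal cleanup sketch, assuming the tensor was allocated inside the pool as in the snippets above:

```
# Dropping the Python references is now sufficient; the pool's memory is
# reclaimed once both the tensor and the pool go away.
del tensor, pool
```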

Comment on lines 2673 to 2676
backend.register_user_buffers(device)
pg.allreduce(tensor).wait()
torch.cuda.synchronize(device=device)
backend.deregister_user_buffers(device)

@kwen2501 (Collaborator) Sep 24, 2024

See my comment below re API name. Would something like register_mem_pool be more accurate?

Collaborator

And, is it possible to do:

pool = torch.cuda.MemPool(allocator.allocator())
backend.register_mem_pool(pool)

with torch.cuda.use_mem_pool(pool):
    ...

backend.deregister_mem_pool(pool)
pool.release()
pool.empty_cache()

Collaborator Author

Now it is possible to do the above. Note that registration has to come after the memory has been allocated into the pool, because of how registerMemPool is written. registerMemPool finds all the blocks currently in a pool and then calls ncclCommRegister on them. So if it's called before any allocation, that means it's registering an empty pool.
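
For illustration, a minimal ordering sketch based on the explanation above, reusing the pool, backend, pg, allocator, and device objects from the earlier snippets (register only after allocations have landed in the pool):

```
pool = torch.cuda.MemPool(allocator.allocator())

# Registering here would be pointless: the pool has no blocks yet, so there is
# nothing for ncclCommRegister to pick up.
# backend.register_mem_pool(pool)

with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)

# Register once the allocation has landed in the pool, then run the collective.
backend.register_mem_pool(pool)
pg.allreduce(tensor).wait()
torch.cuda.synchronize(device=device)
backend.deregister_mem_pool(pool)
```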

LOG(INFO) << logPrefix()
<< "Performing user buffer registration on backend device "
<< device << ", key " << key << ", i am " << this;
auto ncclComm = getNCCLComm(key, device, OpType::ALLREDUCE);

Collaborator

Do you mind leaving a HACK note here?
For enabling NVLS, using OpType::ALLREDUCE is fine. I don't know if we would reuse the same API to register buffers for zero-copy P2P. If we do, at least we know there is a hack that needs to change.

Collaborator Author

Hack is not needed anymore since getNCCLComm can now get the communicator without OpType.

Collaborator Author

Never mind, OpType::ALLREDUCE is still needed to initialize comms. Left a note about the HACK.

syed-ahmed added a commit that referenced this pull request Sep 25, 2024
ghstack-source-id: 886382a
Pull Request resolved: #133603

@izaitsevfb (Contributor)

@syed-ahmed, apologies for the revert; we needed it to unblock the revert of #140087.

Please rebase and reland at your convenience. Thanks!

syed-ahmed added a commit that referenced this pull request Nov 20, 2024
ghstack-source-id: 8aab22f
Pull Request resolved: #133603

@syed-ahmed (Collaborator Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

Ryo-not-rio pushed a commit to Ryo-not-rio/pytorch that referenced this pull request Dec 2, 2024
This PR implements user buffer registration and demonstrates NVLink Sharp (NVLS) reductions using a combination of allocating special memory with MemPool and registering it with the NCCL buffer registration APIs.

Part of pytorch#124807.

Pull Request resolved: pytorch#133603
Approved by: https://github.com/kwen2501, https://github.com/eqy
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
This PR implements user buffer registration and demonstrates NVLink Sharp (NVLS) reductions using a combination of allocating special memory with MemPool and registering it with the NCCL buffer registration APIs.

Part of pytorch#124807.

Pull Request resolved: pytorch#133603
Approved by: https://github.com/kwen2501, https://github.com/eqy
The github-actions bot deleted the gh/syed-ahmed/5/head branch on December 21, 2024 02:06.
kwen2501 added a commit that referenced this pull request Jan 28, 2025
This PR implements a small UI improvement over #133603.

It prepares an NCCL memory allocator in torch cpp and then pybinds it out, so that users can use it directly.

UI:
```
pool = torch.cuda.MemPool(dist.nccl_mem_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
```

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k c-p-i-o

pytorchmergebot pushed a commit that referenced this pull request Jan 29, 2025
This PR implements a small UI improvement over #133603.

It prepares an NCCL memory allocator in torch cpp and then pybinds it out, so that users can use it directly.

UI:
```
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
```

Pull Request resolved: #145675
Approved by: https://github.com/syed-ahmed, https://github.com/wconstab
mori360 pushed a commit to mori360/pytorch that referenced this pull request Feb 6, 2025
This PR implements a small UI improvement over pytorch#133603.

It prepares an NCCL memory allocator in torch cpp and then pybinds it out, so that users can use it directly.

UI:
```
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
```

Pull Request resolved: pytorch#145675
Approved by: https://github.com/syed-ahmed, https://github.com/wconstab

Labels

ci-no-td (Do not run TD on this PR), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, oncall: distributed (Add this issue/PR to distributed oncall triage queue), open source, release notes: distributed (c10d) (release notes category), Reverted


7 participants