-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Revert #137523: Add functionality to call dump function of NCCL profiler plugin #139847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…CCL profiler plugin [ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139847
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 7ac5c27 with merge base c19c384 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
Can you provide some more documentation of what the NCCL profiler plugin is supposed to be used for? I'm not sure if I understand your point (1), is the claim that we are accessing some non-public parts of nccl internals? or is this supposed to be a user-supplied plugin to NCCL, but you just don't think pytorch should be involved? |
|
Sort of both.
The functions in the plugin are for internal callback of NCCL. By no means should |
Well, if we're allowed to build a profiler that takes nccl details and dumps them to a visualizer format on disk, why can't we build a profiler that stores them in memory and lets the torch flight recorder dump their state?
the bigger issue with #137523 IMO is that it does not actually provide the complete package. It's missing the actual profiler plugin, which means the integration code is not 'generally' useful, especially since it assumes a particular contract without defining an API. I think it would be better if the functionality were actually usable out of the box. |
fduwjj
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I think network AI team add this on purpose. cc: @c-p-i-o
|
I agree that anyone can write and build a The APIs in |
|
I have asked before whether Flight Recorder would be interested in accepting plugins (for itself, not for NCCL), as in:
|
Yes - the network AI team added this item in anticipation of the new NCCL library being rolled out everywhere.
Yes. Both the plugin and analysis code are meant to be open sourced. |
|
Btw, NCCL's profiler plugin defines a finalize API: NCCL will call this finalize, and an implementation of it can trigger a dump, as demo'ed in this example: |
|
Spoke with network.ai. Ok to revert now. |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
…L profiler plugin (pytorch#139847) Reverts PR pytorch#137523 Reasons for the reversion: 1. NCCL profiler plugin is meant to be opened by NCCL. And the profiler's implementation is meant to be provided by a profiler. There is no evidence that `torch.distributed` is at a better position to be either an opener or a provider. (The PR to be reverted made `torch.distributed` an opener). 2. The main purpose of the reverted PR is to dlopen a dump function, with the help of an environment variable `NCCL_PROFILER_PLUGIN_FUN` that provides the symbol name, as in code below: https://github.com/pytorch/pytorch/blob/c19c38469030cdf399714eb2051887f4583006a8/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L415-L427 After some investigation, NCCL does not support env var `NCCL_PROFILER_PLUGIN_FUN`. And NCCL's profiler contract `nccl_profiler.h` does not have a function called "ncclProfilerPluginDump" defined. So this looks like a private add-on. Pull Request resolved: pytorch#139847 Approved by: https://github.com/c-p-i-o
…L profiler plugin (pytorch#139847) Reverts PR pytorch#137523 Reasons for the reversion: 1. NCCL profiler plugin is meant to be opened by NCCL. And the profiler's implementation is meant to be provided by a profiler. There is no evidence that `torch.distributed` is at a better position to be either an opener or a provider. (The PR to be reverted made `torch.distributed` an opener). 2. The main purpose of the reverted PR is to dlopen a dump function, with the help of an environment variable `NCCL_PROFILER_PLUGIN_FUN` that provides the symbol name, as in code below: https://github.com/pytorch/pytorch/blob/c19c38469030cdf399714eb2051887f4583006a8/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L415-L427 After some investigation, NCCL does not support env var `NCCL_PROFILER_PLUGIN_FUN`. And NCCL's profiler contract `nccl_profiler.h` does not have a function called "ncclProfilerPluginDump" defined. So this looks like a private add-on. Pull Request resolved: pytorch#139847 Approved by: https://github.com/c-p-i-o
…L profiler plugin (pytorch#139847) Reverts PR pytorch#137523 Reasons for the reversion: 1. NCCL profiler plugin is meant to be opened by NCCL. And the profiler's implementation is meant to be provided by a profiler. There is no evidence that `torch.distributed` is at a better position to be either an opener or a provider. (The PR to be reverted made `torch.distributed` an opener). 2. The main purpose of the reverted PR is to dlopen a dump function, with the help of an environment variable `NCCL_PROFILER_PLUGIN_FUN` that provides the symbol name, as in code below: https://github.com/pytorch/pytorch/blob/c19c38469030cdf399714eb2051887f4583006a8/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L415-L427 After some investigation, NCCL does not support env var `NCCL_PROFILER_PLUGIN_FUN`. And NCCL's profiler contract `nccl_profiler.h` does not have a function called "ncclProfilerPluginDump" defined. So this looks like a private add-on. Pull Request resolved: pytorch#139847 Approved by: https://github.com/c-p-i-o
Stack from ghstack (oldest at bottom):
Reverts PR #137523
Reasons for the reversion:
NCCL profiler plugin is meant to be opened by NCCL. And the profiler's implementation is meant to be provided by a profiler. There is no evidence that
torch.distributedis at a better position to be either an opener or a provider. (The PR to be reverted madetorch.distributedan opener).The main purpose of the reverted PR is to dlopen a dump function, with the help of an environment variable
NCCL_PROFILER_PLUGIN_FUNthat provides the symbol name, as in code below:pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
Lines 415 to 427 in c19c384
After some investigation, NCCL does not support env var
NCCL_PROFILER_PLUGIN_FUN. And NCCL's profiler contractnccl_profiler.hdoes not have a function called "ncclProfilerPluginDump" defined. So this looks like a private add-on.cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o