Skip to content

Conversation

@kwen2501
Copy link
Collaborator

@kwen2501 kwen2501 commented Nov 6, 2024

Stack from ghstack (oldest at bottom):

Reverts PR #137523

Reasons for the reversion:

  1. NCCL profiler plugin is meant to be opened by NCCL. And the profiler's implementation is meant to be provided by a profiler. There is no evidence that torch.distributed is at a better position to be either an opener or a provider. (The PR to be reverted made torch.distributed an opener).

  2. The main purpose of the reverted PR is to dlopen a dump function, with the help of an environment variable NCCL_PROFILER_PLUGIN_FUN that provides the symbol name, as in code below:

    const std::string profilerPluginFun = getCvarString(
    {"NCCL_PROFILER_PLUGIN_FUN"}, "_Z22ncclProfilerPluginDumpB5cxx11v");
    if (profilerPluginFun.empty()) {
    LOG(WARNING) << "NCCL_PROFILER_PLUGIN_FUN is empty";
    return ncclDumpMap;
    }
    std::
    unordered_map<std::string, std::unordered_map<std::string, std::string>> (
    *dumpFn)() =
    (std::unordered_map<
    std::string,
    std::unordered_map<std::string, std::string>>(*)())
    dlsym(handle, profilerPluginFun.c_str());

    After some investigation, NCCL does not support env var NCCL_PROFILER_PLUGIN_FUN. And NCCL's profiler contract nccl_profiler.h does not have a function called "ncclProfilerPluginDump" defined. So this looks like a private add-on.

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot
Copy link

pytorch-bot bot commented Nov 6, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139847

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7ac5c27 with merge base c19c384 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category labels Nov 6, 2024
kwen2501 added a commit that referenced this pull request Nov 6, 2024
…CCL profiler plugin

ghstack-source-id: a314b61
Pull Request resolved: #139847
@kwen2501 kwen2501 added the topic: not user facing topic category label Nov 6, 2024
@kwen2501 kwen2501 changed the title [Revert][NCCL][Profiler] Add functionality to call dump function of NCCL profiler plugin Revert #137523: Add functionality to call dump function of NCCL profiler plugin Nov 6, 2024
@wconstab
Copy link
Contributor

wconstab commented Nov 7, 2024

Can you provide some more documentation of what the NCCL profiler plugin is supposed to be used for? I'm not sure if I understand your point (1), is the claim that we are accessing some non-public parts of nccl internals? or is this supposed to be a user-supplied plugin to NCCL, but you just don't think pytorch should be involved?

@kwen2501
Copy link
Collaborator Author

kwen2501 commented Nov 7, 2024

Sort of both.

nccl_profiler.h defines a contract between (1) NCCL and (2) a user or a profiler who's willing to implement a plugin to be run by NCCL. For example, NSight or kineto can provide such implementation so that they can have a "pulse" of NCCL activities (because NCCL will call it), then they can perhaps display the activities like displaying the CUDA one). Those activities being profiled today are pretty "low-level", such as activities of the NCCL proxy thread.

The functions in the plugin are for internal callback of NCCL. By no means should torch.distributed dlopen and run a function from it.

@wconstab
Copy link
Contributor

wconstab commented Nov 7, 2024

The functions in the plugin are for internal callback of NCCL. By no means should torch.distributed dlopen and run a function from it.

Well, if we're allowed to build a profiler that takes nccl details and dumps them to a visualizer format on disk, why can't we build a profiler that stores them in memory and lets the torch flight recorder dump their state?

After some investigation, NCCL does not support env var NCCL_PROFILER_PLUGIN_FUN. And NCCL's profiler contract nccl_profiler.h does not have a function called "ncclProfilerPluginDump" defined. So this looks like a private add-on

the bigger issue with #137523 IMO is that it does not actually provide the complete package. It's missing the actual profiler plugin, which means the integration code is not 'generally' useful, especially since it assumes a particular contract without defining an API. I think it would be better if the functionality were actually usable out of the box.

Copy link
Contributor

@fduwjj fduwjj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I think network AI team add this on purpose. cc: @c-p-i-o

@kwen2501
Copy link
Collaborator Author

kwen2501 commented Nov 7, 2024

I agree that anyone can write and build a libnccl_profiler.so. It is just that it is not for torch.distributed to dlopen and run it. Said differently, torch.distributed did not define a plugin contract for itself (so that it can accept a plugin). But what the PR under discussion does is exactly dlopen + run.

The APIs in nccl_profiler.h (as well as nccl_net.h -- network plugin) are different from the APIs in nccl.h -- only the latter are meant to be called by NCCL's user.

@kwen2501
Copy link
Collaborator Author

kwen2501 commented Nov 7, 2024

I have asked before whether Flight Recorder would be interested in accepting plugins (for itself, not for NCCL), as in:

  • FR defines a set of API signatures, like record(string collName, int nElem, int pgId), dump(), etc, and release them via a .h file;
  • A 3rd-party interested would provide their own implementation for the record, dump functions, build them into a say libtorch_fr.so;
  • FR promises to dlopen libtorch_fr.so, use those run, dump functions to replace the in-house implementations of the same functions, and run record(...), dump() at some point of ProcessGroupNCCL's code (which FR already does);
  • That would allow that 3rd-party to get trace/dump of ProcessGroupNCCL in their own format.

@c-p-i-o
Copy link
Contributor

c-p-i-o commented Nov 7, 2024

Hmm I think network AI team add this on purpose. cc: @c-p-i-o

Yes - the network AI team added this item in anticipation of the new NCCL library being rolled out everywhere.

The functions in the plugin are for internal callback of NCCL. By no means should torch.distributed dlopen and run a function from it.

Well, if we're allowed to build a profiler that takes nccl details and dumps them to a visualizer format on disk, why can't we build a profiler that stores them in memory and lets the torch flight recorder dump their state?

After some investigation, NCCL does not support env var NCCL_PROFILER_PLUGIN_FUN. And NCCL's profiler contract nccl_profiler.h does not have a function called "ncclProfilerPluginDump" defined. So this looks like a private add-on

the bigger issue with #137523 IMO is that it does not actually provide the complete package. It's missing the actual profiler plugin, which means the integration code is not 'generally' useful, especially since it assumes a particular contract without defining an API. I think it would be better if the functionality were actually usable out of the box.

Yes. Both the plugin and analysis code are meant to be open sourced.
Let me check with them on the actual timelines.

@kwen2501
Copy link
Collaborator Author

kwen2501 commented Nov 7, 2024

Btw, NCCL's profiler plugin defines a finalize API:

  // finalize - finalize the profiler plugin
  // Input
  //  - context: opaque profiler context object
  ncclResult_t (*finalize)(void* context);

NCCL will call this finalize, and an implementation of it can trigger a dump, as demo'ed in this example:
https://github.com/NVIDIA/nccl/blob/2ea4ee94bfb04c886c79ccae60ac9961000fdee2/ext-profiler/example/plugin.c#L130-L134
I am not sure why torch needs to trigger the dump on NCCL's behalf.

@c-p-i-o
Copy link
Contributor

c-p-i-o commented Nov 7, 2024

Spoke with network.ai. Ok to revert now.
We can sync back with @zhiyongww and forge a viable path for re-enabling similar functionality in PyTorch.

@c-p-i-o c-p-i-o self-requested a review November 7, 2024 21:26
@kwen2501
Copy link
Collaborator Author

kwen2501 commented Nov 7, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 7, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

zero000064 pushed a commit to zero000064/pytorch that referenced this pull request Nov 14, 2024
…L profiler plugin (pytorch#139847)

Reverts PR pytorch#137523

Reasons for the reversion:
1. NCCL profiler plugin is meant to be opened by NCCL. And the profiler's implementation is meant to be provided by a profiler. There is no evidence that `torch.distributed` is at a better position to be either an opener or a provider. (The PR to be reverted made `torch.distributed` an opener).

2. The main purpose of the reverted PR is to dlopen a dump function, with the help of an environment variable `NCCL_PROFILER_PLUGIN_FUN` that provides the symbol name, as in code below:
https://github.com/pytorch/pytorch/blob/c19c38469030cdf399714eb2051887f4583006a8/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L415-L427
After some investigation, NCCL does not support env var `NCCL_PROFILER_PLUGIN_FUN`. And NCCL's profiler contract `nccl_profiler.h` does not have a function called "ncclProfilerPluginDump" defined. So this looks like a private add-on.

Pull Request resolved: pytorch#139847
Approved by: https://github.com/c-p-i-o
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
…L profiler plugin (pytorch#139847)

Reverts PR pytorch#137523

Reasons for the reversion:
1. NCCL profiler plugin is meant to be opened by NCCL. And the profiler's implementation is meant to be provided by a profiler. There is no evidence that `torch.distributed` is at a better position to be either an opener or a provider. (The PR to be reverted made `torch.distributed` an opener).

2. The main purpose of the reverted PR is to dlopen a dump function, with the help of an environment variable `NCCL_PROFILER_PLUGIN_FUN` that provides the symbol name, as in code below:
https://github.com/pytorch/pytorch/blob/c19c38469030cdf399714eb2051887f4583006a8/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L415-L427
After some investigation, NCCL does not support env var `NCCL_PROFILER_PLUGIN_FUN`. And NCCL's profiler contract `nccl_profiler.h` does not have a function called "ncclProfilerPluginDump" defined. So this looks like a private add-on.

Pull Request resolved: pytorch#139847
Approved by: https://github.com/c-p-i-o
@github-actions github-actions bot deleted the gh/kwen2501/93/head branch December 8, 2024 02:17
fmo-mt pushed a commit to fmo-mt/pytorch that referenced this pull request Dec 11, 2024
…L profiler plugin (pytorch#139847)

Reverts PR pytorch#137523

Reasons for the reversion:
1. NCCL profiler plugin is meant to be opened by NCCL. And the profiler's implementation is meant to be provided by a profiler. There is no evidence that `torch.distributed` is at a better position to be either an opener or a provider. (The PR to be reverted made `torch.distributed` an opener).

2. The main purpose of the reverted PR is to dlopen a dump function, with the help of an environment variable `NCCL_PROFILER_PLUGIN_FUN` that provides the symbol name, as in code below:
https://github.com/pytorch/pytorch/blob/c19c38469030cdf399714eb2051887f4583006a8/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L415-L427
After some investigation, NCCL does not support env var `NCCL_PROFILER_PLUGIN_FUN`. And NCCL's profiler contract `nccl_profiler.h` does not have a function called "ncclProfilerPluginDump" defined. So this looks like a private add-on.

Pull Request resolved: pytorch#139847
Approved by: https://github.com/c-p-i-o
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category topic: not user facing topic category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants