fixing broadcast_object_list during torchrun#5615

Closed
myron wants to merge 7 commits intoProject-MONAI:devfrom
myron:cacheauth

Conversation

@myron (Collaborator) commented Nov 30, 2022

Fixes #5613

This ensures that each process has the same authkey (matching process 0) during torchrun multi-GPU runs, so that broadcast_object_list can be called for shared memory.
Here we broadcast the key from process 0 to all other processes.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

@Nic-Ma Nic-Ma requested a review from wyli December 1, 2022 05:13
@wyli wyli removed their request for review December 1, 2022 17:15
    if self.runtime_cache:
        if self._is_dist:
            # ensure each process has the same authkey, otherwise broadcast_object_list fails
            mp.current_process().authkey = np.arange(32, dtype=np.uint8).tobytes()
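The line above can be exercised as a minimal, stdlib-plus-numpy sketch (assumption: every torchrun rank runs the same line before any shared-memory handles are pickled; the key value is arbitrary, only equality across processes matters):

```python
import multiprocessing as mp

import numpy as np

# Derive a deterministic 32-byte authkey so that every rank launched by
# torchrun ends up with an identical key; multiprocessing authenticates
# shared-memory connections against it.
key = np.arange(32, dtype=np.uint8).tobytes()
mp.current_process().authkey = key

# The authkey setter wraps the bytes in an AuthenticationString.
print(len(bytes(mp.current_process().authkey)))  # -> 32
```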
Contributor
I personally feel this method of changing the global state of Python multiprocessing is risky; let's investigate other "safer" solutions, or directly disable the shared-memory support (the runtime cache feature) for this case.
Adding @ericspod @wyli as well for more discussion, as they are Python and PyTorch experts.

Thanks in advance.

Collaborator Author

I think it should be okay, but another solution is to broadcast the authkey from the first process to all others, so they all end up with the same key. I had it here in the initial version; I can put it back:

    # ensure each process has the same authkey, otherwise broadcast_object_list fails
    authkey = bytearray(mp.current_process().authkey).ljust(32)[:32]
    ak_tensor = torch.from_numpy(np.frombuffer(authkey, dtype=np.uint8))
    if dist.get_backend() == dist.Backend.NCCL:
        ak_tensor = ak_tensor.cuda(torch.cuda.current_device())
    dist.broadcast(ak_tensor, src=0)
    mp.current_process().authkey = ak_tensor.cpu().numpy().tobytes()

Another approach is to go back to the SharedDataCache PR that I proposed initially, where the shared cache was initialized in the master process and passed in as an input option.

Member

I'm not sure what the implications could be of changing the auth key in this way. I don't see what other code would do if it were changed, but if this is used in a larger program there may be other concurrent code that is affected, though that would be very rare. If we can manage by broadcasting the key afterwards rather than changing global data, that's probably better; it's also less likely to cause weird issues on Windows, where we don't have fork semantics.

@Nic-Ma (Contributor) commented Dec 2, 2022

BTW, could you please help test whether our runtime-cache mechanism can work with other start methods of the PyTorch DataLoader, if not using the default:
https://github.com/pytorch/pytorch/blob/master/torch/utils/data/dataloader.py#L188
I think it can be set by the multiprocessing_context arg?

Thanks in advance.
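For reference, a start method can be requested either as a string or as an explicit context object; this stdlib-only sketch shows the context that would be handed to the DataLoader (the DataLoader call in the comment is illustrative only, not executed here):

```python
import multiprocessing as mp

# Obtain an explicit "spawn" start-method context. PyTorch's DataLoader
# accepts either the method name or such a context object through its
# multiprocessing_context argument, e.g. (illustrative only):
#   DataLoader(ds, num_workers=2, multiprocessing_context="spawn")
ctx = mp.get_context("spawn")
print(ctx.get_start_method())  # -> spawn
```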

@myron (Collaborator Author) commented Dec 5, 2022

Since #5630 removed most of the shared cache handling logic, this PR is no longer applicable.

@myron myron closed this Dec 5, 2022
