fixing broadcast_object_list during torchrun #5615
myron wants to merge 7 commits into Project-MONAI:dev from myron:cacheauth
Conversation
Signed-off-by: myron <[email protected]>
if self.runtime_cache:
    if self._is_dist:
        # ensure each process has the same authkey, otherwise broadcast_object_list fails
        mp.current_process().authkey = np.arange(32, dtype=np.uint8).tobytes()
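This fix works because the key is computed deterministically rather than taken from per-process random state, so every rank arrives at identical bytes without any communication. A minimal single-process sketch (no distributed setup needed):

```python
import numpy as np

# Every process that evaluates this expression gets the same 32 bytes,
# so ranks spawned independently by torchrun end up with matching authkeys.
key = np.arange(32, dtype=np.uint8).tobytes()
assert len(key) == 32
assert key == bytes(range(32))
```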
I personally feel this method of changing the global state of Python multiprocessing is risky; let's investigate other "safer" solutions, or directly disable the shared-memory support (runtime cache feature) for this case.
Adding @ericspod @wyli for more discussion as they are Python and PyTorch experts.
Thanks in advance.
I think it should be okay, but another solution is to broadcast the authkey from the first process to all others (where it should already be the same). I had it here in the initial version; I can put it back:
import multiprocessing as mp

import numpy as np
import torch
import torch.distributed as dist

# ensure each process has the same authkey, otherwise broadcast_object_list fails
authkey = bytearray(mp.current_process().authkey).ljust(32)[:32]
ak_tensor = torch.from_numpy(np.frombuffer(authkey, dtype=np.uint8))
if dist.get_backend() == dist.Backend.NCCL:
    ak_tensor = ak_tensor.cuda(torch.cuda.current_device())
dist.broadcast(ak_tensor, src=0)
mp.current_process().authkey = ak_tensor.cpu().numpy().tobytes()
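The normalization step above pads or truncates the key to exactly 32 bytes, so every rank broadcasts a tensor of the same shape. That step can be checked in isolation, without initializing torch.distributed (the helper name below is illustrative, not from the PR):

```python
import multiprocessing as mp
import numpy as np

def normalize_authkey(key: bytes, size: int = 32) -> bytes:
    # Pad (with spaces, bytearray.ljust's default fill) or truncate so
    # every rank ends up with a fixed-length buffer to broadcast.
    return bytes(bytearray(key).ljust(size)[:size])

key = normalize_authkey(bytes(mp.current_process().authkey))
assert len(key) == 32
# Round-trip through a uint8 array, mirroring the tensor broadcast path.
assert np.frombuffer(key, dtype=np.uint8).tobytes() == key
```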
Another approach is to go back to the SharedDataCache PR that I proposed initially, where the shared cache was initialized in the master process and passed as an input option.
I'm not sure what the implications of changing the auth key in this way could be. I don't see what other code would do if this were changed, but if this is used in a larger program there may be other concurrent code that is affected, though that would be very rare. If we can manage by broadcasting the key afterwards rather than changing global data, that is probably better; it is also less likely to cause weird issues on Windows, where we don't have fork semantics.
BTW, could you please help test whether our runtime-cache mechanism can work with other Thanks in advance.
Since #5630 removed most of the shared cache handling logic, this PR is no longer applicable.
Fixes #5613
This ensures that each process has the same authkey (the same as process 0) during torchrun multi-GPU runs, so that broadcast_object_list can be called for shared memory.
Here we broadcast the key from process 0 to all other processes.
Types of changes
- Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
- Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
- Documentation builds with the make html command in the docs/ folder.