fixing broadcast_object_list during torchrun #5615
myron wants to merge 7 commits into Project-MONAI:dev from myron:cacheauth
Conversation
Signed-off-by: myron <[email protected]>
if self.runtime_cache:
    if self._is_dist:
        # ensure each process has the same authkey, otherwise broadcast_object_list fails
        mp.current_process().authkey = np.arange(32, dtype=np.uint8).tobytes()
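This fix works because the key is computed deterministically rather than taken from per-process random state, so every rank arrives at identical bytes without any communication. A minimal single-process sketch (no distributed setup needed):

```python
import numpy as np

# Every process that evaluates this expression gets the same 32 bytes,
# so ranks spawned independently by torchrun end up with matching authkeys.
key = np.arange(32, dtype=np.uint8).tobytes()
assert len(key) == 32
assert key == bytes(range(32))
```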
I personally feel this method of changing the global state of Python multiprocessing is risky; let's investigate other "safer" solutions, or directly disable the shared-memory support (runtime cache feature) for this case.
Adding @ericspod @wyli for more discussion as they are Python and PyTorch experts.
Thanks in advance.
I think it should be okay, but another solution is to broadcast the authkey from the first process to all others (where it should already be the same). I had it here in the initial version; I can put it back:
import multiprocessing as mp

import numpy as np
import torch
import torch.distributed as dist

# ensure each process has the same authkey, otherwise broadcast_object_list fails
authkey = bytearray(mp.current_process().authkey).ljust(32)[:32]
ak_tensor = torch.from_numpy(np.frombuffer(authkey, dtype=np.uint8))
if dist.get_backend() == dist.Backend.NCCL:
    ak_tensor = ak_tensor.cuda(torch.cuda.current_device())
dist.broadcast(ak_tensor, src=0)
mp.current_process().authkey = ak_tensor.cpu().numpy().tobytes()
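The normalization step above pads or truncates the key to exactly 32 bytes, so every rank broadcasts a tensor of the same shape. That step can be checked in isolation, without initializing torch.distributed (the helper name below is illustrative, not from the PR):

```python
import multiprocessing as mp
import numpy as np

def normalize_authkey(key: bytes, size: int = 32) -> bytes:
    # Pad (with spaces, bytearray.ljust's default fill) or truncate so
    # every rank ends up with a fixed-length buffer to broadcast.
    return bytes(bytearray(key).ljust(size)[:size])

key = normalize_authkey(bytes(mp.current_process().authkey))
assert len(key) == 32
# Round-trip through a uint8 array, mirroring the tensor broadcast path.
assert np.frombuffer(key, dtype=np.uint8).tobytes() == key
```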
Another approach is to go back to the SharedDataCache PR that I proposed initially, where the shared cache was initialized in the master process and passed as an input option.
I'm not sure what the implications of changing the auth key in this way could be. I don't see what other code would do if this were changed, but if this is used in a larger program there may be other concurrent code that is affected, though that would be very rare. If we can manage by broadcasting the key afterwards rather than changing global data, that is probably better; it is also less likely to cause weird issues on Windows, where we don't have fork semantics.
BTW, could you please help test whether our runtime-cache mechanism can work with other Thanks in advance.
Since #5630 removed most of the shared cache handling logic, this PR is no longer applicable.
Fixes #5613
This ensures that each process has the same authkey (the same as process 0) during torchrun multi-GPU runs, so that broadcast_object_list can be called for shared memory.
Here we broadcast the key from process 0 to all other processes.
Types of changes
- Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
- Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
- Documentation builds with the make html command in the docs/ folder.