Closed
Labels
high priority
module: flaky-tests (Problem is a flaky test in CI)
module: rpc (Related to RPC, distributed autograd, RRef, and distributed optimizer)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
triage review
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Description
Jun 22 20:31:55 ======================================================================
Jun 22 20:31:55 ERROR [61.489s]: test_backward_ddp_inside (__main__.TestDdpUnderDistAutogradWrapper)
Jun 22 20:31:55 ----------------------------------------------------------------------
Jun 22 20:31:55 Traceback (most recent call last):
Jun 22 20:31:55 File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 204, in wrapper
Jun 22 20:31:55 self._join_processes(fn)
Jun 22 20:31:55 File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 306, in _join_processes
Jun 22 20:31:55 self._check_return_codes(elapsed_time)
Jun 22 20:31:55 File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 339, in _check_return_codes
Jun 22 20:31:55 raise RuntimeError(error)
Jun 22 20:31:55 RuntimeError: Processes 5 exited with error code 10
Jun 22 20:30:53 test_backward_ddp_inside (__main__.TestDdpUnderDistAutogradWrapper) ... 2020-06-22 20:30:53,172 ddp_under_dist_autograd_test.py:327 INFO p:process 2 t:MainThread: Running the trainer #2...
Jun 22 20:30:53 2020-06-22 20:30:53,172 ddp_under_dist_autograd_test.py:329 INFO p:process 2 t:MainThread: Initing trainer process group by trainer #2 with ranks [0, 1, 2, 3]
Jun 22 20:30:53 2020-06-22 20:30:53,172 ddp_under_dist_autograd_test.py:327 INFO p:process 1 t:MainThread: Running the trainer #1...
Jun 22 20:30:53 2020-06-22 20:30:53,172 ddp_under_dist_autograd_test.py:329 INFO p:process 1 t:MainThread: Initing trainer process group by trainer #1 with ranks [0, 1, 2, 3]
Jun 22 20:30:53 2020-06-22 20:30:53,172 ddp_under_dist_autograd_test.py:314 INFO p:process 4 t:MainThread: The remote worker is running.
Jun 22 20:30:53 2020-06-22 20:30:53,172 ddp_under_dist_autograd_test.py:327 INFO p:process 3 t:MainThread: Running the trainer #3...
Jun 22 20:30:53 2020-06-22 20:30:53,173 ddp_under_dist_autograd_test.py:329 INFO p:process 3 t:MainThread: Initing trainer process group by trainer #3 with ranks [0, 1, 2, 3]
Jun 22 20:30:53 2020-06-22 20:30:53,173 ddp_under_dist_autograd_test.py:327 INFO p:process 0 t:MainThread: Running the trainer #0...
Jun 22 20:30:53 2020-06-22 20:30:53,173 ddp_under_dist_autograd_test.py:346 INFO p:process 5 t:MainThread: Running the master process...
Jun 22 20:30:53 2020-06-22 20:30:53,173 ddp_under_dist_autograd_test.py:329 INFO p:process 0 t:MainThread: Initing trainer process group by trainer #0 with ranks [0, 1, 2, 3]
Jun 22 20:30:53 2020-06-22 20:30:53,176 ddp_under_dist_autograd_test.py:337 INFO p:process 0 t:MainThread: Waiting for shutdown signal on trainer #0...
Jun 22 20:30:53 2020-06-22 20:30:53,176 ddp_under_dist_autograd_test.py:360 INFO p:process 5 t:MainThread: Created remote rrefs on master
Jun 22 20:30:53 2020-06-22 20:30:53,198 ddp_under_dist_autograd_test.py:98 INFO p:process 4 t:Dummy-1: Initing RemoteEM with 2 3
Jun 22 20:30:53 2020-06-22 20:30:53,199 ddp_under_dist_autograd_test.py:124 INFO p:process 4 t:Dummy-2: Initing RemoteNet with 5 3
Jun 22 20:30:53 2020-06-22 20:30:53,200 ddp_under_dist_autograd_test.py:337 INFO p:process 2 t:MainThread: Waiting for shutdown signal on trainer #2...
Jun 22 20:30:53 2020-06-22 20:30:53,204 ddp_under_dist_autograd_test.py:337 INFO p:process 1 t:MainThread: Waiting for shutdown signal on trainer #1...
Jun 22 20:30:53 2020-06-22 20:30:53,204 ddp_under_dist_autograd_test.py:337 INFO p:process 3 t:MainThread: Waiting for shutdown signal on trainer #3...
Jun 22 20:31:53 ERROR:root:Caught exception:
Jun 22 20:31:53 Traceback (most recent call last):
Jun 22 20:31:53 File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 207, in wrapper
Jun 22 20:31:53 fn()
Jun 22 20:31:53 File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/dist_utils.py", line 93, in new_test_method
Jun 22 20:31:53 return_value = old_test_method(self, *arg, **kwargs)
Jun 22 20:31:53 File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/distributed/ddp_under_dist_autograd_test.py", line 445, in test_backward_ddp_inside
Jun 22 20:31:53 self._do_test(DdpMode.INSIDE)
Jun 22 20:31:53 File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/distributed/ddp_under_dist_autograd_test.py", line 424, in _do_test
Jun 22 20:31:53 self._master_process(ddp_mode)
Jun 22 20:31:53 File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/distributed/ddp_under_dist_autograd_test.py", line 361, in _master_process
Jun 22 20:31:53 self.do_test_on_master(ddp_mode, remote_em_rref, remote_net_rref)
Jun 22 20:31:53 File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/distributed/ddp_under_dist_autograd_test.py", line 393, in do_test_on_master
Jun 22 20:31:53 ddp_grads, non_ddp_grads = future.wait()
Jun 22 20:31:53 RuntimeError: RPCErr:1:RPC ran for more than 60000 milliseconds and timed out.
Jun 22 20:31:53 exiting process with exit code: 10
Jun 22 20:31:53 Process 5 terminated with exit code 10, terminating remaining processes.
Jun 22 20:31:53 ERROR (61.489s)
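For triage, here is a minimal sketch of a helper that pulls the RPC timeout and the first failing process out of a log like the one above. The function name `summarize_failure` and both regular expressions are hypothetical and written only against the two messages quoted in this report; they are not part of PyTorch, and other versions may word these messages differently.

```python
import re

# Patterns based on the two log lines quoted in this report (assumption:
# the message wording is stable; other PyTorch versions may differ).
TIMEOUT_RE = re.compile(r"RPC ran for more than (\d+) milliseconds and timed out")
EXIT_RE = re.compile(r"Process (\d+) terminated with exit code (\d+)")


def summarize_failure(log_text):
    """Return the RPC timeout (ms) and the first failing process/exit code, if present."""
    summary = {"timeout_ms": None, "process": None, "exit_code": None}
    for line in log_text.splitlines():
        timeout = TIMEOUT_RE.search(line)
        if timeout and summary["timeout_ms"] is None:
            summary["timeout_ms"] = int(timeout.group(1))
        exited = EXIT_RE.search(line)
        if exited and summary["process"] is None:
            summary["process"] = int(exited.group(1))
            summary["exit_code"] = int(exited.group(2))
    return summary


# Two lines copied from the log above.
sample = """\
Jun 22 20:31:53 RuntimeError: RPCErr:1:RPC ran for more than 60000 milliseconds and timed out.
Jun 22 20:31:53 Process 5 terminated with exit code 10, terminating remaining processes.
"""
print(summarize_failure(sample))
```

On the two sample lines this reports a 60000 ms timeout on process 5 with exit code 10, which matches the master process (process 5) timing out in `future.wait()`.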
cc @ezyang @gchanan @zou3519 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jjlilley