DISABLED test_backward_ddp_inside (__main__.ProcessGroupDdpUnderDistAutogradTestWithSpawn) #40434

@mrshenli

https://app.circleci.com/pipelines/github/pytorch/pytorch/184349/workflows/b01c5601-2b54-4717-89a6-0e66284d487c/jobs/5970923/steps

Jun 22 20:31:55 ======================================================================
Jun 22 20:31:55 ERROR [61.489s]: test_backward_ddp_inside (__main__.TestDdpUnderDistAutogradWrapper)
Jun 22 20:31:55 ----------------------------------------------------------------------
Jun 22 20:31:55 Traceback (most recent call last):
Jun 22 20:31:55   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 204, in wrapper
Jun 22 20:31:55     self._join_processes(fn)
Jun 22 20:31:55   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 306, in _join_processes
Jun 22 20:31:55     self._check_return_codes(elapsed_time)
Jun 22 20:31:55   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 339, in _check_return_codes
Jun 22 20:31:55     raise RuntimeError(error)
Jun 22 20:31:55 RuntimeError: Processes 5 exited with error code 10
Jun 22 20:30:53   test_backward_ddp_inside (__main__.TestDdpUnderDistAutogradWrapper) ... 2020-06-22 20:30:53,172 ddp_under_dist_autograd_test.py:327 INFO p:process 2 t:MainThread: Running the trainer #2...
Jun 22 20:30:53 2020-06-22 20:30:53,172 ddp_under_dist_autograd_test.py:329 INFO p:process 2 t:MainThread: Initing trainer process group by trainer #2 with ranks [0, 1, 2, 3]
Jun 22 20:30:53 2020-06-22 20:30:53,172 ddp_under_dist_autograd_test.py:327 INFO p:process 1 t:MainThread: Running the trainer #1...
Jun 22 20:30:53 2020-06-22 20:30:53,172 ddp_under_dist_autograd_test.py:329 INFO p:process 1 t:MainThread: Initing trainer process group by trainer #1 with ranks [0, 1, 2, 3]
Jun 22 20:30:53 2020-06-22 20:30:53,172 ddp_under_dist_autograd_test.py:314 INFO p:process 4 t:MainThread: The remote worker is running.
Jun 22 20:30:53 2020-06-22 20:30:53,172 ddp_under_dist_autograd_test.py:327 INFO p:process 3 t:MainThread: Running the trainer #3...
Jun 22 20:30:53 2020-06-22 20:30:53,173 ddp_under_dist_autograd_test.py:329 INFO p:process 3 t:MainThread: Initing trainer process group by trainer #3 with ranks [0, 1, 2, 3]
Jun 22 20:30:53 2020-06-22 20:30:53,173 ddp_under_dist_autograd_test.py:327 INFO p:process 0 t:MainThread: Running the trainer #0...
Jun 22 20:30:53 2020-06-22 20:30:53,173 ddp_under_dist_autograd_test.py:346 INFO p:process 5 t:MainThread: Running the master process...
Jun 22 20:30:53 2020-06-22 20:30:53,173 ddp_under_dist_autograd_test.py:329 INFO p:process 0 t:MainThread: Initing trainer process group by trainer #0 with ranks [0, 1, 2, 3]
Jun 22 20:30:53 2020-06-22 20:30:53,176 ddp_under_dist_autograd_test.py:337 INFO p:process 0 t:MainThread: Waiting for shutdown signal on trainer #0...
Jun 22 20:30:53 2020-06-22 20:30:53,176 ddp_under_dist_autograd_test.py:360 INFO p:process 5 t:MainThread: Created remote rrefs on master
Jun 22 20:30:53 2020-06-22 20:30:53,198 ddp_under_dist_autograd_test.py:98 INFO p:process 4 t:Dummy-1: Initing RemoteEM with 2 3
Jun 22 20:30:53 2020-06-22 20:30:53,199 ddp_under_dist_autograd_test.py:124 INFO p:process 4 t:Dummy-2: Initing RemoteNet with 5 3
Jun 22 20:30:53 2020-06-22 20:30:53,200 ddp_under_dist_autograd_test.py:337 INFO p:process 2 t:MainThread: Waiting for shutdown signal on trainer #2...
Jun 22 20:30:53 2020-06-22 20:30:53,204 ddp_under_dist_autograd_test.py:337 INFO p:process 1 t:MainThread: Waiting for shutdown signal on trainer #1...
Jun 22 20:30:53 2020-06-22 20:30:53,204 ddp_under_dist_autograd_test.py:337 INFO p:process 3 t:MainThread: Waiting for shutdown signal on trainer #3...
Jun 22 20:31:53 ERROR:root:Caught exception: 
Jun 22 20:31:53 Traceback (most recent call last):
Jun 22 20:31:53   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/common_distributed.py", line 207, in wrapper
Jun 22 20:31:53     fn()
Jun 22 20:31:53   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/dist_utils.py", line 93, in new_test_method
Jun 22 20:31:53     return_value = old_test_method(self, *arg, **kwargs)
Jun 22 20:31:53   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/distributed/ddp_under_dist_autograd_test.py", line 445, in test_backward_ddp_inside
Jun 22 20:31:53     self._do_test(DdpMode.INSIDE)
Jun 22 20:31:53   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/distributed/ddp_under_dist_autograd_test.py", line 424, in _do_test
Jun 22 20:31:53     self._master_process(ddp_mode)
Jun 22 20:31:53   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/distributed/ddp_under_dist_autograd_test.py", line 361, in _master_process
Jun 22 20:31:53     self.do_test_on_master(ddp_mode, remote_em_rref, remote_net_rref)
Jun 22 20:31:53   File "/Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/torch/testing/_internal/distributed/ddp_under_dist_autograd_test.py", line 393, in do_test_on_master
Jun 22 20:31:53     ddp_grads, non_ddp_grads = future.wait()
Jun 22 20:31:53 RuntimeError: RPCErr:1:RPC ran for more than 60000 milliseconds and timed out.
Jun 22 20:31:53 exiting process with exit code: 10
Jun 22 20:31:53 Process 5 terminated with exit code 10, terminating remaining processes.
Jun 22 20:31:53 ERROR (61.489s)
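
For triage, the failing process and the timeout value can be pulled out of a log like the one above mechanically. The following is a minimal sketch, assuming log lines shaped like those in this report; the function name `summarize_failure` and the regular expressions are illustrative and not part of PyTorch's test harness.

```python
import re

# All patterns are assumptions derived from the log lines in this report only.
EXIT_RE = re.compile(r"Process (\d+) terminated with exit code (\d+)")
TIMEOUT_RE = re.compile(r"RPC ran for more than (\d+) milliseconds and timed out")
ROLE_RE = re.compile(
    r"p:process (\d+) t:\S+: "
    r"(Running the trainer #\d+|The remote worker is running|Running the master process)"
)


def summarize_failure(log_text):
    """Return the failing process, its role, its exit code, and the RPC timeout in ms."""
    # Map each process number to the role it announced at startup.
    roles = {}
    for match in ROLE_RE.finditer(log_text):
        roles[int(match.group(1))] = match.group(2)

    exit_match = EXIT_RE.search(log_text)
    timeout_match = TIMEOUT_RE.search(log_text)
    rank = int(exit_match.group(1)) if exit_match else None
    return {
        "rank": rank,
        "role": roles.get(rank),
        "exit_code": int(exit_match.group(2)) if exit_match else None,
        "rpc_timeout_ms": int(timeout_match.group(1)) if timeout_match else None,
    }


# Three lines copied from the log in this report.
sample = """\
Jun 22 20:30:53 2020-06-22 20:30:53,173 ddp_under_dist_autograd_test.py:346 INFO p:process 5 t:MainThread: Running the master process...
Jun 22 20:31:53 RuntimeError: RPCErr:1:RPC ran for more than 60000 milliseconds and timed out.
Jun 22 20:31:53 Process 5 terminated with exit code 10, terminating remaining processes.
"""
summary = summarize_failure(sample)
print(summary)
```

Read this way, the log points at process 5, the master, which hit the 60000 ms RPC timeout while blocked in `future.wait()`; the four trainers and the remote worker were still running when it exited.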

cc @ezyang @gchanan @zou3519 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jjlilley

Metadata

Labels

high priority
module: flaky-tests (Problem is a flaky test in CI)
module: rpc (Related to RPC, distributed autograd, RRef, and distributed optimizer)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
triage review
triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)
