Closed
Labels
high priority, module: flaky-tests, oncall: distributed, triaged
Description
18:57:48 test_gloo_backend_4gpu_module (__main__.DistributedDataParallelTest) ... Memory access fault by GPU node-9 (Agent handle: 0x22f2130) on address 0x7f4d5bf3d000. Reason: Page not present or supervisor privilege.
18:57:49 ERROR:root:Caught exception:
18:57:49 Traceback (most recent call last):
18:57:49 File "/var/lib/jenkins/workspace/test/common_distributed.py", line 133, in wrapper
18:57:49 fn(self)
18:57:49 File "/var/lib/jenkins/workspace/test/common_distributed.py", line 46, in wrapper
18:57:49 return func(*args, **kwargs)
18:57:49 File "test_c10d.py", line 1982, in test_gloo_backend_4gpu_module
18:57:49 self._test_gloo_backend(devices, [], multi_device=True)
18:57:49 File "test_c10d.py", line 1950, in _test_gloo_backend
18:57:49 self._test_ddp_with_process_group(process_group, devices, device_ids, multi_device)
18:57:49 File "test_c10d.py", line 1907, in _test_ddp_with_process_group
18:57:49 process_group, devices, device_ids, global_batch_size)
18:57:49 File "test_c10d.py", line 1886, in _prepare_multi_device_module
18:57:49 bucket_cap_mb=0.001)
18:57:49 File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 303, in __init__
18:57:49 self.broadcast_bucket_size)
18:57:49 File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 485, in _distributed_broadcast_coalesced
18:57:49 dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
18:57:49 RuntimeError: [/var/lib/jenkins/workspace/third_party/gloo/gloo/transport/tcp/pair.cc:563] Read error [127.0.0.1]:52399: Connection reset by peer
19:06:51 ======================================================================
19:06:51 FAIL: test_gloo_backend_4gpu_module (__main__.DistributedDataParallelTest)
19:06:51 ----------------------------------------------------------------------
19:06:51 Traceback (most recent call last):
19:06:51 File "/var/lib/jenkins/workspace/test/common_distributed.py", line 130, in wrapper
19:06:51 self._join_processes(fn)
19:06:51 File "/var/lib/jenkins/workspace/test/common_distributed.py", line 211, in _join_processes
19:06:51 self._check_return_codes(elapsed_time)
19:06:51 File "/var/lib/jenkins/workspace/test/common_distributed.py", line 231, in _check_return_codes
19:06:51 self.assertEqual(p.exitcode, first_process.exitcode)
19:06:51 File "/var/lib/jenkins/workspace/test/common_utils.py", line 815, in assertEqual
19:06:51 super(TestCase, self).assertLessEqual(abs(x - y), prec, message)
19:06:51 AssertionError: 16 not less than or equal to 1e-05 :
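The final assertion message is easier to read once you see that `assertEqual` in `common_utils.py` compares numbers as `abs(x - y) <= prec`, so the "16" is the absolute difference between two subprocess exit codes, not an exit code itself. Below is a minimal sketch of that comparison. The helper names and the exit codes `10` and `-6` are hypothetical, chosen only because they differ by 16; the log does not show the actual codes.

```python
def assert_equal_with_tolerance(x, y, prec=1e-5):
    """Numeric equality check of the form seen in the traceback: abs(x - y) <= prec."""
    diff = abs(x - y)
    if diff > prec:
        raise AssertionError(f"{diff} not less than or equal to {prec}")


def check_return_codes(exit_codes):
    """Require every subprocess to exit with the same code as the first one."""
    first = exit_codes[0]
    for code in exit_codes[1:]:
        assert_equal_with_tolerance(code, first)


if __name__ == "__main__":
    # Hypothetical exit codes: three ranks exit with 10, one rank is
    # killed by a signal and reports -6. Only the difference (16) is
    # taken from the log; the individual values are assumptions.
    try:
        check_return_codes([10, -6, 10, 10])
    except AssertionError as err:
        print(f"AssertionError: {err}")
```

Under this reading, the failure means one rank terminated differently from the others (consistent with the GPU memory access fault on L8 of the log), and the `Connection reset by peer` error on the other ranks would be a symptom of that rank going away rather than the root cause.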
cc @ezyang @gchanan @zou3519 @jerryzh168 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528