
test_gloo_backend_4gpu_module is flaky #30110

@mrshenli

Description

https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py3.6-clang7-rocmdeb-ubuntu16.04-test2/8007//console

18:57:48 test_gloo_backend_4gpu_module (__main__.DistributedDataParallelTest) ... Memory access fault by GPU node-9 (Agent handle: 0x22f2130) on address 0x7f4d5bf3d000. Reason: Page not present or supervisor privilege.
18:57:49 ERROR:root:Caught exception: 
18:57:49 Traceback (most recent call last):
18:57:49   File "/var/lib/jenkins/workspace/test/common_distributed.py", line 133, in wrapper
18:57:49     fn(self)
18:57:49   File "/var/lib/jenkins/workspace/test/common_distributed.py", line 46, in wrapper
18:57:49     return func(*args, **kwargs)
18:57:49   File "test_c10d.py", line 1982, in test_gloo_backend_4gpu_module
18:57:49     self._test_gloo_backend(devices, [], multi_device=True)
18:57:49   File "test_c10d.py", line 1950, in _test_gloo_backend
18:57:49     self._test_ddp_with_process_group(process_group, devices, device_ids, multi_device)
18:57:49   File "test_c10d.py", line 1907, in _test_ddp_with_process_group
18:57:49     process_group, devices, device_ids, global_batch_size)
18:57:49   File "test_c10d.py", line 1886, in _prepare_multi_device_module
18:57:49     bucket_cap_mb=0.001)
18:57:49   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 303, in __init__
18:57:49     self.broadcast_bucket_size)
18:57:49   File "/var/lib/jenkins/.local/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 485, in _distributed_broadcast_coalesced
18:57:49     dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
18:57:49 RuntimeError: [/var/lib/jenkins/workspace/third_party/gloo/gloo/transport/tcp/pair.cc:563] Read error [127.0.0.1]:52399: Connection reset by peer
19:06:51 ======================================================================
19:06:51 FAIL: test_gloo_backend_4gpu_module (__main__.DistributedDataParallelTest)
19:06:51 ----------------------------------------------------------------------
19:06:51 Traceback (most recent call last):
19:06:51   File "/var/lib/jenkins/workspace/test/common_distributed.py", line 130, in wrapper
19:06:51     self._join_processes(fn)
19:06:51   File "/var/lib/jenkins/workspace/test/common_distributed.py", line 211, in _join_processes
19:06:51     self._check_return_codes(elapsed_time)
19:06:51   File "/var/lib/jenkins/workspace/test/common_distributed.py", line 231, in _check_return_codes
19:06:51     self.assertEqual(p.exitcode, first_process.exitcode)
19:06:51   File "/var/lib/jenkins/workspace/test/common_utils.py", line 815, in assertEqual
19:06:51     super(TestCase, self).assertLessEqual(abs(x - y), prec, message)
19:06:51 AssertionError: 16 not less than or equal to 1e-05 : 
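
Reading the two tracebacks together: one worker hits the GPU memory access fault while DistributedDataParallel is being constructed (during the initial parameter broadcast in _distributed_broadcast_coalesced), its peer then sees the gloo "Connection reset by peer" error, and the harness later fails because the workers' exit codes differ (assertEqual compares them numerically, hence the odd "16 not less than or equal to 1e-05" message). For anyone trying to reproduce this outside of test_c10d.py, below is a minimal sketch of what the test exercises, not the actual test code; the two-process/four-GPU layout, the file-store rendezvous, and the placeholder TwoDeviceModule are assumptions, while the gloo backend and bucket_cap_mb=0.001 come from the traceback.

```python
# Minimal repro sketch (assumed setup, not the actual test_c10d.py code):
# 2 processes x 2 GPUs each, gloo backend, multi-device module wrapped in DDP
# with the same tiny bucket_cap_mb the test uses.
import tempfile

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


def run(rank, world_size, init_file):
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{init_file}",
        rank=rank,
        world_size=world_size,
    )
    # Each process owns two GPUs, so the module spans devices and
    # device_ids is left unset (multi-device DDP).
    devices = [torch.device(f"cuda:{2 * rank}"),
               torch.device(f"cuda:{2 * rank + 1}")]

    class TwoDeviceModule(nn.Module):  # placeholder module, not the test's
        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(2, 10).to(devices[0])
            self.fc2 = nn.Linear(10, 4).to(devices[1])

        def forward(self, x):
            x = self.fc1(x.to(devices[0]))
            return self.fc2(x.to(devices[1]))

    # The crash in the log happens inside this constructor, while DDP
    # broadcasts the initial parameters across ranks.
    model = DistributedDataParallel(TwoDeviceModule(), bucket_cap_mb=0.001)

    out = model(torch.randn(8, 2))
    out.sum().backward()
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2  # requires 4 GPUs total
    init_file = tempfile.NamedTemporaryFile(delete=False).name
    mp.spawn(run, args=(world_size, init_file), nprocs=world_size)
```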

cc @ezyang @gchanan @zou3519 @jerryzh168 @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528

    Labels

    high priority
    module: flaky-tests
    oncall: distributed
    triaged
