test_nccl_errors_blocking_clean_exit is flaky #31924

@mrshenli

https://app.circleci.com/jobs/github/pytorch/pytorch/4155093

Jan 07 20:36:55 ======================================================================
Jan 07 20:36:55 FAIL [3.205s]: test_nccl_errors_blocking_clean_exit (__main__.NcclErrorHandlingTest)
Jan 07 20:36:55 ----------------------------------------------------------------------
Jan 07 20:36:55 Traceback (most recent call last):
Jan 07 20:36:55   File "/var/lib/jenkins/workspace/test/common_distributed.py", line 130, in wrapper
Jan 07 20:36:55     self._join_processes(fn)
Jan 07 20:36:55   File "/var/lib/jenkins/workspace/test/common_distributed.py", line 211, in _join_processes
Jan 07 20:36:55     self._check_return_codes(elapsed_time)
Jan 07 20:36:55   File "/var/lib/jenkins/workspace/test/common_distributed.py", line 231, in _check_return_codes
Jan 07 20:36:55     self.assertEqual(p.exitcode, first_process.exitcode)
Jan 07 20:36:55   File "/var/lib/jenkins/workspace/test/common_utils.py", line 888, in assertEqual
Jan 07 20:36:55     super(TestCase, self).assertLessEqual(abs(x - y), prec, message)
Jan 07 20:36:55 AssertionError: 11 not less than or equal to 1e-05 : 
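
A note on reading the assertion message: the traceback shows `_check_return_codes` comparing each process's `exitcode` with the first process's, and the test suite's `assertEqual` turning that into a numeric tolerance check, `abs(x - y) <= prec`. The `11` is therefore the absolute difference between two exit codes, not an exit code itself. The sketch below reproduces the message with plain `unittest`. The helper name `exit_code_mismatch_message` and the example exit codes are hypothetical, since the log does not record the actual values.

```python
import unittest


def exit_code_mismatch_message(exitcode, first_exitcode, prec=1e-5):
    """Mimic the tolerance-based comparison visible in the traceback.

    Returns the AssertionError text when the codes differ by more than
    `prec`, or None when they match. This is an illustrative helper,
    not code from the PyTorch test suite.
    """
    case = unittest.TestCase()
    try:
        # Same shape as the call in the traceback:
        # assertLessEqual(abs(x - y), prec, message)
        case.assertLessEqual(abs(exitcode - first_exitcode), prec, "")
    except AssertionError as err:
        return str(err)
    return None


if __name__ == "__main__":
    # Hypothetical pair of exit codes that differ by 11.
    print(exit_code_mismatch_message(-11, 0))
    # Matching exit codes pass the check.
    print(exit_code_mismatch_message(0, 0))
```

Any pair of exit codes that are 11 apart yields the same message. One candidate is `0` and `-11`: Python's `multiprocessing` reports a child killed by signal N as `-N`, and signal 11 is SIGSEGV on Linux. The log alone does not say which pair actually occurred.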

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @ezyang @gchanan @zou3519

Labels

high priority, module: flaky-tests, oncall: distributed, triage review, triaged
