
test_worker_start_exception flaky on MacOS #4519

@crusaderky


For #4504, I ran stress tests of the test suite on GitHub Actions.

test_worker_start_exception:

  • on MacOS, it failed 6 times out of 44 runs
  • on Linux/Windows, it never failed in 144 runs
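As a rough sanity check on the numbers above, the snippet below asks: if Linux/Windows had the same per-run failure probability as MacOS, how likely would 144 clean runs be? It is only a sketch and assumes that runs are independent and identically distributed, which a CI environment does not strictly guarantee.

```python
import math

# Counts reported above for test_worker_start_exception.
mac_failures, mac_runs = 6, 44
other_runs = 144  # Linux + Windows runs, all of which passed

# Observed per-run failure rate on MacOS.
mac_rate = mac_failures / mac_runs

# Assumption: independent runs with the same failure rate on every platform.
# Probability of seeing zero failures in `other_runs` runs under that assumption.
p_zero = (1 - mac_rate) ** other_runs

print(f"MacOS failure rate: {mac_rate:.1%}")
print(f"P(0 failures in {other_runs} runs at that rate): {p_zero:.1e}")
```

The resulting probability is on the order of 1e-9, so under these assumptions the MacOS-only failures are very unlikely to be a sampling accident.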

Logs:
https://github.com/dask/distributed/pull/4504/checks?check_run_id=1926114137
https://github.com/dask/distributed/pull/4504/checks?check_run_id=1926114157
https://github.com/dask/distributed/pull/4504/checks?check_run_id=1931940247
https://github.com/dask/distributed/pull/4504/checks?check_run_id=1931940388
https://github.com/dask/distributed/pull/4504/checks?check_run_id=1931940432
https://github.com/dask/distributed/pull/4504/checks?check_run_id=1931940438

Sample log:

_________________________ test_worker_start_exception __________________________

cleanup = None

    @pytest.mark.asyncio
    async def test_worker_start_exception(cleanup):
        # make sure this raises the right Exception:
        with pytest.raises(StartException):
            async with Nanny("tcp://localhost:1", worker_class=BrokenWorker) as n:
>               await n.start()
E               Failed: DID NOT RAISE <class 'test_nanny.StartException'>

distributed/tests/test_nanny.py:576: Failed
----------------------------- Captured stderr call -----------------------------
distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 766, in run
    await worker
  File "/Users/runner/work/distributed/distributed/distributed/core.py", line 284, in _
    await self.start()
  File "/Users/runner/work/distributed/distributed/distributed/tests/test_nanny.py", line 568, in start
    raise StartException("broken")
test_nanny.StartException: broken
distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/Users/runner/work/distributed/distributed/distributed/nanny.py", line 766, in run
    await worker
  File "/Users/runner/work/distributed/distributed/distributed/core.py", line 284, in _
    await self.start()
  File "/Users/runner/work/distributed/distributed/distributed/tests/test_nanny.py", line 568, in start
    raise StartException("broken")
test_nanny.StartException: broken

Note that the exception is correctly raised and logged to stderr (twice), yet it is not propagated out of Nanny.start; the cause is still unknown.
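For reference, the contract the test relies on can be illustrated with a toy model: awaiting the object runs `start()`, `async with` awaits the object in `__aenter__`, and an exception from `start()` must reach the caller. `ToyServer` and `BrokenToyServer` are hypothetical stand-ins, not distributed's code; in particular, the real Nanny runs the worker in a separate process, which this sketch omits entirely, so it shows only the expected behaviour and not the cause of the flakiness.

```python
import asyncio


class StartException(Exception):
    """Stand-in for the exception defined in test_nanny.py."""


class ToyServer:
    """Hypothetical, simplified awaitable server (not distributed's Server)."""

    async def start(self):
        return self

    def __await__(self):
        # Awaiting the object runs start(), mirroring `await self.start()`.
        return self.start().__await__()

    async def __aenter__(self):
        await self  # entering the context starts the server
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return False  # never swallow exceptions


class BrokenToyServer(ToyServer):
    async def start(self):
        raise StartException("broken")


async def main():
    try:
        async with BrokenToyServer() as s:
            await s.start()  # not reached: __aenter__ already raised
    except StartException as e:
        return f"raised: {e}"
    return "DID NOT RAISE"


print(asyncio.run(main()))  # raised: broken
```

In this model the exception surfaces from `__aenter__`, so the body of the `async with` never runs. The failing log instead points at `await n.start()` inside the body, which suggests that both the implicit start on entry and the explicit call returned normally on the failing runs.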

Labels: flaky test (Intermittent failures on CI.)
