Conversation

@crusaderky crusaderky commented May 30, 2022

  • Fix an issue where, if a Worker not wrapped by a Nanny trips @fail_hard in the test suite, the whole test suite is killed with an opaque "exit 1" (sketched below).
  • Fix an issue where, after a worker started by @gen_cluster has been closed for whatever reason (notably, by @fail_hard), it is spared from state validation, potentially resulting in a falsely green test.
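For context, here is a minimal sketch of what a fail_hard-style decorator does and why an un-nannied, in-process worker can take down pytest. This is a simplification for illustration, not the actual distributed implementation:

import sys
from functools import wraps

def fail_hard(method):
    # Sketch: treat any unexpected exception in a Worker method as fatal.
    @wraps(method)
    def wrapper(self, *args, **kwargs):
        try:
            return method(self, *args, **kwargs)
        except Exception:
            # With a Nanny, the worker process can be killed and restarted.
            # Without one - as in @gen_cluster tests, where the Worker lives
            # inside the pytest process itself - exiting here kills the
            # whole test run with an opaque "exit 1".
            sys.exit(1)
    return wrapper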

total = c.submit(sum, L)
result = await total
assert result == sum(map(inc, range(10)))
kill_task = asyncio.create_task(n.kill())
@crusaderky (Collaborator, Author) commented May 30, 2022

  • Do not leave an uncollected task at the end of the test.
  • Test the two use cases separately: failed worker -> other workers vs. other workers -> failed worker.

Member commented:

This test seems very timing sensitive: how far through is the n.kill() task, and what state is the Nanny in, by the time Client._gather_remote() gets called?

Member commented:

E.g., is the test still valid if asyncio.gather() is used to communicate the concurrency here?

        compute_addr = n.worker_address if compute_on_failed else a.address

        async def compute_total():
            return await c.submit(sum, L, workers=[compute_addr], allow_other_workers=True)

        total, _ = await asyncio.gather(compute_total(), n.kill())
        assert total == sum(range(1, 11))

@crusaderky (Collaborator, Author) replied:

The test is indeed very timing sensitive; as I discovered elsewhere, it can need to run a good 100+ times to trigger the unexpected behaviour. I don't plan to fix that within the scope of this PR; the change here is just to clean up the code.

crusaderky added a commit to crusaderky/distributed that referenced this pull request May 30, 2022
    or not self.batched_stream.comm
    or self.batched_stream.comm.closed()
):
    return  # pragma: nocover
@crusaderky (Collaborator, Author) commented:

To be removed after #6475
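For readability, here is the full guard this hunk appears to belong to; the opening line is cut off in the excerpt above, so the `if (` line and its first clause are an assumption:

# Assumed shape of the complete guard (first clause reconstructed):
if (
    not self.batched_stream
    or not self.batched_stream.comm
    or self.batched_stream.comm.closed()
):
    return  # pragma: nocover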

github-actions bot commented May 30, 2022

Unit Test Results

    15 files  ±0       15 suites  ±0      6h 12m 15s ⏱️ −21m 39s
 2 820 tests  +1     2 739 ✔️  +3       80 💤  ±0      1 ❌  −2
20 910 runs  +14    19 916 ✔️  −42     993 💤  +58     1 ❌  −2

For more details on these failures, see this check.

Results for commit 3efcf2c. ± Comparison against base commit 5feb171.

♻️ This comment has been updated with latest results.

@crusaderky crusaderky marked this pull request as ready for review May 30, 2022 19:14
@crusaderky crusaderky self-assigned this May 30, 2022
@crusaderky crusaderky linked an issue May 30, 2022 that may be closed by this pull request
# deadlocks the cluster.
if not self.nanny:
    # We're likely in a unit test. Don't kill the whole test suite!
    raise
Member commented:

There are many cases where folks don't use a nanny in production, but we would still want to raise; "no nanny" doesn't imply "in tests".

Instead, I recommend checking a config value here and then setting that config value in testing infrastructure.

Alternatively, maybe we set a global in utils_test for TESTING or something like that.

Also alternatively, self.validate is probably a good signal that we're in tests today.
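A minimal sketch of the config-value suggestion, assuming a hypothetical key name ("distributed.admin.testing" is invented here for illustration; it is not an existing setting):

import sys

import dask.config

def handle_fatal_error(exc: Exception) -> None:
    # Hypothetical helper sketching the suggestion above. The test
    # infrastructure would set this (invented) config key to True.
    if dask.config.get("distributed.admin.testing", False):
        raise exc  # in the test suite: fail this one test loudly
    sys.exit(1)    # in production without a nanny: kill the process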

@crusaderky (Collaborator, Author) replied:

Added check for self.validate
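A simplified sketch of the adopted approach (hypothetical helper name, not the verbatim PR diff); Worker.validate is enabled by the test harness, so it doubles as an "are we in the test suite?" signal:

def _on_fail_hard(self, exc: Exception) -> None:
    # Hypothetical helper name; sketches the adopted error path.
    if self.validate:
        # @gen_cluster enables validate: re-raise so the single test
        # fails loudly instead of killing the whole pytest process.
        raise exc
    # Otherwise fall through to the production behaviour (close the
    # worker, letting the nanny restart it when there is one).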

) from e

def validate_state(self):
    if self.status not in WORKER_ANY_RUNNING:
Collaborator commented:

Can this not happen anymore?

@crusaderky (Collaborator, Author) replied:

Yes, it does happen after @fail_hard closes the worker, and that's why I'm removing it. With this guard in place, a test may be green even though a worker has suicided.

In production, the cluster must be transparently resilient to failures of individual workers (that's the philosophy behind fail_hard). In tests, it really, really shouldn't be: a worker suicide must fail the test.
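To make the removal concrete, a simplified sketch of the guard under discussion (reconstructed from the excerpt above, not the exact diff):

def validate_state(self):
    if self.status not in WORKER_ANY_RUNNING:
        return  # <- this early exit is what the PR removes
    # ... state invariant checks follow ...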

@crusaderky crusaderky merged commit e16ed18 into dask:main Jun 1, 2022
@crusaderky crusaderky deleted the fail_hard branch June 1, 2022 21:01

Development

Successfully merging this pull request may close these issues.

Yank state machine out of Worker class
