@fail_hard can kill the whole test suite; hide errors (#6474)
Conversation
    total = c.submit(sum, L)
    result = await total
    assert result == sum(map(inc, range(10)))
    kill_task = asyncio.create_task(n.kill())
- Do not leave an uncollected task at the end of the test.
- Test the two use cases separately: failed worker -> other workers vs. other workers -> failed worker (see the sketch after this list).
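For reference, a minimal structural sketch of such a split, assuming plain gen_cluster fixtures; it omits the actual worker kill, and the test name and the compute_on_failed parameter are placeholders rather than the real test:

```python
import pytest

from distributed.utils_test import gen_cluster


# Run each direction as its own test case instead of interleaving them.
@pytest.mark.parametrize("compute_on_failed", [False, True])
@gen_cluster(client=True)
async def test_kill_worker_direction(c, s, a, b, compute_on_failed):
    # Pin the reduction to one worker or the other, depending on whether we
    # want to exercise "failed worker -> others" or the reverse.
    target = a.address if compute_on_failed else b.address
    future = c.submit(sum, range(10), workers=[target], allow_other_workers=True)
    assert await future == sum(range(10))
```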
This test seems very timing-sensitive: how far through is the n.kill() task, and what state is the Nanny in, by the time Client._gather_remote() gets called?
E.g., is the test still valid if asyncio.gather() is used to express the concurrency here?
    compute_addr = n.worker_address if compute_on_failed else a.address

    async def compute_total():
        return await c.submit(sum, L, workers=[compute_addr], allow_other_workers=True)

    total, _ = await asyncio.gather(compute_total(), n.kill())
    assert total == sum(range(1, 11))
The test is indeed very timing-sensitive, and as I discovered elsewhere it can take a good 100+ runs to trigger the unexpected behaviour. I don't plan to fix this within the scope of this PR; the change here is just to clean up the code.
        or not self.batched_stream.comm
        or self.batched_stream.comm.closed()
    ):
        return  # pragma: nocover
To be removed after #6475
Unit Test Results: 15 files ±0, 15 suites ±0, 6h 12m 15s ⏱️ (-21m 39s). For more details on these failures, see this check. Results for commit 3efcf2c; comparison against base commit 5feb171. ♻️ This comment has been updated with latest results.
    # deadlocks the cluster.
    if not self.nanny:
        # We're likely in a unit test. Don't kill the whole test suite!
        raise
There are many cases where folks don't use a nanny in production, but we would still want to raise here; "no nanny" doesn't imply "in tests".
Instead, I recommend checking a config value here and then setting that config value in testing infrastructure.
Alternatively, maybe we set a global in utils_test for TESTING or something like that.
Also alternatively, self.validate is probably a good signal that we're in tests today.
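A rough sketch of the config-value approach, assuming a hypothetical key distributed.admin.in-test-suite (not an existing Dask setting) and a hypothetical handle_unexpected_error helper standing in for the @fail_hard error path:

```python
import dask


def handle_unexpected_error(exc: Exception) -> None:
    """Stand-in for the @fail_hard error path, guarded by a config flag."""
    # Hypothetical config key; it does not exist in Dask today.
    if dask.config.get("distributed.admin.in-test-suite", default=False):
        # Test suite: re-raise so the test fails loudly.
        raise exc
    # Production: fall through to the restart / kill-the-worker behaviour.


# The testing infrastructure (e.g. a conftest fixture or gen_cluster itself)
# would enable the flag:
with dask.config.set({"distributed.admin.in-test-suite": True}):
    try:
        handle_unexpected_error(RuntimeError("boom"))
    except RuntimeError:
        pass  # the error surfaced instead of being swallowed
```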
Added a check for self.validate.
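One possible shape of the resulting condition, sketched as a standalone helper (the actual patch may word it differently; should_reraise is a hypothetical name):

```python
def should_reraise(nanny, validate: bool) -> bool:
    """Re-raise when there is no nanny to restart us, or when Worker.validate
    is on, which today effectively means 'running under the test harness'."""
    return nanny is None or validate


assert should_reraise(nanny=None, validate=False)  # no nanny: surface the error
assert should_reraise(nanny="tcp://127.0.0.1:1234", validate=True)  # tests: surface the error
assert not should_reraise(nanny="tcp://127.0.0.1:1234", validate=False)  # production with a nanny
```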
    ) from e

    def validate_state(self):
        if self.status not in WORKER_ANY_RUNNING:
Can this not happen anymore?
Yes, it does happen after @fail_hard closes the worker, and that's why I'm removing it.
With it, a test may be green even though a worker has suicided.
In production, the cluster must be transparently resilient to failures on individual workers (that's the philosophy behind fail_hard); in tests, it really shouldn't be.
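For illustration, a self-contained paraphrase of the guard in question (simplified stand-ins, not the actual Worker code): with the early return in place, a closed worker skips validation entirely, so a suicided worker can still yield a green test.

```python
from enum import Enum, auto


class Status(Enum):
    running = auto()
    closed = auto()


# Simplified stand-in for the real WORKER_ANY_RUNNING set.
WORKER_ANY_RUNNING = {Status.running}


class WorkerSketch:
    status = Status.closed  # e.g. after @fail_hard closed the worker

    def validate_state(self):
        if self.status not in WORKER_ANY_RUNNING:
            # The early return this PR removes: a closed worker was never validated.
            return
        # In the real Worker, the full set of validation checks runs here.


WorkerSketch().validate_state()  # returns silently even though the worker is closed
```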
- When a worker is killed by @fail_hard in the test suite, the whole test suite is killed with an opaque 'exit 1'.
- When a worker in a @gen_cluster test has been closed for whatever reason - namely, by @fail_hard - it is spared from state validation, potentially resulting in a green test.