-
-
Notifications
You must be signed in to change notification settings - Fork 749
Remove worker reconnect #6361
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove worker reconnect #6361
Conversation
Unit Test Results 15 files ± 0 15 suites ±0 7h 7m 11s ⏱️ - 5m 36s For more details on these failures, see this check. Results for commit 305a23f. ± Comparison against base commit ff94776. ♻️ This comment has been updated with latest results. |
3ce0c96 to
c708ef0
Compare
The Nanny can still restart the worker process.
…mid_compute_multiple_states_on_scheduler
…sk_memory_with_resources`
821b99e to
7c769cd
Compare
|
Everything is green except one flaky That feels flaky to me, but I guess it's possible this is related? |
| # Remove suspicious workers from the scheduler and shut them down. | ||
| await asyncio.gather( | ||
| *( | ||
| self.remove_worker( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is maybe heavy-handed. retire_workers would probably be less disruptive, but slower and a little more complex. I don't love losing all the other keys on a worker just because it's missing one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind be heavy handed, especially if things are in a wonky state.
mrocklin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incomplete review, but I'm stuck in meetings for the next bit
| # Remove suspicious workers from the scheduler and shut them down. | ||
| await asyncio.gather( | ||
| *( | ||
| self.remove_worker( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind be heavy handed, especially if things are in a wonky state.
|
In general I'm ok with what's here |
gjoseph92
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing we discussed offline was whether the Nanny should restart the worker in cases when the connection to the scheduler is broken unexpectedly. The answer is probably yes, but as this PR stands, Worker.close will always tell the Nanny it's closing gracefully, preventing restart:
distributed/distributed/worker.py
Lines 1475 to 1477 in ff94776
| if nanny and self.nanny: | |
| with self.rpc(self.nanny) as r: | |
| await r.close_gracefully() |
As a follow-up PR, we should add an option to Worker.close to control this behavior.
I may not be understanding the statement above, but we already have the |
Also make worker log whatever error the scheduler gives it upon connection
Great point, I just missed that. Yes, we can set |
|
@mrocklin actually, I think maybe we should do this in a follow-up PR. It's pretty straightforward, but involves a little bit of thinking and a few more tests (the nanny is going to try to report the worker closure to the scheduler at the same time the worker's connection is breaking from it shutting down). Easy enough, but I'd rather focus on the core change here and not increase the scope. |
`reconnect=True` (previous default) is now the only option. This is not a necessary change to make. It just simplifies things not not have it. See discussion in dask#6361 (comment).
Unnecessary after dask#6361 is merged
was leaking file descriptors, and needed an event loop
|
Failures (both seem flaky):
|
|
Thanks @gjoseph92 . This is in. |
When a worker disconnects from the scheduler, close it immediately instead of trying to reconnect.
Also prohibit workers from joining if they have data in memory, as an alternative to #6341.
Just seeing what CI does for now, to figure out which tests to change/remove.
Closes #6350
pre-commit run --all-files