-
-
Notifications
You must be signed in to change notification settings - Fork 750
Join Nanny watch thread #6146
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Join Nanny watch thread #6146
Conversation
I hope that this helps to avoid interrmittent failures like the following: ``` E AssertionError: (<Thread(AsyncProcess Dask Worker process (from Nanny) watch process join, started daemon 123145591922688)>, [' File ...3.8/multiprocessing/popen_fork.py", line 47, in wait E \treturn self.poll(os.WNOHANG if timeout == 0.0 else 0) E ', ...]) E assert False ``` https://github.com/dask/distributed/runs/6048484591?check_suite_focus=true
| with suppress(ValueError): | ||
| child_stop_q.close() # probably redundant | ||
| child_stop_q.join_thread() | ||
| thread.join(timeout=2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty happy that these are introduced from the POV that there should be explicit attempts to cleanup resources. Even more so if tornado.IOLoop.run_sync can raise errors other than TimeoutError and KeyboardInterrupt.
I would suggest removal of # probably redundant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eh, they probably are redundant. It's just that occasionally they're not. The failure here is very rare. I'd like to call that out.
| n = await Nanny(s.address, nthreads=2, loop=s.loop) | ||
| while len(s.workers) < 3: | ||
| await asyncio.sleep(0.1) | ||
| async with Nanny(s.address, nthreads=2) as n: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nit for my own comprehension -- can we always assume that s.loop is IOLoop.current()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inside any async context, yes.
| assert not a.has_what.get(n_worker_address) | ||
| assert not any( | ||
| n_worker_address in s for ts in a.tasks.values() for s in ts.who_has | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A local vscode diff indicates that these changes are merely indents (github doesn't do such a good job here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup.
Introduced in dask#6146. We don't know why this code path is only being triggered now.
Introduced in dask#6146. We don't know why this code path is only being triggered now.
I hope that this helps to avoid interrmittent failures like the
following:
https://github.com/dask/distributed/runs/6048484591?check_suite_focus=true
pre-commit run --all-files