Description
Hello,
It is pretty hard to reproduce this at the moment, but the problem is happening often enough for us to stop ignoring it.
Setup:
- 100 16-core boxes, each running a single worker started with --resources 'cpu=16'
- 3 task types: A: {'cpu': 16}, B: {'cpu': 4}, C: {'cpu': 1}. Tasks B and C are Python subprocess calls, and task A is a function that ends in an sklearn .fit() call (a rough submission sketch follows below this list)
- work stealing is enabled
- there are about 2000+ tasks in the queue normally
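
For concreteness, here is roughly how the work gets onto the cluster. The scheduler address, function bodies, and sizes below are placeholders, not our actual code; the only part that matters is that every submit carries a 'cpu' resource requirement matching the workers' --resources 'cpu=16':

```python
import subprocess
from dask.distributed import Client

client = Client("tcp://scheduler-host:8786")  # placeholder address

def fit_model(n):
    # Task type A: a function that ends in an sklearn .fit() call
    import numpy as np
    from sklearn.linear_model import LinearRegression
    X, y = np.random.rand(n, 10), np.random.rand(n)
    return LinearRegression().fit(X, y)

def run_cmd(cmd):
    # Task types B and C: plain subprocess calls
    return subprocess.run(cmd, shell=True).returncode

futures = [client.submit(fit_model, 100_000, resources={"cpu": 16})]
futures += [client.submit(run_cmd, "some-b-command", resources={"cpu": 4}, pure=False)
            for _ in range(10)]
futures += [client.submit(run_cmd, "some-c-command", resources={"cpu": 1}, pure=False)
            for _ in range(100)]
```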
Observed bug:
At some point the available_resources dictionary in the worker object becomes incorrect. We are not sure what leads to this state, but something appears to be broken in the state-transition mechanism, so that a particular worker looks as if it has more available resources than it actually does. This causes over-scheduling of tasks onto that worker.
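
To confirm the mismatch, we poll worker internals roughly like the sketch below. Note that available_resources, total_resources, and the tasks mapping are internal Worker attributes in the version we run, so this is a best-effort diagnostic rather than a supported API:

```python
from dask.distributed import Client

client = Client("tcp://scheduler-host:8786")  # placeholder address

def resource_snapshot(dask_worker):
    # Runs on each worker; relies on internal attributes of distributed.Worker
    executing = [key for key, ts in dask_worker.tasks.items()
                 if ts.state == "executing"]
    return {
        "available": dict(dask_worker.available_resources),
        "total": dict(dask_worker.total_resources),
        "executing": executing,
    }

for addr, snap in client.run(resource_snapshot).items():
    print(addr, snap)
```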
Example:
2021-01-21 07:52:02.021 INFO: {'tcp://172.24.0.144:39033': (9,)}, available_resources: {'cpu': 3.0}, used_calculated_outside_dask: 24
2021-01-21 07:52:02.021 INFO: i('A-f29492ccfd701faea20c09e4193c0db9', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-2636fbf9c5cf07db2eb3088985835730', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-28deaee4c9494ef2b376f8246894e5cb', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-2561aa2b50164e905a0582810c91d8c7', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-c75dcc64cdc7a19890fd516fede9fa28', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-75251f721ba1b411c80e7bb08fa79245', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-3d87f41690c6f1a08941628371c6e491', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-2fc85a2cee66cefff07fa4d6db5e38bc', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-7d4de834d8bb02db9a4f097e8815c5cc', 'executing')
An even bigger problem is that after running this many tasks for a while, a worker becomes unresponsive and the scheduler starts reporting:
OSError: Timed out during handshake while connecting to tcp://172.24.0.147:35091 after 25 s
while clients get:
OSError: Timed out trying to connect to tcp://172.24.0.114:42141 after 25 s
At this point the whole cluster becomes unusable until we kill the timed-out worker (which seems to be fine load/RAM-wise).
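
As a stopgap (not a fix) we are considering raising the comm timeouts so that a briefly unresponsive worker does not take the whole cluster down with it. These are standard dask config keys and can also go into distributed.yaml or DASK_* environment variables; the values below are guesses:

```python
import dask

dask.config.set({
    "distributed.comm.timeouts.connect": "60s",  # we currently hit the 25 s limit shown above
    "distributed.comm.timeouts.tcp": "120s",
})
```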
When I logged into the worker and ran strace on the dask-worker process, I got this:
It looks similar to this issue: #2880, or this one: #2446.
I will try to work towards a minimal reproducible example, but it might be tough and will certainly take a while.
Maybe you have some suggestions in the meantime?
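
For reference, this is the rough shape of the reproducer I plan to start from (everything scaled down to one machine; it has not triggered the bug yet, so take it only as a starting point — I believe resources= is forwarded to the workers, mirroring --resources 'cpu=16'):

```python
import time
from dask.distributed import Client, LocalCluster

def work(seconds):
    time.sleep(seconds)
    return seconds

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=4, threads_per_worker=16,
                           resources={"cpu": 16})
    client = Client(cluster)

    futures = []
    for _ in range(500):
        futures.append(client.submit(work, 2, resources={"cpu": 16}, pure=False))
        futures.append(client.submit(work, 1, resources={"cpu": 4}, pure=False))
        futures.append(client.submit(work, 0.5, resources={"cpu": 1}, pure=False))
    client.gather(futures)
```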
Environment:
- Python version: 3.7.6 (default, Jan 30 2020, 03:53:38) [GCC 7.3.1 20180712 (Red Hat 7.3.1-6)] on linux
- dask version: 2021.01.0
- Install method (conda, pip, source): pip