
Resource constraints and work stealing not working as expected, the cluster ends up stuck #4446

@dtenchurin

Description

Hello,

This is pretty hard to reproduce at the moment, but the problem is happening often enough that we can no longer ignore it.

Setup:

  1. 100 16-core boxes, each running a single worker started with --resources 'cpu=16'
  2. 3 task types: A: {'cpu': 16}, B: {'cpu': 4}, C: {'cpu': 1}; tasks B and C are Python subprocess calls, and task A is a function that calls sklearn .fit()
  3. work stealing is enabled
  4. there are normally about 2000+ tasks in the queue (a rough sketch of the submission pattern follows this list)
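
Roughly, the submission pattern looks like this. This is only a simplified sketch: the scheduler address, the task function names, and the inputs_* lists are placeholders, not our real code.

```python
# Simplified sketch of the setup above. fit_model, run_subprocess_b/c and the
# inputs_* lists are placeholders.
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

# Every worker runs alone on a 16-core box and is started with:
#   dask-worker tcp://scheduler:8786 --resources "cpu=16"

futures = []
futures += [client.submit(fit_model, x, resources={"cpu": 16})        # task A
            for x in inputs_a]
futures += [client.submit(run_subprocess_b, x, resources={"cpu": 4})  # task B
            for x in inputs_b]
futures += [client.submit(run_subprocess_c, x, resources={"cpu": 1})  # task C
            for x in inputs_c]
```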

Observed bug:
At some point the available_resources dictionary in the worker object becomes incorrect. I'm not sure what leads to this state, but something seems broken in the state-transition mechanism, so a particular worker looks like it has more available resources than it actually does. This causes over-scheduling of tasks onto that worker.

Example:
2021-01-21 07:52:02.021 INFO: {'tcp://172.24.0.144:39033': (9,)}, available_resources: {'cpu': 3.0}, used_calculated_outside_dask: 24
2021-01-21 07:52:02.021 INFO: i('A-f29492ccfd701faea20c09e4193c0db9', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-2636fbf9c5cf07db2eb3088985835730', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-28deaee4c9494ef2b376f8246894e5cb', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-2561aa2b50164e905a0582810c91d8c7', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-c75dcc64cdc7a19890fd516fede9fa28', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-75251f721ba1b411c80e7bb08fa79245', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-3d87f41690c6f1a08941628371c6e491', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-2fc85a2cee66cefff07fa4d6db5e38bc', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-7d4de834d8bb02db9a4f097e8815c5cc', 'executing')
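
For reference, a similar snapshot can be pulled from each worker roughly like this. This is a hedged sketch rather than our exact logging code; the helper name is made up, and the Worker attribute names are what I see in distributed 2021.01.0 and may differ in other versions.

```python
# Sketch: collect per-worker resource accounting and the keys currently
# marked as executing. Attribute names (tasks, available_resources,
# total_resources) match distributed 2021.01.0 as far as I can tell.
def snapshot(dask_worker):
    executing = [
        key for key, ts in dask_worker.tasks.items() if ts.state == "executing"
    ]
    return {
        "available_resources": dict(dask_worker.available_resources),
        "total_resources": dict(dask_worker.total_resources),
        "executing": executing,
    }

# client.run passes the Worker instance when the function takes a
# `dask_worker` argument; the result is a dict keyed by worker address.
per_worker = client.run(snapshot)
```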

An even bigger problem is that after running this many tasks for a while, the worker becomes unresponsive and the scheduler starts reporting:
OSError: Timed out during handshake while connecting to tcp://172.24.0.147:35091 after 25 s
while clients get:
OSError: Timed out trying to connect to tcp://172.24.0.114:42141 after 25 s
At this point the whole cluster becomes unusable until we kill the timed-out worker (which seems to be fine in terms of load and RAM).
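
For reference, a client-side version of that manual kill might look roughly like the sketch below. The worker address is just an example, and the exact exception raised on a connection timeout may vary between distributed versions.

```python
# Rough sketch of evicting a stuck worker from the client side.
suspect = "tcp://172.24.0.147:35091"  # example address of the unresponsive worker

try:
    # Any cheap call routed to the worker serves as a liveness check.
    client.run(lambda: "pong", workers=[suspect])
except Exception:
    # The worker did not answer in time: ask the scheduler to retire and
    # close it so the rest of the cluster can make progress again.
    client.retire_workers(workers=[suspect], close_workers=True)
```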

When I logged into the worker and ran strace on the dask-worker process, I got this:

This looks similar to #2880 or #2446.

I will try to work towards a minimal reproducible example, but it might be tough and will certainly take a while.

Maybe you have some suggestions in the meantime?

Environment:

  • Python: 3.7.6 (default, Jan 30 2020, 03:53:38) [GCC 7.3.1 20180712 (Red Hat 7.3.1-6)] on linux
  • dask.__version__: '2021.01.0'
  • Install method (conda, pip, source): pip
