
Resource constraints and work stealing not working as expected, the cluster ends up stuck #4446

@dtenchurin

Description

Hello,

This is pretty hard to reproduce at the moment, but the problem is happening often enough that we can no longer ignore it.

Setup:

  1. 100 16-core boxes, each running a single worker started with --resources 'cpu=16'
  2. 3 task types: A: {'cpu': 16}, B: {'cpu': 4}, C: {'cpu': 1}; tasks B and C are Python subprocess calls, and task A is a function that calls sklearn .fit()
  3. work stealing is enabled
  4. there are normally about 2000+ tasks in the queue (a rough sketch of the submission pattern follows this list)
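
Roughly, the submission pattern looks like this. This is only a simplified sketch: the scheduler address, the task function names, and the inputs_* lists are placeholders, not our real code.

```python
# Simplified sketch of the setup above. fit_model, run_subprocess_b/c and the
# inputs_* lists are placeholders.
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

# Every worker runs alone on a 16-core box and is started with:
#   dask-worker tcp://scheduler:8786 --resources "cpu=16"

futures = []
futures += [client.submit(fit_model, x, resources={"cpu": 16})        # task A
            for x in inputs_a]
futures += [client.submit(run_subprocess_b, x, resources={"cpu": 4})  # task B
            for x in inputs_b]
futures += [client.submit(run_subprocess_c, x, resources={"cpu": 1})  # task C
            for x in inputs_c]
```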

Observed bug:
At some point the available_resources dictionary in the worker object becomes incorrect. I'm not sure what leads to this state, but something seems broken in the state-transition mechanism, so a particular worker looks like it has more available resources than it actually does. This causes over-scheduling of tasks onto that worker.

Example:
2021-01-21 07:52:02.021 INFO: {'tcp://172.24.0.144:39033': (9,)}, available_resources: {'cpu': 3.0}, used_calculated_outside_dask: 24
2021-01-21 07:52:02.021 INFO: i('A-f29492ccfd701faea20c09e4193c0db9', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-2636fbf9c5cf07db2eb3088985835730', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-28deaee4c9494ef2b376f8246894e5cb', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-2561aa2b50164e905a0582810c91d8c7', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-c75dcc64cdc7a19890fd516fede9fa28', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-75251f721ba1b411c80e7bb08fa79245', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-3d87f41690c6f1a08941628371c6e491', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-2fc85a2cee66cefff07fa4d6db5e38bc', 'executing')
2021-01-21 07:52:02.021 INFO: i('C-7d4de834d8bb02db9a4f097e8815c5cc', 'executing')
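
For reference, a similar snapshot can be pulled from each worker roughly like this. This is a hedged sketch rather than our exact logging code; the helper name is made up, and the Worker attribute names are what I see in distributed 2021.01.0 and may differ in other versions.

```python
# Sketch: collect per-worker resource accounting and the keys currently
# marked as executing. Attribute names (tasks, available_resources,
# total_resources) match distributed 2021.01.0 as far as I can tell.
def snapshot(dask_worker):
    executing = [
        key for key, ts in dask_worker.tasks.items() if ts.state == "executing"
    ]
    return {
        "available_resources": dict(dask_worker.available_resources),
        "total_resources": dict(dask_worker.total_resources),
        "executing": executing,
    }

# client.run passes the Worker instance when the function takes a
# `dask_worker` argument; the result is a dict keyed by worker address.
per_worker = client.run(snapshot)
```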

An even bigger problem is that after running this many tasks for a while, the worker becomes unresponsive and the scheduler starts reporting:
OSError: Timed out during handshake while connecting to tcp://172.24.0.147:35091 after 25 s
while clients get:
OSError: Timed out trying to connect to tcp://172.24.0.114:42141 after 25 s
At this point the whole cluster becomes unusable until we kill the timed-out worker (which seems to be fine in terms of load and RAM).
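
For reference, a client-side version of that manual kill might look roughly like the sketch below. The worker address is just an example, and the exact exception raised on a connection timeout may vary between distributed versions.

```python
# Rough sketch of evicting a stuck worker from the client side.
suspect = "tcp://172.24.0.147:35091"  # example address of the unresponsive worker

try:
    # Any cheap call routed to the worker serves as a liveness check.
    client.run(lambda: "pong", workers=[suspect])
except Exception:
    # The worker did not answer in time: ask the scheduler to retire and
    # close it so the rest of the cluster can make progress again.
    client.retire_workers(workers=[suspect], close_workers=True)
```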

When I logged into the worker and ran strace on the dask-worker process, I got this:

This looks similar to #2880 or #2446.

I will try to work towards a minimal reproducible example, but it might be tough and will certainly take a while.

Maybe you have some suggestions in the meantime?

Environment:

  • Python: 3.7.6 (default, Jan 30 2020, 03:53:38) [GCC 7.3.1 20180712 (Red Hat 7.3.1-6)] on linux
  • dask.__version__: '2021.01.0'
  • Install method (conda, pip, source): pip
