Description
This problem is stochastic. It seems to occur more frequently when there is more sharing of data between workers. map_overlap calls seem particularly problematic.
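For context, here is a minimal sketch of the kind of map_overlap workload involved (array shape, chunk sizes, depth, and the per-block function are placeholders, not my actual values):

```python
import dask.array as da

# Placeholder array: chunked so that map_overlap must exchange boundary
# regions between neighbouring chunks (and therefore between workers).
arr = da.random.random((1024, 1024, 512), chunks=(256, 256, 128))

def per_block(block):
    # Placeholder per-block computation.
    return block * 0.5

# depth > 0 forces each chunk to pull halo data from its neighbours.
result = arr.map_overlap(per_block, depth=8, boundary="reflect")
result.compute()  # this is roughly where the graph stalls
```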
The cluster is set up using dask_jobqueue.LSFCluster and dask.distributed.Client:
```python
from dask.distributed import Client
from dask_jobqueue import LSFCluster

cluster = LSFCluster(
    cores=cores,
    ncpus=ncpus,
    memory=memory,
    mem=mem,
    walltime=walltime,
    env_extra=env_extra,
    **kwargs,
)
client = Client(cluster)
cluster.scale(jobs=njobs)  # number of workers
```

Workers are all allocated properly, and the bash scripts invoking LSF all look fine. The task graph starts to execute, but then gets hung up and sits indefinitely in this type of state:
No workers show any CPU activity (2-4% across all workers). The env_extra above makes sure all MKL, BLAS, and OpenMP environment variables are set to 2 threads per core (should be fine with hyperthreading?).
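For reference, env_extra is roughly the following (an illustrative sketch; the exact list of variables I set may differ slightly):

```python
# Illustrative sketch of the env_extra passed to LSFCluster above.
# Each entry is a shell line prepended to the generated LSF job script.
env_extra = [
    "export OMP_NUM_THREADS=2",
    "export MKL_NUM_THREADS=2",
    "export OPENBLAS_NUM_THREADS=2",
]
```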
When I click on the red task on the left of the graph I see:
hung_cluster_last_task_left.pdf
When I click on the red task on the right of the graph (second to last column) I see:
hung_cluster_last_task.pdf
For the red task on the right, the two "workers with data" show:
I've let these hang for upwards of 30 minutes with no meaningful CPU activity on any worker before killing the cluster manually. I can't let it run any longer because I'm paying for cluster time, so I don't know whether it's just (intractably) slow or totally hung. By comparison, the entire rest of the task graph executed in less than 180 seconds.
Any pointers as to what could be causing this or how to permanently avoid it would be really appreciated.
- Dask version: 2020.12.0
- Python version: 3.8.5
- Operating System: CentOS
- Install method (conda, pip, source): pip



