Deadlock - task not running #5366

@chrisroat

What happened:

When running on an auto-scaling GKE cluster using dask-gateway, I sometimes find computation halting mid-graph. One or more workers will have tasks assigned but will not actually be doing any work. Often the logs contain a traceback from some sort of failure.
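To make the symptom concrete, here is a minimal sketch of how one might flag workers that hold tasks but are not executing any. It assumes two snapshots shaped like the results of `Client.processing()` (worker address to assigned keys) and `Client.call_stack()` (worker address to currently running keys). The helper name and the sample addresses are hypothetical and are not part of distributed.

```python
from typing import Iterable, List, Mapping


def find_idle_workers_with_tasks(
    processing: Mapping[str, Iterable[str]],
    call_stacks: Mapping[str, Mapping[str, object]],
) -> List[str]:
    """Return addresses of workers that have assigned tasks but run none.

    `processing` maps worker address -> keys assigned to that worker.
    `call_stacks` maps worker address -> {key: call stack} for keys that
    are executing right now. Both shapes are assumptions for this sketch.
    """
    suspects = []
    for address, keys in processing.items():
        has_tasks = len(list(keys)) > 0
        is_executing = bool(call_stacks.get(address))
        if has_tasks and not is_executing:
            suspects.append(address)
    return sorted(suspects)


# Hypothetical snapshot data, for illustration only.
processing = {
    "tls://10.0.0.1:40001": ["task-a", "task-b"],
    "tls://10.0.0.2:40002": ["task-c"],
    "tls://10.0.0.3:40003": [],
}
call_stacks = {
    "tls://10.0.0.2:40002": {"task-c": ["frame"]},
}

suspects = find_idle_workers_with_tasks(processing, call_stacks)
print(suspects)
```

A single snapshot is not conclusive, since a worker may legitimately be waiting on dependencies. A worker that shows up across several snapshots taken some time apart is a better candidate for being stuck.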

What you expected to happen:

Graph to finish to completion.

Minimal Complete Verifiable Example:

I don't have a reproducible example, as the graph can sometimes succeed and will often succeed if I kill the worker(s) that are stuck. I am including scheduler and worker info as per @fjetter's script (#5068), as well as GKE logs showing a traceback earlier in the process.

Anything else we need to know?:

K8s log
distributed.worker - ERROR - failed during get data with tls://10.16.132.2:38819 -> tls://10.16.95.2:33407

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 198, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1527, in get_data
    response = await comm.read(deserializers=serializers)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 214, in read
    convert_stream_closed_error(self, e)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in : Stream is closed

distributed.core - INFO - Lost connection to 'tls://10.16.95.2:44406'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 198, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 495, in handle_comm
    result = await result
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1527, in get_data
    response = await comm.read(deserializers=serializers)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 214, in read
    convert_stream_closed_error(self, e)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in : Stream is closed
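
When many such entries are spread across pod logs, it can help to tally which addresses appear in the "failed during get data" lines. The following is a rough sketch using only the standard library. The regular expression is modeled on the single log line shown above, and the function name is hypothetical, so the pattern may need adjusting for other distributed versions.

```python
import re
from collections import Counter
from typing import Iterable

# Pattern modeled on the one "failed during get data" line in this report.
GET_DATA_FAILURE = re.compile(
    r"failed during get data with (?P<left>\S+) -> (?P<right>\S+)"
)


def count_get_data_failures(log_lines: Iterable[str]) -> Counter:
    """Count failures per address on the right-hand side of the arrow."""
    counts: Counter = Counter()
    for line in log_lines:
        match = GET_DATA_FAILURE.search(line)
        if match:
            counts[match.group("right")] += 1
    return counts


# The two log lines quoted in this report.
sample_lines = [
    "distributed.worker - ERROR - failed during get data with "
    "tls://10.16.132.2:38819 -> tls://10.16.95.2:33407",
    "distributed.core - INFO - Lost connection to 'tls://10.16.95.2:44406'",
]

failures = count_get_data_failures(sample_lines)
print(failures)
```

Only the first sample line matches the pattern. The "Lost connection" line uses a different format and is ignored by this sketch.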

Environment:

  • Dask version: 2021.09.1+9.g9f587507
  • Distributed version: 2021.09.1+14.gef281377
  • Python version: python3.8
  • Operating System: Ubuntu 20.04
  • Install method (conda, pip, source): pip

scheduler.pkl.gz
worker.pkl.gz
