Open
Description
What happened:
When running on an auto-scaling GKE cluster using dask-gateway, I sometimes find computation halting mid-graph. One or more workers will have tasks assigned but not actually be doing any work. Often the logs contain a traceback from some sort of failure.
What you expected to happen:
The graph should run to completion.
Minimal Complete Verifiable Example:
I don't have a reproducible example, as the graph can sometimes succeed and will often succeed if I kill the worker(s) that are stuck. I am including scheduler and worker info as per @fjetter's script (#5068), as well as GKE logs showing a traceback earlier in the process.
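One way to spot the stuck workers before killing them is to compare two snapshots of what each worker is supposed to be processing. This is a minimal sketch, assuming snapshots shaped like the mapping returned by `Client.processing()` (worker address to task keys). The helper name `stalled_workers` and the addresses are hypothetical, and an unchanged task set is only a heuristic, since a single long-running task looks the same:

```python
from typing import Dict, Hashable, Iterable, List


def stalled_workers(
    before: Dict[str, Iterable[Hashable]],
    after: Dict[str, Iterable[Hashable]],
) -> List[str]:
    """Return workers whose non-empty set of assigned tasks did not change."""
    stalled = []
    for worker, keys in after.items():
        current = set(keys)
        # A worker with tasks whose task set is identical in both
        # snapshots is a candidate for being stuck.
        if current and current == set(before.get(worker, ())):
            stalled.append(worker)
    return sorted(stalled)


# Hypothetical snapshots taken a few minutes apart,
# e.g. two calls to client.processing() on a connected Client.
before = {"tls://10.0.0.1:1": ("a", "b"), "tls://10.0.0.2:2": ("c",)}
after = {"tls://10.0.0.1:1": ("a", "b"), "tls://10.0.0.2:2": ("d",)}
print(stalled_workers(before, after))
```

Any worker this flags would still need its logs checked before being killed.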
Anything else we need to know?:
K8s log
distributed.worker - ERROR - failed during get data with tls://10.16.132.2:38819 -> tls://10.16.95.2:33407
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 198, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1527, in get_data
    response = await comm.read(deserializers=serializers)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 214, in read
    convert_stream_closed_error(self, e)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in : Stream is closed

distributed.core - INFO - Lost connection to 'tls://10.16.95.2:44406'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 198, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/distributed/core.py", line 495, in handle_comm
    result = await result
  File "/opt/conda/lib/python3.8/site-packages/distributed/worker.py", line 1527, in get_data
    response = await comm.read(deserializers=serializers)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 214, in read
    convert_stream_closed_error(self, e)
  File "/opt/conda/lib/python3.8/site-packages/distributed/comm/tcp.py", line 128, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in : Stream is closed
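To see which worker pairs show up in these failures across a larger GKE log export, the address pairs can be pulled out of the `failed during get data` lines. This is a rough sketch that assumes the message format shown in the log above. The function name `get_data_failures` is hypothetical, and the pair is simply reported in the order it appears around the arrow:

```python
import re
from typing import List, Tuple

# Matches the "failed during get data with A -> B" message seen in the log.
_FAILED_GET_DATA = re.compile(r"failed during get data with (\S+) -> (\S+)")


def get_data_failures(log_text: str) -> List[Tuple[str, str]]:
    """Return (left, right) address pairs from each matching log line."""
    return _FAILED_GET_DATA.findall(log_text)


sample = (
    "distributed.worker - ERROR - failed during get data with "
    "tls://10.16.132.2:38819 -> tls://10.16.95.2:33407\n"
    "distributed.core - INFO - Lost connection to 'tls://10.16.95.2:44406'\n"
)
pairs = get_data_failures(sample)
print(pairs)
```

Counting how often each address appears in the pairs may help narrow down which pods to inspect.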
Environment:
- Dask version: 2021.09.1+9.g9f587507
- Distributed version: 2021.09.1+14.gef281377
- Python version: python3.8
- Operating System: ubuntu 20.04
- Install method (conda, pip, source): pip