Skip to content

Timed out trying to connect to host after 10 s #2880

@seanlaw

Description

@seanlaw

I have a dask distributed cluster up and running on 40 workers:

dask_client = Client('localhost:8786')
dask_client.restart()
dask_client

I've restarted everything so no tasks are queued and the scheduler log shows:

distributed.scheduler - INFO - Clear task state

I have a large csr sparse matrix that I am scattering to the cluster:

csr_future = dask_client.scatter(csr, broadcast=True)

After a few seconds, I see:

distributed.scheduler - INFO - Remove worker tcp://10.157.169.65:38615
distributed.core - INFO - Removing comms to tcp://10.157.169.65:38615
distributed.scheduler - INFO - Remove worker tcp://10.157.169.65:33352
distributed.core - INFO - Removing comms to tcp://10.157.169.65:33352
distributed.scheduler - INFO - Register tcp://10.157.169.65:38051
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.157.169.65:38051
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register tcp://10.157.169.65:46414
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.157.169.65:46414
distributed.core - INFO - Starting established connection

So, it looks like some workers are being removed and new workers are being added back to replace those workers. Around 30 seconds after this, I see multiple tornado errors:

tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 218, in connect
    quiet_exceptions=EnvironmentError,
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
tornado.util.TimeoutError: Timeout

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 736, in send_recv_from_rpc
    comm = yield self.pool.connect(self.addr)
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 864, in connect
    connection_args=self.connection_args,
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 230, in connect
    _raise(error)
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 207, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.157.169.65:33352' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f2e39007c88>: ConnectionRefusedError: [Errno 111] Connection refused
distributed.core - ERROR - Timed out trying to connect to 'tcp://10.157.169.65:38615' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f2e3900f4a8>: ConnectionRefusedError: [Errno 111] Connection refused
Traceback (most recent call last):
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 218, in connect
    quiet_exceptions=EnvironmentError,
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
tornado.util.TimeoutError: Timeout

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 412, in handle_comm
    result = yield result
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/scheduler.py", line 2496, in scatter
    yield self.replicate(keys=keys, workers=workers, n=n)
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/scheduler.py", line 2903, in replicate
    for w, who_has in gathers.items()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
    result_list.append(f.result())
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 736, in send_recv_from_rpc
    comm = yield self.pool.connect(self.addr)
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 864, in connect
    connection_args=self.connection_args,
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 230, in connect
    _raise(error)
  File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 207, in _raise
    raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.157.169.65:38615' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f2e3900f4a8>: ConnectionRefusedError: [Errno 111] Connection refused

It looks like the time out/connection refused are referring to the same ipaddress/ports where it was trying to Removing comms from earlier up above. I can't seem to resolve this.

In case it matters, I am running these commands in a jupyterlab=0.35.5 that is running next to the dask-scheduler and we are running tornado=6.0.2 with dask=1.2.2.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions