Closed
Description
I have a dask distributed cluster up and running on 40 workers:
from dask.distributed import Client
dask_client = Client('localhost:8786')
dask_client.restart()
dask_client
I've restarted everything so no tasks are queued and the scheduler log shows:
distributed.scheduler - INFO - Clear task state
I have a large CSR sparse matrix that I am scattering to the cluster:
csr_future = dask_client.scatter(csr, broadcast=True)
After a few seconds, I see:
distributed.scheduler - INFO - Remove worker tcp://10.157.169.65:38615
distributed.core - INFO - Removing comms to tcp://10.157.169.65:38615
distributed.scheduler - INFO - Remove worker tcp://10.157.169.65:33352
distributed.core - INFO - Removing comms to tcp://10.157.169.65:33352
distributed.scheduler - INFO - Register tcp://10.157.169.65:38051
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.157.169.65:38051
distributed.core - INFO - Starting established connection
distributed.scheduler - INFO - Register tcp://10.157.169.65:46414
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.157.169.65:46414
distributed.core - INFO - Starting established connection
So it looks like some workers are being removed and new workers are being registered to replace them. About 30 seconds later, I see multiple tornado errors:
tornado.application - ERROR - Multiple exceptions in yield list
Traceback (most recent call last):
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 218, in connect
quiet_exceptions=EnvironmentError,
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
tornado.util.TimeoutError: Timeout
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
result_list.append(f.result())
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 736, in send_recv_from_rpc
comm = yield self.pool.connect(self.addr)
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 864, in connect
connection_args=self.connection_args,
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 230, in connect
_raise(error)
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 207, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.157.169.65:33352' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f2e39007c88>: ConnectionRefusedError: [Errno 111] Connection refused
distributed.core - ERROR - Timed out trying to connect to 'tcp://10.157.169.65:38615' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f2e3900f4a8>: ConnectionRefusedError: [Errno 111] Connection refused
Traceback (most recent call last):
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 218, in connect
quiet_exceptions=EnvironmentError,
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
tornado.util.TimeoutError: Timeout
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 412, in handle_comm
result = yield result
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/scheduler.py", line 2496, in scatter
yield self.replicate(keys=keys, workers=workers, n=n)
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/scheduler.py", line 2903, in replicate
for w, who_has in gathers.items()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 501, in callback
result_list.append(f.result())
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 736, in send_recv_from_rpc
comm = yield self.pool.connect(self.addr)
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/core.py", line 864, in connect
connection_args=self.connection_args,
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/app/home/miniconda3/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 230, in connect
_raise(error)
File "/app/home/miniconda3/lib/python3.7/site-packages/distributed/comm/core.py", line 207, in _raise
raise IOError(msg)
OSError: Timed out trying to connect to 'tcp://10.157.169.65:38615' after 10 s: in <distributed.comm.tcp.TCPConnector object at 0x7f2e3900f4a8>: ConnectionRefusedError: [Errno 111] Connection refused
The timeout/connection-refused errors refer to the same IP address and ports (33352 and 38615) that appear in the "Remove worker" and "Removing comms to" messages above. I can't seem to resolve this.
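To double-check that the refused addresses really are the removed workers, here is a minimal sketch of the cross-check, assuming the scheduler log is available as lines of text. `removed_and_refused` is a hypothetical helper written for this post, not part of distributed, and the sample lines are shortened excerpts of the log above.

```python
import re

# Matches worker addresses such as tcp://10.157.169.65:38615
ADDRESS = re.compile(r"tcp://[\d.]+:\d+")


def removed_and_refused(log_lines):
    """Collect addresses the scheduler removed and addresses that timed out.

    Hypothetical helper for illustration only; it just scans log text.
    """
    removed = set()
    refused = set()
    for line in log_lines:
        match = ADDRESS.search(line)
        if match is None:
            continue
        if "Remove worker" in line:
            removed.add(match.group())
        elif "Timed out trying to connect" in line:
            refused.add(match.group())
    return removed, refused


# Shortened excerpts of the scheduler log shown above
log = [
    "distributed.scheduler - INFO - Remove worker tcp://10.157.169.65:38615",
    "distributed.core - INFO - Removing comms to tcp://10.157.169.65:38615",
    "distributed.scheduler - INFO - Remove worker tcp://10.157.169.65:33352",
    "distributed.scheduler - INFO - Register tcp://10.157.169.65:38051",
    "OSError: Timed out trying to connect to 'tcp://10.157.169.65:33352' after 10 s",
    "distributed.core - ERROR - Timed out trying to connect to 'tcp://10.157.169.65:38615' after 10 s",
]

removed, refused = removed_and_refused(log)
stale = refused & removed  # addresses that were removed and later refused
print(sorted(stale))
```

If every refused address also shows up in the removed set, the connection attempts are going to workers that had already been removed, not to the newly registered ones.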
In case it matters, I am running these commands in JupyterLab 0.35.5, which runs next to the dask-scheduler, and we are using tornado=6.0.2 with dask=1.2.2.