A user has reported that a long-running scheduler (up for ~18h) appears to have shut itself down because it thought it was idle. However, the scheduler had plenty of work left to do.
The scheduler logs looked like:
distributed.core - INFO - Event loop was unresponsive in Scheduler for 4.35s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 4.60s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 3.08s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 3.24s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 3.48s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 3.76s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 3.74s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 4.22s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 4.61s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 4.93s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 5.44s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.scheduler - INFO - Scheduler closing after being idle for 300.00 s
distributed.scheduler - INFO - Scheduler closing...
distributed.scheduler - INFO - Scheduler closing all comms
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.10.95:34519', name: tcp://172.31.10.95:34519, status: running, memory: 139350, processing: 0>
distributed.core - INFO - Removing comms to tcp://172.31.10.95:34519
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.31.188:40365', name: tcp://172.31.31.188:40365, status: running, memory: 131425, processing: 1>
distributed.core - INFO - Removing comms to tcp://172.31.31.188:40365
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.7.23:33299', name: tcp://172.31.7.23:33299, status: running, memory: 133166, processing: 2>
distributed.core - INFO - Removing comms to tcp://172.31.7.23:33299
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.11.172:34905', name: tcp://172.31.11.172:34905, status: running, memory: 141044, processing: 3>
distributed.core - INFO - Removing comms to tcp://172.31.11.172:34905
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.32.43:40345', name: tcp://172.31.32.43:40345, status: running, memory: 128135, processing: 4>
distributed.core - INFO - Removing comms to tcp://172.31.32.43:40345
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.2.153:40059', name: tcp://172.31.2.153:40059, status: running, memory: 135440, processing: 5>
distributed.core - INFO - Removing comms to tcp://172.31.2.153:40059
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.33.86:43825', name: tcp://172.31.33.86:43825, status: running, memory: 133801, processing: 7>
distributed.core - INFO - Removing comms to tcp://172.31.33.86:43825
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.30.46:40461', name: tcp://172.31.30.46:40461, status: running, memory: 131706, processing: 3>
distributed.core - INFO - Removing comms to tcp://172.31.30.46:40461
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.43.190:45543', name: tcp://172.31.43.190:45543, status: running, memory: 135540, processing: 2>
distributed.core - INFO - Removing comms to tcp://172.31.43.190:45543
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.0.126:44369', name: tcp://172.31.0.126:44369, status: running, memory: 134745, processing: 12>
distributed.core - INFO - Removing comms to tcp://172.31.0.126:44369
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.31.190:44953', name: tcp://172.31.31.190:44953, status: running, memory: 134171, processing: 12>
distributed.core - INFO - Removing comms to tcp://172.31.31.190:44953
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.26.162:45605', name: tcp://172.31.26.162:45605, status: running, memory: 126044, processing: 14>
distributed.core - INFO - Removing comms to tcp://172.31.26.162:45605
Without timestamps (#4762), we don't know if those "event loop unresponsive" pauses were related and happened immediately prior to the idle shutdown, or if they were long before.
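As a stopgap until timestamps land, a timestamped formatter can be retro-fitted onto the distributed loggers by hand. A minimal sketch using only the standard library (the format string below is just an example, not what distributed configures by default):

```python
import logging

# Add timestamps to whatever handlers the "distributed" logger already has,
# so that future "event loop unresponsive" and "closing after being idle"
# messages can be placed in time relative to each other.
fmt = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
for handler in logging.getLogger("distributed").handlers:
    handler.setFormatter(fmt)
```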
The confusing thing is that the idle shutdown only happens if no workers are processing any tasks (and there are no unrunnable tasks). That means that, during every check over the 300s window, either:
- The scheduler thought no workers had processing tasks
- No workers were connected
check_idle code, for reference (distributed/distributed/scheduler.py, lines 7865 to 7882 at 2a3ee56):

```python
def check_idle(self):
    parent: SchedulerState = cast(SchedulerState, self)
    ws: WorkerState
    if (
        any([ws._processing for ws in parent._workers_dv.values()])
        or parent._unrunnable
    ):
        self.idle_since = None
        return
    elif not self.idle_since:
        self.idle_since = time()

    if time() > self.idle_since + self.idle_timeout:
        logger.info(
            "Scheduler closing after being idle for %s",
            format_time(self.idle_timeout),
        )
        self.loop.add_callback(self.close)
```
Yet in the logs immediately following the close, we can see:
- Almost all workers do have some tasks processing (processing: 2, processing: 12, etc.)
- There are clearly workers connected
One thing I do notice is that time() is not monotonic (we should use time.monotonic() instead, xref #4528). So in theory, if the system clock changed (which seems plausible on a scheduler that has been running for ~18h), we might not actually be waiting the full 300s. That still doesn't explain how the scheduler came to think no workers were processing, but if that situation happened to overlap with a forward jump of the system clock, the shutdown could be triggered immediately rather than after a genuine 300s of idleness.
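To illustrate why the choice of clock matters, here is a minimal, self-contained sketch of the same bookkeeping (the IdleTracker name and its arguments are made up for this example, not distributed's API) written against time.monotonic(), which cannot be moved by a wall-clock change:

```python
import time


class IdleTracker:
    """Sketch of check_idle-style bookkeeping on a monotonic clock."""

    def __init__(self, idle_timeout=300):
        self.idle_timeout = idle_timeout  # seconds, e.g. the 300s idle-timeout
        self.idle_since = None  # monotonic timestamp, not wall-clock time

    def check_idle(self, workers_processing, unrunnable):
        # workers_processing: per-worker counts of processing tasks
        # unrunnable: tasks that currently have no worker to run on
        if any(workers_processing) or unrunnable:
            self.idle_since = None
            return False
        if self.idle_since is None:
            self.idle_since = time.monotonic()
        # time.monotonic() never jumps with NTP corrections or manual clock
        # changes, so this really does require idle_timeout seconds to elapse.
        return time.monotonic() > self.idle_since + self.idle_timeout
```

With wall-clock time(), the equivalent comparison can fire on the very next check after a forward clock jump; with a monotonic clock it cannot.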
One other thing to note is that, according to the logs, workers connected 114 times and disconnected 79 times over the lifetime of the cluster, so there was a lot of worker churn. When a worker disconnects, you'd see logs like:
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.36.136:32833', name: tcp://172.31.36.136:32833, status: closing, memory: 15652, processing: 2214>
distributed.core - INFO - Removing comms to tcp://172.31.36.136:32833
distributed.batched - INFO - Batched Comm Closed <TCP (closed) Scheduler connection to worker local=tcp://172.31.35.227:8786 remote=tcp://172.31.39.2:46620>
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/distributed/batched.py", line 93, in _background_send
nbytes = yield self.comm.write(
File "/usr/local/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/usr/local/lib/python3.8/site-packages/distributed/comm/tcp.py", line 248, in write
raise CommClosedError()
distributed.comm.core.CommClosedError
distributed.batched - INFO - Batched Comm Closed <TCP (closed) Scheduler connection to worker local=tcp://172.31.35.227:8786 remote=tcp://172.31.45.150:36594>
I mention this also because there are known issues around BatchedSend reconnecting (#5481; see also the review of #5457 for more issues). I'm not sure whether it matters in this particular case, though.
Note that these numbers leave 35 workers unaccounted for. That is, after the Scheduler closing after being idle for 300.00 s message we see only 12 Remove worker messages, whereas by my count of how many times the scheduler registered a worker, we should have seen 47 Remove worker messages at the end. (When the cluster first started, 47 workers connected, so that number lines up.)
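For what it's worth, a count like the one above can be reproduced by scanning a captured scheduler log. A rough sketch (the scheduler.log path is a placeholder, and it assumes the scheduler's usual "Register worker" / "Remove worker" INFO lines):

```python
# Tally worker registrations vs. removals in a saved scheduler log.
registered = removed = 0
with open("scheduler.log") as f:  # placeholder path to the captured log
    for line in f:
        if "Register worker" in line:
            registered += 1
        elif "Remove worker" in line:
            removed += 1

print(f"registered={registered}, removed={removed}, "
      f"net still connected={registered - removed}")
```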
cc @sevberg