Scheduler stops itself due to idle timeout, even though workers should still be working #5675

@gjoseph92


A user has reported that a long-running scheduler (up for ~18h) appears to have shut itself down because it thought it was idle. However, the scheduler had plenty of work left to do.
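(For context: the scheduler only shuts itself down like this when an idle timeout has been configured. Presumably this cluster set it to 300 s, e.g. via something like dask-scheduler --idle-timeout 300s or the distributed.scheduler.idle-timeout config value; by default the timeout is disabled.)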

The scheduler logs looked like:

distributed.core - INFO - Event loop was unresponsive in Scheduler for 4.35s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 4.60s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 3.08s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 3.24s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 3.48s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 3.76s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 3.74s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 4.22s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 4.61s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 4.93s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 5.44s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.scheduler - INFO - Scheduler closing after being idle for 300.00 s
distributed.scheduler - INFO - Scheduler closing...
distributed.scheduler - INFO - Scheduler closing all comms
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.10.95:34519', name: tcp://172.31.10.95:34519, status: running, memory: 139350, processing: 0>
distributed.core - INFO - Removing comms to tcp://172.31.10.95:34519
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.31.188:40365', name: tcp://172.31.31.188:40365, status: running, memory: 131425, processing: 1>
distributed.core - INFO - Removing comms to tcp://172.31.31.188:40365
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.7.23:33299', name: tcp://172.31.7.23:33299, status: running, memory: 133166, processing: 2>
distributed.core - INFO - Removing comms to tcp://172.31.7.23:33299
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.11.172:34905', name: tcp://172.31.11.172:34905, status: running, memory: 141044, processing: 3>
distributed.core - INFO - Removing comms to tcp://172.31.11.172:34905
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.32.43:40345', name: tcp://172.31.32.43:40345, status: running, memory: 128135, processing: 4>
distributed.core - INFO - Removing comms to tcp://172.31.32.43:40345
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.2.153:40059', name: tcp://172.31.2.153:40059, status: running, memory: 135440, processing: 5>
distributed.core - INFO - Removing comms to tcp://172.31.2.153:40059
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.33.86:43825', name: tcp://172.31.33.86:43825, status: running, memory: 133801, processing: 7>
distributed.core - INFO - Removing comms to tcp://172.31.33.86:43825
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.30.46:40461', name: tcp://172.31.30.46:40461, status: running, memory: 131706, processing: 3>
distributed.core - INFO - Removing comms to tcp://172.31.30.46:40461
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.43.190:45543', name: tcp://172.31.43.190:45543, status: running, memory: 135540, processing: 2>
distributed.core - INFO - Removing comms to tcp://172.31.43.190:45543
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.0.126:44369', name: tcp://172.31.0.126:44369, status: running, memory: 134745, processing: 12>
distributed.core - INFO - Removing comms to tcp://172.31.0.126:44369
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.31.190:44953', name: tcp://172.31.31.190:44953, status: running, memory: 134171, processing: 12>
distributed.core - INFO - Removing comms to tcp://172.31.31.190:44953
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.26.162:45605', name: tcp://172.31.26.162:45605, status: running, memory: 126044, processing: 14>
distributed.core - INFO - Removing comms to tcp://172.31.26.162:45605

Without timestamps (#4762), we don't know if those "event loop unresponsive" pauses were related and happened immediately prior to the idle shutdown, or if they were long before.
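As an aside, until #4762 lands, timestamps can be bolted onto these logs through standard Python logging. A minimal sketch (an assumption on my part, not something this cluster was running), executed inside the scheduler process after distributed has configured its loggers:

import logging

# Attach a timestamped formatter to whatever handlers distributed installed
# on its "distributed" logger. If no handlers are present, this is a no-op.
fmt = logging.Formatter("%(asctime)s %(name)s - %(levelname)s - %(message)s")
for handler in logging.getLogger("distributed").handlers:
    handler.setFormatter(fmt)

The same effect should be achievable by putting %(asctime)s into distributed's log-format setting in the dask config before the scheduler starts.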

The confusing thing is that the idle shutdown only happens if no workers are processing any tasks (and there are no unrunnable tasks). Meaning that, on every check throughout those 300s, either:

  1. The scheduler thought no workers had processing tasks
  2. No workers were connected

check_idle code, for reference:

def check_idle(self):
    parent: SchedulerState = cast(SchedulerState, self)
    ws: WorkerState
    if (
        any([ws._processing for ws in parent._workers_dv.values()])
        or parent._unrunnable
    ):
        self.idle_since = None
        return
    elif not self.idle_since:
        self.idle_since = time()

    if time() > self.idle_since + self.idle_timeout:
        logger.info(
            "Scheduler closing after being idle for %s",
            format_time(self.idle_timeout),
        )
        self.loop.add_callback(self.close)

Yet in the logs immediately following the close, we can see:

  1. Almost all workers do have some tasks processing (processing: 2, processing: 12, etc.)
  2. There are clearly workers connected

One thing I do notice is that time() is not monotonic (we should use time.monotonic() instead, xref #4528). So in theory, if the system clock changed (plausible on a scheduler that has been up for ~18h?), we might not actually wait the full 300s. That still doesn't explain how the scheduler got into a state where it thought no workers were processing; but if that state happened to overlap with a forward jump of the system clock, the shutdown could be triggered immediately instead of actually requiring 300s to pass.
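To make the clock-jump scenario concrete, here's a hypothetical sketch (not the scheduler's actual code; IdleChecker and its methods are made up for illustration) of the same idle check built on time.monotonic(), which NTP adjustments or manual clock changes can't skew:

import time
from typing import Optional


class IdleChecker:
    """Toy idle tracker based on a monotonic clock."""

    def __init__(self, idle_timeout: float) -> None:
        self.idle_timeout = idle_timeout
        self.idle_since: Optional[float] = None

    def check(self, busy: bool) -> bool:
        """Return True once we've been continuously idle for idle_timeout seconds."""
        if busy:
            self.idle_since = None
            return False
        if self.idle_since is None:
            # time.monotonic() only ever moves forward at a steady rate, so a
            # wall-clock jump can't make the timeout appear to have elapsed early.
            self.idle_since = time.monotonic()
        return time.monotonic() - self.idle_since > self.idle_timeout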

One other thing to note is that (according to logs), over the lifetime of the cluster, workers connected 114 times, and disconnected 79 times. So there was a lot of worker churn. When a worker disconnects, you'd see logs like:

distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.36.136:32833', name: tcp://172.31.36.136:32833, status: closing, memory: 15652, processing: 2214>
distributed.core - INFO - Removing comms to tcp://172.31.36.136:32833
distributed.batched - INFO - Batched Comm Closed <TCP (closed) Scheduler connection to worker local=tcp://172.31.35.227:8786 remote=tcp://172.31.39.2:46620>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/distributed/batched.py", line 93, in _background_send
    nbytes = yield self.comm.write(
  File "/usr/local/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/usr/local/lib/python3.8/site-packages/distributed/comm/tcp.py", line 248, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError
distributed.batched - INFO - Batched Comm Closed <TCP (closed) Scheduler connection to worker local=tcp://172.31.35.227:8786 remote=tcp://172.31.45.150:36594>

I mention this also because there are known issues around BatchedSend reconnection (#5481; see the review of #5457 for more). I'm not sure whether that matters in this particular case, though.

Note that these numbers leave 35 workers unaccounted for. After the "Scheduler closing after being idle for 300.00 s" message, we see only 12 Remove worker messages, but by my count of how many times the scheduler registered a worker, we should have seen 47 Remove worker messages at the end (47 - 12 = 35). (When the cluster first started, 47 workers connected, so this number lines up.)

cc @sevberg
