Scheduler stops itself due to idle timeout, even though workers should still be working #5675

@gjoseph92


A user has reported that a long-running scheduler (up for ~18h) appears to have shut itself down because it thought it was idle. However, the scheduler had plenty of work left to do.
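(For context: the scheduler only shuts itself down like this when an idle timeout has been configured. Presumably this cluster set it to 300 s, e.g. via something like dask-scheduler --idle-timeout 300s or the distributed.scheduler.idle-timeout config value; by default the timeout is disabled.)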

The scheduler logs looked like:

distributed.core - INFO - Event loop was unresponsive in Scheduler for 4.35s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 4.60s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 3.08s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 3.24s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 3.48s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 3.76s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 3.74s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 4.22s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 4.61s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 4.93s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.core - INFO - Event loop was unresponsive in Scheduler for 5.44s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
distributed.scheduler - INFO - Scheduler closing after being idle for 300.00 s
distributed.scheduler - INFO - Scheduler closing...
distributed.scheduler - INFO - Scheduler closing all comms
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.10.95:34519', name: tcp://172.31.10.95:34519, status: running, memory: 139350, processing: 0>
distributed.core - INFO - Removing comms to tcp://172.31.10.95:34519
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.31.188:40365', name: tcp://172.31.31.188:40365, status: running, memory: 131425, processing: 1>
distributed.core - INFO - Removing comms to tcp://172.31.31.188:40365
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.7.23:33299', name: tcp://172.31.7.23:33299, status: running, memory: 133166, processing: 2>
distributed.core - INFO - Removing comms to tcp://172.31.7.23:33299
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.11.172:34905', name: tcp://172.31.11.172:34905, status: running, memory: 141044, processing: 3>
distributed.core - INFO - Removing comms to tcp://172.31.11.172:34905
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.32.43:40345', name: tcp://172.31.32.43:40345, status: running, memory: 128135, processing: 4>
distributed.core - INFO - Removing comms to tcp://172.31.32.43:40345
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.2.153:40059', name: tcp://172.31.2.153:40059, status: running, memory: 135440, processing: 5>
distributed.core - INFO - Removing comms to tcp://172.31.2.153:40059
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.33.86:43825', name: tcp://172.31.33.86:43825, status: running, memory: 133801, processing: 7>
distributed.core - INFO - Removing comms to tcp://172.31.33.86:43825
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.30.46:40461', name: tcp://172.31.30.46:40461, status: running, memory: 131706, processing: 3>
distributed.core - INFO - Removing comms to tcp://172.31.30.46:40461
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.43.190:45543', name: tcp://172.31.43.190:45543, status: running, memory: 135540, processing: 2>
distributed.core - INFO - Removing comms to tcp://172.31.43.190:45543
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.0.126:44369', name: tcp://172.31.0.126:44369, status: running, memory: 134745, processing: 12>
distributed.core - INFO - Removing comms to tcp://172.31.0.126:44369
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.31.190:44953', name: tcp://172.31.31.190:44953, status: running, memory: 134171, processing: 12>
distributed.core - INFO - Removing comms to tcp://172.31.31.190:44953
distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.26.162:45605', name: tcp://172.31.26.162:45605, status: running, memory: 126044, processing: 14>
distributed.core - INFO - Removing comms to tcp://172.31.26.162:45605

Without timestamps (#4762), we don't know if those "event loop unresponsive" pauses were related and happened immediately prior to the idle shutdown, or if they were long before.
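As an aside, until #4762 lands, timestamps can be bolted onto these logs through standard Python logging. A minimal sketch (an assumption on my part, not something this cluster was running), executed inside the scheduler process after distributed has configured its loggers:

import logging

# Attach a timestamped formatter to whatever handlers distributed installed
# on its "distributed" logger. If no handlers are present, this is a no-op.
fmt = logging.Formatter("%(asctime)s %(name)s - %(levelname)s - %(message)s")
for handler in logging.getLogger("distributed").handlers:
    handler.setFormatter(fmt)

The same effect should be achievable by putting %(asctime)s into distributed's log-format setting in the dask config before the scheduler starts.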

The confusing thing is that the idle shutdown only happens if no workers are processing any tasks (and there are no unrunnable tasks). Meaning that, on every check throughout those 300s, either:

  1. The scheduler thought no workers had processing tasks
  2. No workers were connected

check_idle code, for reference:

def check_idle(self):
    parent: SchedulerState = cast(SchedulerState, self)
    ws: WorkerState
    if (
        any([ws._processing for ws in parent._workers_dv.values()])
        or parent._unrunnable
    ):
        self.idle_since = None
        return
    elif not self.idle_since:
        self.idle_since = time()

    if time() > self.idle_since + self.idle_timeout:
        logger.info(
            "Scheduler closing after being idle for %s",
            format_time(self.idle_timeout),
        )
        self.loop.add_callback(self.close)

Yet in the logs immediately following the close, we can see:

  1. Almost all workers do have some tasks processing (processing: 2, processing: 12, etc.)
  2. There are clearly workers connected

One thing I do notice is that time() is not monotonic (we should use time.monotonic() instead, xref #4528). So in theory, if the system clock changed (plausible on a scheduler that has been up for ~18h?), we might not actually wait the full 300s. That still doesn't explain how the scheduler got into a state where it thought no workers were processing; but if that state happened to overlap with a forward jump of the system clock, the shutdown could be triggered immediately instead of actually requiring 300s to pass.
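To make the clock-jump scenario concrete, here's a hypothetical sketch (not the scheduler's actual code; IdleChecker and its methods are made up for illustration) of the same idle check built on time.monotonic(), which NTP adjustments or manual clock changes can't skew:

import time
from typing import Optional


class IdleChecker:
    """Toy idle tracker based on a monotonic clock."""

    def __init__(self, idle_timeout: float) -> None:
        self.idle_timeout = idle_timeout
        self.idle_since: Optional[float] = None

    def check(self, busy: bool) -> bool:
        """Return True once we've been continuously idle for idle_timeout seconds."""
        if busy:
            self.idle_since = None
            return False
        if self.idle_since is None:
            # time.monotonic() only ever moves forward at a steady rate, so a
            # wall-clock jump can't make the timeout appear to have elapsed early.
            self.idle_since = time.monotonic()
        return time.monotonic() - self.idle_since > self.idle_timeout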

One other thing to note is that (according to logs), over the lifetime of the cluster, workers connected 114 times, and disconnected 79 times. So there was a lot of worker churn. When a worker disconnects, you'd see logs like:

distributed.scheduler - INFO - Remove worker <WorkerState 'tcp://172.31.36.136:32833', name: tcp://172.31.36.136:32833, status: closing, memory: 15652, processing: 2214>
distributed.core - INFO - Removing comms to tcp://172.31.36.136:32833
distributed.batched - INFO - Batched Comm Closed <TCP (closed) Scheduler connection to worker local=tcp://172.31.35.227:8786 remote=tcp://172.31.39.2:46620>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/distributed/batched.py", line 93, in _background_send
    nbytes = yield self.comm.write(
  File "/usr/local/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/usr/local/lib/python3.8/site-packages/distributed/comm/tcp.py", line 248, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError
distributed.batched - INFO - Batched Comm Closed <TCP (closed) Scheduler connection to worker local=tcp://172.31.35.227:8786 remote=tcp://172.31.45.150:36594>

I mention this also because there are known issues around BatchedSend reconnection (#5481; see the review of #5457 for more). I'm not sure whether that matters in this particular case, though.

Note that these numbers leave 35 workers unaccounted for. After the "Scheduler closing after being idle for 300.00 s" message, we see only 12 Remove worker messages, but by my count of how many times the scheduler registered a worker, we should have seen 47 Remove worker messages at the end (47 - 12 = 35). (When the cluster first started, 47 workers connected, so this number lines up.)

cc @sevberg
