Introduction
The scheduler offers a handler handle_missing_data that is called in
one place on the worker, during the response handling of gather_dep.
This handling has been known to cause stability and deadlocking issues. In
recent refactoring attempts, the Worker.find_missing functionality has been
flagged as a source of complexity, and I'm wondering if we should remove this
entire system in favor of a much simpler mechanism.
Looking at the scheduler side, reading the code and the doc string, there are a couple
of different use cases in which this handler is supposed to be called. It always
assumes that the signal originates from a worker, called worker, which reports
that another worker, errant_worker, is not in possession of a key key.
The doc string explains that this can occur in two situations:
1.) worker is operating on stale data, i.e. worker.TaskState.who_has / Worker.has_what
includes stale information, such that after requesting Worker.get_data on
errant_worker it returns a response without data for key.
2.) errant_worker is indeed dead, which is likely inferred by worker because
it encountered a network error.
If the scheduler already knows about the death of errant_worker, this request is
ignored.
Otherwise, the scheduler accepts this signal as the ultimate truth and removes
all state indicating that key was on errant_worker. There is some transition
logic in place to heal and/or raise the problem, as appropriate.
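Very roughly, the scheduler-side handling described above amounts to something like the
following sketch. It uses plain dicts and sets instead of the real
Scheduler/TaskState/WorkerState objects; the names and the recovery step are illustrative,
not the actual implementation.

```python
# Minimal sketch of the scheduler-side handling described above, modelled with
# plain dicts/sets instead of the real Scheduler/TaskState/WorkerState classes.
# Names and the recovery step are illustrative, not the actual code.

workers: set[str] = set()            # addresses the scheduler believes to be alive
who_has: dict[str, set[str]] = {}    # key -> worker addresses holding a replica
has_what: dict[str, set[str]] = {}   # worker address -> keys it holds


def handle_missing_data(key: str, worker: str, errant_worker: str) -> None:
    # `worker` is the reporting worker; it is not needed in this sketch.
    if errant_worker not in workers:
        # The scheduler already knows about the death of errant_worker: ignore.
        return
    # Accept the signal as truth: forget that `key` was replicated on errant_worker.
    who_has.get(key, set()).discard(errant_worker)
    has_what.get(errant_worker, set()).discard(key)
    if not who_has.get(key):
        # The last known replica is gone; the real scheduler would now run its
        # transition logic to recompute the task or raise the problem.
        print(f"{key} has no replicas left; transition to released/missing")
```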
The circumstances under which a worker actually calls this handler are a bit
harder to describe. I am much more interested in whether or not this handler
makes sense on the scheduler side.
Investigating use case 1.) - The worker is operating on stale data. There are
two possible reasons for this:
1.a) The data is indeed stale. Something happened that caused us to request
key to be forgotten on errant_worker, but we didn't tell worker.
I believe the only valid reason for this is AMM requesting worker to forget
the data. In all other instances where we would properly release keys due to
transitions, worker would eventually be asked to release the key as well.
I would argue that this case is not something a worker should escalate to the
scheduler. If anything, the worker should ask the scheduler for the most recent,
accurate information and try again. Particularly with the changes proposed in
#6435, where the worker indeed removes no-longer-accurate entries, this use
case feels obsolete.
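If we went down that route, the worker-side reaction to a failed fetch could look
roughly like the sketch below. refresh_who_has and fetch_from are made-up placeholders
(presumably the scheduler's who_has handler would be the source for the refreshed
information), and the real logic would of course have to go through the state machine.

```python
# Hypothetical worker-side reaction to a failed fetch: instead of escalating to
# the scheduler via handle_missing_data, drop the stale entry, refresh the
# (possibly stale) who_has information, and retry. refresh_who_has/fetch_from
# are placeholders, not the real gather_dep code.

async def refresh_who_has(key: str) -> set[str]:
    """Ask the scheduler which workers currently hold `key` (placeholder)."""
    return set()


async def fetch_from(address: str, key: str) -> "object | None":
    """Try to fetch `key` from a peer; return None if it is missing (placeholder)."""
    return None


async def fetch_with_refresh(key: str, candidates: set[str]) -> "object | None":
    for address in list(candidates):
        data = await fetch_from(address, key)
        if data is not None:
            return data
        # Stale entry: drop it locally instead of reporting missing data.
        candidates.discard(address)
    # All known replicas failed: refresh from the scheduler and try once more.
    for address in await refresh_who_has(key):
        data = await fetch_from(address, key)
        if data is not None:
            return data
    return None  # leave the task in `fetch`; the scheduler will sort it out
```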
1.b) The scheduler has faulty information
I am struggling to come up with a situation where the scheduler is indeed
operating on false information since workers are never allowed to forget data
without the scheduler telling them so. It is possible for workers to have more
data than the scheduler knows of (e.g. because messages were not yet delivered),
but it's hard to come up with a scenario where the worker lost data.
This could obviously happen if a worker were to lose its data due to
external factors, e.g. the disk is lost or a user deliberately removes keys from
Worker.data. I would argue that in this situation errant_worker should notice
that an external party is messing with its state and it should escalate to the
scheduler, if we even want to deal with this edge case.
2.) errant_worker is indeed dead. The scheduler would eventually notice
because Scheduler.stream_comms[errant_worker] would be aborted (at the latest
after distributed.comm.timeouts.tcp seconds).
The only reason why worker should ping the scheduler on this event is if we
want to accelerate recovery. From a consistency perspective, this should not be
required.
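For reference, that timeout is an ordinary dask config key and can be adjusted like any
other setting; the value below is only illustrative.

```python
import dask

# Lower the TCP comm timeout so a dead worker is detected sooner
# (illustrative value, not a recommendation).
dask.config.set({"distributed.comm.timeouts.tcp": "15s"})
```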
Conclusion
I am starting to believe that worker should simply remove errant_worker from
its internal bookkeeping and rely on the scheduler to eventually clean up. We
would accept that a worker.TaskState in state fetch is allowed to have an
empty who_has and we'd rely on the scheduler to clean this up eventually.
The only functionality we'd need on Worker is a mechanism to eventually
refresh the worker's who_has/has_what information.
This could be a system that is not directly connected to the state machine
and that ensures that all, or subsets of, the data locality information
(who_has/has_what) is eventually updated. This generic system would also
benefit busy tasks.
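To illustrate, here is a minimal sketch of what that could look like, entirely decoupled
from the state machine. forget_peer, refresh_data_locality, and the plain dict
bookkeeping are made up for illustration; a real implementation would have to live
alongside the worker's state machine.

```python
import asyncio
from typing import Awaitable, Callable

# Rough sketch of the proposed mechanism, with made-up names and plain dict
# bookkeeping instead of the real worker state machine.

who_has: dict[str, set[str]] = {}    # key (in state `fetch`) -> candidate peers
has_what: dict[str, set[str]] = {}   # peer worker -> keys we believe it holds


def forget_peer(errant_worker: str) -> None:
    """Drop errant_worker from local bookkeeping; tasks may end up in `fetch`
    with an empty who_has, which we now explicitly allow."""
    for key in has_what.pop(errant_worker, set()):
        who_has.get(key, set()).discard(errant_worker)


async def refresh_data_locality(
    query_scheduler: Callable[[list[str]], Awaitable[dict[str, set[str]]]],
    interval: float = 5.0,
) -> None:
    """Periodically refresh who_has/has_what for keys we still want to fetch.

    This runs outside the state machine; the same loop could also re-poll
    peers that previously answered "busy".
    """
    while True:
        stale_keys = [key for key, peers in who_has.items() if not peers]
        if stale_keys:
            fresh = await query_scheduler(stale_keys)
            for key, peers in fresh.items():
                who_has[key] = set(peers)
                for peer in peers:
                    has_what.setdefault(peer, set()).add(key)
        await asyncio.sleep(interval)
```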
I believe this reduced coupling would significantly reduce complexity. I'm
wondering if I missed a valid use case.