Introduction
The scheduler offers a handler handle_missing_data that is called in
one place on the worker, during the response handling of gather_dep.
This handling has been known to cause stability and deadlocking issues. In
recent refactoring attempts, the Worker.find_missing functionality has been
flagged as a source of complexity, and I'm wondering if we should remove this
entire system in favor of a much simpler mechanism.
Looking at the scheduler side, reading the code and the doc string, there are a couple
of different use cases in which this handler is supposed to be called. It always
assumes that the signal originates from a worker, called worker, which reports
that another worker, errant_worker, is not in possession of a key key.
The doc string explains that this can occur in two situations:
1.) worker is operating on stale data, i.e. worker.TaskState.who_has / Worker.has_what
includes stale information, such that after requesting Worker.get_data on
errant_worker it returns a response without data for key.
2.) errant_worker is indeed dead, which is likely inferred by worker because
it encountered a network error.
If the scheduler already knows about the death of errant_worker, this request is
ignored.
Otherwise, the scheduler accepts this signal as the ultimate truth and removes
all state indicating that key was on errant_worker. There is some transition
logic in place to heal and/or raise the problem, as appropriate.
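Very roughly, the scheduler-side handling described above amounts to something like the
following sketch. It uses plain dicts and sets instead of the real
Scheduler/TaskState/WorkerState objects; the names and the recovery step are illustrative,
not the actual implementation.

```python
# Minimal sketch of the scheduler-side handling described above, modelled with
# plain dicts/sets instead of the real Scheduler/TaskState/WorkerState classes.
# Names and the recovery step are illustrative, not the actual code.

workers: set[str] = set()            # addresses the scheduler believes to be alive
who_has: dict[str, set[str]] = {}    # key -> worker addresses holding a replica
has_what: dict[str, set[str]] = {}   # worker address -> keys it holds


def handle_missing_data(key: str, worker: str, errant_worker: str) -> None:
    # `worker` is the reporting worker; it is not needed in this sketch.
    if errant_worker not in workers:
        # The scheduler already knows about the death of errant_worker: ignore.
        return
    # Accept the signal as truth: forget that `key` was replicated on errant_worker.
    who_has.get(key, set()).discard(errant_worker)
    has_what.get(errant_worker, set()).discard(key)
    if not who_has.get(key):
        # The last known replica is gone; the real scheduler would now run its
        # transition logic to recompute the task or raise the problem.
        print(f"{key} has no replicas left; transition to released/missing")
```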
The circumstances under which a worker actually calls this handler are a bit
harder to describe. I am much more interested in whether or not this handler
makes sense on the scheduler side.
Investigating use case 1.) - The worker is operating on stale data. There are
two possible reasons for this:
1.a) The data is indeed stale. Something happened that caused us to request
key to be forgotten on errant_worker, but we didn't tell worker.
I believe the only valid reason for this is AMM requesting worker to forget
the data. In all other instances where we would properly release keys due to
transitions, worker would eventually be asked to release the key as well.
I would argue that this case is not something a worker should escalate to the
scheduler. If anything, the worker should ask the scheduler for the most recent,
accurate information and try again. Particularly with the changes proposed in
#6435, where the worker indeed removes no-longer-accurate entries, this use
case feels obsolete.
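If we went down that route, the worker-side reaction to a failed fetch could look
roughly like the sketch below. refresh_who_has and fetch_from are made-up placeholders
(presumably the scheduler's who_has handler would be the source for the refreshed
information), and the real logic would of course have to go through the state machine.

```python
# Hypothetical worker-side reaction to a failed fetch: instead of escalating to
# the scheduler via handle_missing_data, drop the stale entry, refresh the
# (possibly stale) who_has information, and retry. refresh_who_has/fetch_from
# are placeholders, not the real gather_dep code.

async def refresh_who_has(key: str) -> set[str]:
    """Ask the scheduler which workers currently hold `key` (placeholder)."""
    return set()


async def fetch_from(address: str, key: str) -> "object | None":
    """Try to fetch `key` from a peer; return None if it is missing (placeholder)."""
    return None


async def fetch_with_refresh(key: str, candidates: set[str]) -> "object | None":
    for address in list(candidates):
        data = await fetch_from(address, key)
        if data is not None:
            return data
        # Stale entry: drop it locally instead of reporting missing data.
        candidates.discard(address)
    # All known replicas failed: refresh from the scheduler and try once more.
    for address in await refresh_who_has(key):
        data = await fetch_from(address, key)
        if data is not None:
            return data
    return None  # leave the task in `fetch`; the scheduler will sort it out
```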
1.b) The scheduler has faulty information
I am struggling to come up with a situation where the scheduler is indeed
operating on false information since workers are never allowed to forget data
without the scheduler telling them so. It is possible for workers to have more
data than the scheduler knows of (e.g. because messages were not yet delivered),
but it's hard to come up with a scenario where the worker lost data.
This could obviously happen if a worker were to lose its data due to
external factors, e.g. the disk is lost or a user deliberately removes keys from
Worker.data. I would argue that in this situation errant_worker should notice
that an external party is messing with its state and it should escalate to the
scheduler, if we even want to deal with this edge case.
2.) errant_worker is indeed dead. The scheduler would eventually notice
because Scheduler.stream_comms[errant_worker] would be aborted (at the latest
after distributed.comm.timeouts.tcp seconds).
The only reason why worker should ping the scheduler on this event is if we
want to accelerate recovery. From a consistency perspective, this should not be
required.
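For reference, that timeout is an ordinary dask config key and can be adjusted like any
other setting; the value below is only illustrative.

```python
import dask

# Lower the TCP comm timeout so a dead worker is detected sooner
# (illustrative value, not a recommendation).
dask.config.set({"distributed.comm.timeouts.tcp": "15s"})
```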
Conclusion
I am starting to believe that worker should simply remove errant_worker from
its internal bookkeeping and rely on the scheduler to eventually clean up. We
would accept that a worker.TaskState in state fetch is allowed to have an
empty who_has and we'd rely on the scheduler to clean this up eventually.
The only functionality we'd need on Worker is a mechanism to eventually
refresh the worker's who_has/has_what information.
This could be a system that is not directly connected to the state machine
and that ensures that all, or subsets of, the data locality information
(who_has/has_what) is eventually updated. This generic system would also
benefit busy tasks.
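To illustrate, here is a minimal sketch of what that could look like, entirely decoupled
from the state machine. forget_peer, refresh_data_locality, and the plain dict
bookkeeping are made up for illustration; a real implementation would have to live
alongside the worker's state machine.

```python
import asyncio
from typing import Awaitable, Callable

# Rough sketch of the proposed mechanism, with made-up names and plain dict
# bookkeeping instead of the real worker state machine.

who_has: dict[str, set[str]] = {}    # key (in state `fetch`) -> candidate peers
has_what: dict[str, set[str]] = {}   # peer worker -> keys we believe it holds


def forget_peer(errant_worker: str) -> None:
    """Drop errant_worker from local bookkeeping; tasks may end up in `fetch`
    with an empty who_has, which we now explicitly allow."""
    for key in has_what.pop(errant_worker, set()):
        who_has.get(key, set()).discard(errant_worker)


async def refresh_data_locality(
    query_scheduler: Callable[[list[str]], Awaitable[dict[str, set[str]]]],
    interval: float = 5.0,
) -> None:
    """Periodically refresh who_has/has_what for keys we still want to fetch.

    This runs outside the state machine; the same loop could also re-poll
    peers that previously answered "busy".
    """
    while True:
        stale_keys = [key for key, peers in who_has.items() if not peers]
        if stale_keys:
            fresh = await query_scheduler(stale_keys)
            for key, peers in fresh.items():
                who_has[key] = set(peers)
                for peer in peers:
                    has_what.setdefault(peer, set()).add(key)
        await asyncio.sleep(interval)
```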
I believe this reduced coupling would significantly reduce complexity. I'm
wondering if I missed a valid use case.