Mass key replication may sit idly for many seconds

This is a follow-up from #6342 

1. A worker, a, owns task x in memory
2. compute-task issues `{op: compute-task, key: y, who_has: {x: [a]}}` to workers b, c, and maybe more. Alternatively, an AMM policy suggests creating 2+ more replicas of x: `{op: acquire-replicas, who_has: {x: [a]}}`
3. Worker b is already transferring a very large piece of data from a; which will take many seconds to finish. When the compute-task/acquire-replica request arrives, x is parked in fetch state because a is in flight.
4. This is not the case for worker c, which quickly acquires x and informs the scheduler.
5. Shortly afterwards, b receives `{op: compute-task, key: z, who_has: {x: [a, c]}}`. Alternatively, 2 seconds have passed and AMM reiterates that b should acquire a replica of x with a new command  `{op: acquire-replicas, who_has: {x: [a, c]}}`
6. **Expected result:** x is transitioned immediately from fetch to flight and acquired from c
**Actual result:** x sits idly in fetch state until something else triggers ensure_communicating. In the worst case scenario, it may be worker a coming out of flight 10-15 seconds later.

#6342 introduces a new test, ``test_new_replica_while_all_workers_in_flight``, currently marked as `@skip`, to test this.

It was initially proposed to enable a fetch->fetch transition as follows:
```python
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -2685,9 +2685,6 @@ class Worker(ServerNode):
             assert not args
             finish, *args = finish  # type: ignore
 
-        if ts.state == finish:
-            return {}, []
-
         start = ts.state
         func = self._TRANSITIONS_TABLE.get((start, cast(TaskStateState, finish)))
 
```
The above doesn't work, as it sends the key into missing state which is then "rescued" by find_missing 1 second later.

CC @fjetter 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Mass key replication may sit idly for many seconds #6446

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Mass key replication may sit idly for many seconds #6446

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions