Skip to content

Mass key replication may sit idly for many seconds #6446

@crusaderky

Description

@crusaderky

This is a follow-up from #6342

  1. A worker, a, owns task x in memory
  2. compute-task issues {op: compute-task, key: y, who_has: {x: [a]}} to workers b, c, and maybe more. Alternatively, an AMM policy suggests creating 2+ more replicas of x: {op: acquire-replicas, who_has: {x: [a]}}
  3. Worker b is already transferring a very large piece of data from a; which will take many seconds to finish. When the compute-task/acquire-replica request arrives, x is parked in fetch state because a is in flight.
  4. This is not the case for worker c, which quickly acquires x and informs the scheduler.
  5. Shortly afterwards, b receives {op: compute-task, key: z, who_has: {x: [a, c]}}. Alternatively, 2 seconds have passed and AMM reiterates that b should acquire a replica of x with a new command {op: acquire-replicas, who_has: {x: [a, c]}}
  6. Expected result: x is transitioned immediately from fetch to flight and acquired from c
    Actual result: x sits idly in fetch state until something else triggers ensure_communicating. In the worst case scenario, it may be worker a coming out of flight 10-15 seconds later.

#6342 introduces a new test, test_new_replica_while_all_workers_in_flight, currently marked as @skip, to test this.

It was initially proposed to enable a fetch->fetch transition as follows:

--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -2685,9 +2685,6 @@ class Worker(ServerNode):
             assert not args
             finish, *args = finish  # type: ignore
 
-        if ts.state == finish:
-            return {}, []
-
         start = ts.state
         func = self._TRANSITIONS_TABLE.get((start, cast(TaskStateState, finish)))
 

The above doesn't work, as it sends the key into missing state which is then "rescued" by find_missing 1 second later.

CC @fjetter

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions