This is a follow-up from #6342
- A worker, a, owns task x in memory.
- The scheduler issues `{op: compute-task, key: y, who_has: {x: [a]}}` to workers b, c, and maybe more. Alternatively, an AMM policy suggests creating 2+ more replicas of x: `{op: acquire-replicas, who_has: {x: [a]}}` (both payloads are written out after this list).
- Worker b is already transferring a very large piece of data from a, which will take many seconds to finish. When the compute-task/acquire-replicas request arrives, x is parked in `fetch` state because a is in flight.
- This is not the case for worker c, which quickly acquires x and informs the scheduler.
- Shortly afterwards, b receives `{op: compute-task, key: z, who_has: {x: [a, c]}}`. Alternatively, 2 seconds have passed and the AMM reiterates that b should acquire a replica of x with a new command: `{op: acquire-replicas, who_has: {x: [a, c]}}`.
- Expected result: x is immediately transitioned from `fetch` to `flight` and acquired from c.
- Actual result: x sits idly in `fetch` state until something else triggers `ensure_communicating`. In the worst case, that may be worker a coming out of flight 10-15 seconds later.
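For concreteness, here are the request payloads from the list above written out as Python dicts. This is a simplification: the real messages in distributed carry additional fields, and only the parts relevant to this issue are shown.

```python
# First round: sent to workers b and c. Compute y, whose dependency x
# has a single replica, on worker a.
compute_task_y = {"op": "compute-task", "key": "y", "who_has": {"x": ["a"]}}

# First round, AMM variant: ask the worker to acquire a replica of x.
acquire_replicas_1 = {"op": "acquire-replicas", "who_has": {"x": ["a"]}}

# Second round: worker c now also holds x, so who_has lists both peers.
compute_task_z = {"op": "compute-task", "key": "z", "who_has": {"x": ["a", "c"]}}

# Second round, AMM variant: the reiterated acquire-replicas command.
acquire_replicas_2 = {"op": "acquire-replicas", "who_has": {"x": ["a", "c"]}}
```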
#6342 introduces a new test for this scenario, `test_new_replica_while_all_workers_in_flight`, currently marked as `@skip`.
It was initially proposed to enable a `fetch->fetch` transition as follows:

```diff
--- a/distributed/worker.py
+++ b/distributed/worker.py
@@ -2685,9 +2685,6 @@ class Worker(ServerNode):
             assert not args
             finish, *args = finish  # type: ignore
 
-        if ts.state == finish:
-            return {}, []
-
         start = ts.state
         func = self._TRANSITIONS_TABLE.get((start, cast(TaskStateState, finish)))
```
The above doesn't work, as it sends the key into `missing` state, which is then "rescued" by `find_missing` 1 second later.
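One direction that might avoid the `missing` detour, as a rough sketch only: merge the fresh `who_has` into the task while it is still in `fetch` and re-run `ensure_communicating` right away, so the new replica on c is considered without waiting for a to come out of flight. `WorkerSketch`, `TaskStub`, and `handle_new_who_has` below are hypothetical stand-ins, not the real `distributed.Worker` internals.

```python
from dataclasses import dataclass, field


@dataclass
class TaskStub:
    # Minimal stand-in for distributed's TaskState; only the fields
    # used in this sketch.
    key: str
    state: str = "fetch"
    who_has: set[str] = field(default_factory=set)


class WorkerSketch:
    """Hypothetical fragment, not the real distributed.Worker API."""

    def __init__(self) -> None:
        self.tasks: dict[str, TaskStub] = {}

    def ensure_communicating(self) -> None:
        # In the real worker this picks fetch-state tasks and starts
        # gathering their data from peers; stubbed out here.
        print("re-evaluating which keys can be fetched")

    def handle_new_who_has(self, key: str, workers: set[str]) -> None:
        # Merge fresh replica info and, if the key is parked in "fetch",
        # retry immediately instead of waiting for an unrelated event
        # (e.g. worker a coming out of flight) to wake the worker up.
        ts = self.tasks.get(key)
        if ts is None:
            return
        ts.who_has |= workers  # e.g. {"a"} -> {"a", "c"}
        if ts.state == "fetch":
            self.ensure_communicating()


w = WorkerSketch()
w.tasks["x"] = TaskStub("x", who_has={"a"})
w.handle_new_who_has("x", {"c"})  # triggers an immediate fetch retry
```

The point of the sketch is that the update path never re-enters the transition machinery at all, so there is no `fetch->fetch` transition to special-case and no way to fall through to `missing`.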
CC @fjetter