Description
User-observed behaviour: a key is shown as processing on a worker even though no progress is observed. If the worker is inspected directly, it is unaware of the key.
The following outlines the chain of events leading up to this deadlock. The workers are called Alice, Bob and Chuck. K is the key/task to be stolen. Other highlighted words are either data collections or methods of the work stealing plugin or the `TaskState` object.
I have not been able to reproduce it yet; this is purely theoretical and based on the observations in #5366.
Race condition:
- Scheduler: Transition K to processing
  - K is `processing_on` Alice
  - K is `stealable`
- Balance: K is `stealable` => Maybe steal
- Steal request to Alice, ID: 1
  - K removed from `stealable`
  - K put in `in_flight` w/ victim: Alice, thief: Chuck
- Balance: K not in `stealable` => Skip
- Scheduler: Transition processing->released
  - K removed from `in_flight`
  - K removed from `stealable`
- Scheduler: Transition ...->processing
  - K is `processing_on` Bob
  - K added to `stealable`
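The sequence above can be replayed with a toy model of the scheduler-side bookkeeping. This is only a sketch under assumptions: `ToyStealing` and its methods are hypothetical names invented for illustration, not the actual work stealing plugin, and the `pending_responses` list merely stands in for messages still on the wire.

```python
# Toy model of the bookkeeping described above; NOT the real distributed code.
# All class and method names here are hypothetical.


class ToyStealing:
    def __init__(self):
        self.processing_on = {}      # key -> worker the scheduler believes runs it
        self.stealable = set()       # keys that balance may pick up
        self.in_flight = {}          # key -> details of the outstanding steal
        self.pending_responses = []  # (worker, request id) not yet answered

    def transition_to_processing(self, key, worker):
        self.processing_on[key] = worker
        self.stealable.add(key)

    def transition_to_released(self, key):
        self.processing_on.pop(key, None)
        self.in_flight.pop(key, None)
        self.stealable.discard(key)

    def balance(self, key, thief, request_id):
        if key not in self.stealable:
            return False  # nothing to steal, skip
        victim = self.processing_on[key]
        self.stealable.discard(key)
        self.in_flight[key] = {"victim": victim, "thief": thief, "id": request_id}
        self.pending_responses.append((victim, request_id))
        return True


s = ToyStealing()
s.transition_to_processing("K", "Alice")  # K is processing_on Alice, stealable
assert s.balance("K", "Chuck", 1)         # steal request to Alice, ID 1
assert not s.balance("K", "Chuck", 99)    # K not in stealable => skip
s.transition_to_released("K")             # processing -> released
s.transition_to_processing("K", "Bob")    # ... -> processing on Bob

print(s.processing_on, s.stealable, s.in_flight, s.pending_responses)
```

At this point the model holds no `in_flight` entry for K, yet Alice's answer to request ID 1 is still outstanding. That stale answer is what both scenarios below hinge on.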
From here, two scenarios are possible. Both are buggy, although with different severity.
Scenario A:
- `steal-confirm` from Alice, ID: 1
  - K not in `WorkStealing.in_flight`
  - raise `KeyError` and abort `move_task_confirm`
  - BUG: `in_flight_occupancy` never readjusted
Everything afterwards works as expected, with the exception of the wrong occupancy.
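A minimal sketch of why the occupancy stays wrong in Scenario A. It assumes that sending a steal request shifts an estimated duration from victim to thief, and that the release transition only drops the `in_flight` entry. The function bodies and the `0.5` duration are hypothetical and only mirror the names used in this issue.

```python
from collections import defaultdict

# Hypothetical stand-ins for the plugin's collections.
in_flight = {}
in_flight_occupancy = defaultdict(float)


def request_steal(key, victim, thief, duration):
    # Assumption: occupancy is adjusted optimistically when the request is sent.
    in_flight[key] = {"victim": victim, "thief": thief, "duration": duration}
    in_flight_occupancy[victim] -= duration
    in_flight_occupancy[thief] += duration


def release(key):
    # Assumption: releasing drops the entry but leaves the occupancy untouched.
    in_flight.pop(key, None)


def move_task_confirm(key):
    info = in_flight.pop(key)  # raises KeyError for a stale confirmation
    # Never reached in Scenario A, so the adjustment is not undone:
    in_flight_occupancy[info["victim"]] += info["duration"]
    in_flight_occupancy[info["thief"]] -= info["duration"]


request_steal("K", "Alice", "Chuck", 0.5)  # steal request ID 1
release("K")                               # processing -> released
try:
    move_task_confirm("K")                 # late steal-confirm from Alice
    aborted = False
except KeyError:
    aborted = True

print(aborted, dict(in_flight_occupancy))
```

In this model the handler aborts and the optimistic adjustment for Alice and Chuck is left in place, which corresponds to the wrong occupancy described above.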
Scenario B:
- Balance: K in `stealable` => `maybe_steal`
- Steal request to Bob, ID: 2
  - K removed from `stealable`
  - K put in `in_flight`; victim: Bob, thief: Chuck
- Response from Alice, ID: 1
  - Pop K from `in_flight`
  - K not `processing_on` victim (Alice)
  - Recalculate occupancy
- Response from Bob, ID: 2
  - K not in `in_flight`
  - raise `KeyError` and abort `move_task_confirm`
  - BUG: Bob confirmed the steal and forgot the task.
    - Scheduler never registers the steal confirmation
    - Chuck is never assigned as `processing_on`
    - Task is never sent to Chuck
  - BUG: Following attempts to steal return `state: None`, which results in an `already-computing` message
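Scenario B can be sketched the same way. The model below assumes `in_flight` is keyed by task only, so a response cannot be matched to the request that triggered it, and that the handler compares `processing_on` against the responding worker, as the steps above describe. All function names and bodies are hypothetical, not the plugin's real code.

```python
# State after the race: K was rescheduled onto Bob and is stealable again.
processing_on = {"K": "Bob"}
stealable = {"K"}
in_flight = {}
log = []


def request_steal(key, thief, request_id):
    victim = processing_on[key]
    stealable.discard(key)
    in_flight[key] = {"victim": victim, "thief": thief, "id": request_id}


def move_task_confirm(key, responder):
    info = in_flight.pop(key)  # KeyError if no steal is recorded for the key
    if processing_on.get(key) != responder:
        # Stale response: the entry is already gone, only occupancy is fixed up.
        log.append(f"{responder}: not the current victim, recalculate occupancy")
        return
    processing_on[key] = info["thief"]
    log.append(f"{key} moved from {responder} to {info['thief']}")


request_steal("K", "Chuck", 2)       # steal request to Bob, ID 2
move_task_confirm("K", "Alice")      # stale response to ID 1 pops Bob's entry
try:
    move_task_confirm("K", "Bob")    # Bob's confirmation finds nothing
    lost = False
except KeyError:
    lost = True

print(lost, processing_on, stealable, in_flight, log)
```

In this model K ends up recorded as processing on Bob, is neither `stealable` nor `in_flight`, and is never handed to Chuck, which would match the user-observed behaviour of a key that is shown as processing but never progresses.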