Race condition in work stealing resulting in deadlock #5370

@fjetter

User-observed behaviour: a key is shown as processing on a worker even though no progress is made. When the worker is inspected directly, it is unaware of the key.

The following outlines the chain of events leading up to this deadlock. Workers are called Alice, Bob and Chuck. K is the key/task to be stolen. Other highlighted words are either data collections or methods of the work-stealing plugin or the TaskState object.

I have not been able to reproduce it yet; this is purely theoretical and based on the observations in #5366.

Race condition:

  • Scheduler: Transition K to processing

    • K is processing_on Alice
    • K is stealable
  • Balance: K is stealable => Maybe steal

  • steal request to Alice ID:1

    • K removed from stealable
    • K put in_flight w/ victim: Alice, thief: Chuck
  • Balance: K not in stealable => Skip

  • Scheduler: Transition processing->released

    • K removed from in_flight
    • K removed from stealable
  • Scheduler: Transition ...->processing

    • K processing_on Bob
    • K added to stealable
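
The chain of events above can be replayed on a toy model. This is a minimal sketch, not the real work-stealing plugin: the class `ToyStealing`, its methods, and the `pending` list are hypothetical names introduced only for illustration. It shows that after the last transition, request ID 1 to Alice is still outstanding although `in_flight` no longer knows about K.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStealing:
    """Toy stand-in for the stealing bookkeeping (hypothetical, not the real plugin)."""

    processing_on: dict = field(default_factory=dict)  # key -> worker name
    stealable: set = field(default_factory=set)
    in_flight: dict = field(default_factory=dict)      # key -> steal info
    pending: list = field(default_factory=list)        # requests still awaiting a reply

    def transition_to_processing(self, key, worker):
        self.processing_on[key] = worker
        self.stealable.add(key)

    def transition_to_released(self, key):
        # The scheduler forgets the steal, but the request is already on the wire.
        self.processing_on.pop(key, None)
        self.in_flight.pop(key, None)
        self.stealable.discard(key)

    def balance(self, key, thief, request_id):
        if key not in self.stealable:
            return False  # Balance: K not in stealable => skip
        victim = self.processing_on[key]
        self.stealable.discard(key)
        self.in_flight[key] = {"victim": victim, "thief": thief, "id": request_id}
        self.pending.append((request_id, victim, key))
        return True


s = ToyStealing()
s.transition_to_processing("K", "Alice")
first = s.balance("K", thief="Chuck", request_id=1)   # steal request to Alice
second = s.balance("K", thief="Chuck", request_id=2)  # skipped, K not stealable
s.transition_to_released("K")
s.transition_to_processing("K", "Bob")

print(s.processing_on, s.stealable, s.in_flight, s.pending)
```

At this point K is stealable again and processing on Bob, while Alice's reply to request ID 1 has not yet arrived.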

From here, two scenarios are possible. Both are buggy, although with different severity.

Scenario A:

  • steal-confirm from Alice ID: 1
    • K not in WorkStealing.in_flight
    • raise KeyError and abort move_task_confirm
    • BUG: in_flight_occupancy is never readjusted

Everything afterwards works as expected, except that the occupancy is wrong.
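
Scenario A can be illustrated with a few lines. This is a sketch under assumptions: the function below only mimics the shape of move_task_confirm, and the occupancy values and the `duration` field are made up for the example. The point is that the KeyError aborts the function before the occupancy adjustment made at request time is undone.

```python
# State after the race: in_flight was cleared by processing->released,
# but the adjustment made when request ID 1 was sent is still in place.
in_flight = {}
in_flight_occupancy = {"Alice": -1.0, "Chuck": 1.0}  # assumed example values


def move_task_confirm(key):
    """Toy confirm handler (hypothetical); undoes the in-flight occupancy."""
    info = in_flight.pop(key)  # raises KeyError, K is no longer in in_flight
    in_flight_occupancy[info["victim"]] += info["duration"]
    in_flight_occupancy[info["thief"]] -= info["duration"]


try:
    move_task_confirm("K")  # steal-confirm from Alice, ID 1
    aborted = False
except KeyError:
    aborted = True

print(aborted, in_flight_occupancy)
```

The occupancy entries for Alice and Chuck keep their stale values because the readjustment lines are never reached.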

Scenario B:

  • Balance: K in stealable => maybe_steal

  • steal request to Bob ID: 2

    • K removed from stealable
    • K put in in_flight; victim: Bob, thief: Chuck
  • Response from Alice ID: 1

    • Pop K from in_flight
    • K not processing_on victim (Alice)
    • Recalculate occupancy
  • Response from Bob ID: 2

    • K not in in_flight
    • raise KeyError and abort move_task_confirm
    • BUG
      • Bob confirmed the steal and forgot the task.
      • Scheduler never registers steal confirmation
      • Chuck is never assigned as processing_on
      • Task is never sent to Chuck
      • BUG: Subsequent attempts to steal return state: None, which results in an already-computing message
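
Scenario B can be replayed the same way. Again this is only a sketch: `steal_request`, `worker_reply`, `worker_tasks` and the return strings are hypothetical, and the check against the replying worker follows the description above rather than the actual scheduler code. The stale reply from Alice consumes the in_flight entry that belongs to request ID 2, so Bob's genuine confirmation is dropped.

```python
in_flight = {}
processing_on = {"K": "Bob"}  # scheduler's view
worker_tasks = {"Alice": set(), "Bob": {"K"}, "Chuck": set()}  # workers' view


def steal_request(key, victim, thief, request_id):
    in_flight[key] = {"victim": victim, "thief": thief, "id": request_id}


def worker_reply(worker, key):
    """Worker side: give up the task if it is known, else report None."""
    if key in worker_tasks[worker]:
        worker_tasks[worker].discard(key)  # the worker forgets the task
        return "ready"
    return None


def move_task_confirm(key, worker, state):
    """Toy confirm handler keyed only by task, not by request ID."""
    info = in_flight.pop(key)  # KeyError if the entry was already consumed
    if processing_on.get(key) != worker:
        return "recalculate-occupancy"
    if state == "ready":
        processing_on[key] = info["thief"]
        worker_tasks[info["thief"]].add(key)
        return "stolen"
    return "already-computing"


steal_request("K", victim="Bob", thief="Chuck", request_id=2)

# Response from Alice (ID 1) arrives first and pops the entry for ID 2.
r1 = move_task_confirm("K", "Alice", worker_reply("Alice", "K"))

# Response from Bob (ID 2): Bob has given up the task, but the scheduler aborts.
bob_state = worker_reply("Bob", "K")
try:
    r2 = move_task_confirm("K", "Bob", bob_state)
except KeyError:
    r2 = "aborted"

print(r1, r2, processing_on, worker_tasks)
```

In the final state the scheduler still believes K is processing on Bob, while no worker in the toy model holds the task, which matches the observed deadlock.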
