Skip to content

refresh-who-has can break the worker state machine #6528

@crusaderky

Description

@crusaderky

This is a regression introduced in #6342 + #6348.

  1. Two tasks, x and y, are only available on a busy worker.
    The worker sends request-refresh-who-has to the scheduler.
  2. The scheduler responds with refresh-who-has, stating that x has become missing, while y has gained an additional replica.
  3. The handler for RefreshWhoHasEvent empties x.who_has and recommends a transition to missing.
  4. Before the recommendation can be implemented, the same event handler invokes _ensure_communicating to allow y to transition to flight. This in turn pops x from data_needed - but x has an empty who_has, which trips an AssertionError in _ensure_communicating.
  5. This in turn trips @fail_hard, which violently shuts down the whole worker.

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions