Transition tracing for scheduler task transitions #5849

@fjetter

The worker currently implements a tracing system to link cause and effect and follow all transitions that were triggered by a given stimulus. This trace ID is usually referred to as stimulus_id.

The scheduler generates some of these stimulus_ids and includes them in RPC calls to the worker in a few places. However, it does not trace its own transitions, which makes it very hard to infer why such a stimulus was generated. Introducing the same system on the scheduler side and including the appropriate IDs in requests to the worker would allow us to close the circle: we could reconstruct a cluster-wide history and link all transitions that were caused by a single event.
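To illustrate what such a cluster-wide history could look like, here is a minimal sketch that merges a scheduler log and a worker log by stimulus_id. The tuple layout, the `cluster_history` helper, the state names, and the ID strings are all assumptions made up for this example and are not the actual log format.

```python
from collections import defaultdict

# Illustrative log entries: (key, start, finish, stimulus_id, timestamp).
# The layout and the ID strings are hypothetical.
scheduler_log = [
    ("x", "released", "waiting", "update-graph-1", 1.0),
    ("x", "waiting", "processing", "update-graph-1", 2.0),
    ("x", "processing", "memory", "task-finished-7", 5.0),
]
worker_log = [
    ("x", "released", "waiting", "update-graph-1", 3.0),
    ("x", "executing", "memory", "task-finished-7", 4.0),
]


def cluster_history(**logs):
    """Group transitions from several logs by stimulus_id, ordered by time."""
    history = defaultdict(list)
    for source, log in logs.items():
        for key, start, finish, stimulus_id, timestamp in log:
            history[stimulus_id].append((timestamp, source, key, start, finish))
    return {sid: sorted(events) for sid, events in history.items()}


history = cluster_history(scheduler=scheduler_log, worker=worker_log)
for event in history["task-finished-7"]:
    print(event)
```

Sorting by timestamp assumes the clocks of the nodes are comparable, which may not hold on a real cluster. The grouping by stimulus_id does not depend on that.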

The most difficult thing to figure out is where to generate new, unique stimulus_ids: if we just keep passing the same ID through every call, every transition ends up linked by that one ID and the trace carries no information.

My thinking is that new stimulus IDs should be generated on the following events (please correct me if I missed anything):

  • (Scheduler) Update graph
  • (Scheduler) Remove worker
  • (Scheduler) Remove client
  • (Scheduler) steal-request
  • (Scheduler) delete-worker-data
  • (Scheduler) everything AMM does
  • (Client/Scheduler) cancel key
  • (Worker) task-finished
  • (Worker) task-erred
  • (Worker) add_keys (new replica)

All other state-modifying handlers should accept a stimulus ID and forward it accordingly through the transition engine.
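The split between handlers that generate an ID and handlers that forward one could be sketched roughly as follows. This is a toy model, not the scheduler's code: `ToyScheduler`, `new_stimulus_id`, the ID format, and the log layout are hypothetical names chosen for illustration.

```python
import itertools
from collections import deque

_counter = itertools.count()


def new_stimulus_id(event: str) -> str:
    """Mint a unique ID for a fresh event (hypothetical format)."""
    return f"{event}-{next(_counter)}"


class ToyScheduler:
    def __init__(self):
        self.states = {}  # key -> current state
        # Each entry: (key, start, finish, stimulus_id)
        self.transition_log = deque(maxlen=1000)

    def transition(self, key, finish, *, stimulus_id):
        """Every transition records the ID of the stimulus that caused it."""
        start = self.states.get(key, "released")
        self.states[key] = finish
        self.transition_log.append((key, start, finish, stimulus_id))

    def update_graph(self, keys):
        """Originating handler: starts a new causal chain, so it mints an ID."""
        stimulus_id = new_stimulus_id("update-graph")
        for key in keys:
            self.transition(key, "waiting", stimulus_id=stimulus_id)
        return stimulus_id

    def handle_task_finished(self, key, *, stimulus_id):
        """Forwarding handler: the ID was minted by the worker and is reused."""
        self.transition(key, "memory", stimulus_id=stimulus_id)


s = ToyScheduler()
sid = s.update_graph(["x", "y"])
s.handle_task_finished("x", stimulus_id="task-finished-w1")
```

Making `stimulus_id` a required keyword-only argument is one way to ensure no transition can be recorded without a cause attached.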

Similar to the worker, the scheduler's story should filter not only on keys but also on stimulus IDs.
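A story that filters on either field might look something like this. The `Transition` record and this `story` function are assumptions for the sake of the example and do not reproduce the worker's actual implementation.

```python
from typing import NamedTuple


class Transition(NamedTuple):
    # Hypothetical log record; field names are assumptions.
    key: str
    start: str
    finish: str
    stimulus_id: str


def story(log, *keys_or_stimulus_ids):
    """Return all log entries whose key or stimulus_id matches any argument."""
    wanted = set(keys_or_stimulus_ids)
    return [t for t in log if t.key in wanted or t.stimulus_id in wanted]


log = [
    Transition("x", "released", "waiting", "update-graph-1"),
    Transition("y", "released", "waiting", "update-graph-1"),
    Transition("x", "waiting", "processing", "update-graph-1"),
    Transition("y", "waiting", "released", "cancel-key-2"),
]

by_key = story(log, "x")                  # everything that happened to x
by_stimulus = story(log, "cancel-key-2")  # everything caused by one event
```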
