Transition tracing for scheduler task transitions

The worker currently implements a tracing system that links cause and effect, so that all transitions triggered by a given stimulus can be followed. Each stimulus carries a trace ID, usually referred to as `stimulus_id`.

The scheduler generates some of these `stimulus_id`s and includes them in RPC calls to the worker in a few places. However, it does not trace its own transitions, which makes it very hard to infer why such a stimulus was generated. Introducing the same system on the scheduler side and including the appropriate IDs in requests to the worker would allow us to close the circle: we could reconstruct a cluster-wide history and link all transitions that were caused by a single event.

The most difficult thing to figure out is where to generate the unique `stimulus_id`s. If we just keep passing the IDs through every call, every transition would end up linked by the same ID.

My thinking is that new stimulus IDs should be generated on the following events (please correct me if I missed anything):

* (Scheduler) Update graph
* (Scheduler) Remove worker
* (Scheduler) Remove client
* (Scheduler) steal-request
* (Scheduler) delete-worker-data
* (Scheduler) everything AMM does
* (Client/Scheduler) cancel key
* (Worker) task-finished
* (Worker) task-erred
* (Worker) add_keys (new replica)
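As a rough illustration of what generating an ID at one of the events above could look like, here is a minimal sketch. The helper name `new_stimulus_id` and the `<event>-<timestamp>-<counter>` format are assumptions made for this example, not an existing API.

```python
import itertools
import time

# Process-local counter; guards against two IDs sharing the same timestamp.
_stimulus_counter = itertools.count()


def new_stimulus_id(event: str) -> str:
    """Return a fresh ID for a root event (hypothetical helper).

    The event name is kept as a prefix so the origin of a chain of
    transitions is readable directly from the ID.
    """
    return f"{event}-{time.time()}-{next(_stimulus_counter)}"


id_a = new_stimulus_id("remove-worker")
id_b = new_stimulus_id("remove-worker")
```

Keeping the event name in the ID means a log entry already says which kind of event started the chain, without a separate lookup.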

All other state-modifying handlers should accept a stimulus ID and forward it accordingly through the transition engine.
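The split between handlers that generate an ID and code that only forwards it can be sketched with a toy model. Everything here (`ToyScheduler`, its attributes, the log tuple layout) is hypothetical and much simpler than the real scheduler; it only shows the ID being created once and passed down unchanged.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from itertools import count


@dataclass
class ToyScheduler:
    """Hypothetical stand-in for the scheduler, for illustration only."""

    tasks: dict[str, str] = field(default_factory=dict)  # key -> state
    who_has: dict[str, str] = field(default_factory=dict)  # key -> worker
    # Entries are (key, start_state, finish_state, stimulus_id).
    transition_log: list[tuple[str, str, str, str]] = field(default_factory=list)
    _ids: count = field(default_factory=count)

    def transition(self, key: str, finish: str, *, stimulus_id: str) -> None:
        # The transition engine never creates an ID; it only records it.
        start = self.tasks[key]
        self.tasks[key] = finish
        self.transition_log.append((key, start, finish, stimulus_id))

    def transitions(self, recommendations: dict[str, str], *, stimulus_id: str) -> None:
        # Forward the same ID to every transition caused by this stimulus.
        for key, finish in recommendations.items():
            self.transition(key, finish, stimulus_id=stimulus_id)

    def remove_worker(self, worker: str) -> str:
        # Root event: this is the one place a new ID is generated.
        stimulus_id = f"remove-worker-{next(self._ids)}"
        recs = {k: "released" for k, w in self.who_has.items() if w == worker}
        for key in recs:
            del self.who_has[key]
        self.transitions(recs, stimulus_id=stimulus_id)
        return stimulus_id


scheduler = ToyScheduler(
    tasks={"x": "memory", "y": "memory", "z": "memory"},
    who_has={"x": "w1", "y": "w1", "z": "w2"},
)
first = scheduler.remove_worker("w1")
second = scheduler.remove_worker("w2")
```

Making `stimulus_id` a required keyword-only argument on the forwarding methods means a handler that forgets to pass it fails immediately instead of silently starting an untraced chain.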

Similar to the worker, the story should filter not only on keys but also on stimulus IDs.
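A minimal sketch of such a filter, assuming log entries are tuples of the form `(key, start, finish, stimulus_id)`. The tuple layout and the free function `story` are assumptions for this example and do not mirror the actual log format.

```python
def story(log, *keys_or_stimuli):
    """Return the log entries whose key or stimulus ID was requested."""
    wanted = set(keys_or_stimuli)
    # entry[0] is the task key, entry[-1] is the stimulus ID.
    return [entry for entry in log if entry[0] in wanted or entry[-1] in wanted]


log = [
    ("x", "memory", "released", "remove-worker-0"),
    ("y", "memory", "released", "remove-worker-0"),
    ("z", "processing", "memory", "task-finished-1"),
]

by_key = story(log, "x")
by_stimulus = story(log, "remove-worker-0")
mixed = story(log, "z", "remove-worker-0")
```

Filtering by stimulus ID returns every transition caused by one event, even when the affected keys are not known in advance.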

Transition tracing for scheduler task transitions #5849

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Transition tracing for scheduler task transitions #5849

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions