Description
The worker currently implements a tracing system to link cause and effect and follow all transitions that were triggered by a given stimulus. This trace ID is usually referred to as stimulus_id.
The scheduler generates some of these stimulus_ids and includes them in RPC calls to the worker in a few places. However, it does not trace its own transitions, which makes it very hard to infer why such a stimulus was generated. Introducing the same system on the scheduler side and including the appropriate IDs in requests to the worker would allow us to close the circle, reconstruct a cluster-wide history, and link all transitions that were caused by an event.
The most difficult thing to figure out is where to generate the unique stimulus_ids since if we just keep on passing the IDs through every call, every transition would be linked by the same ID.
My thinking is that new events/stimulus IDs should be generated on the following events (please correct me if I miss anything):
- (Scheduler) Update graph
- (Scheduler) Remove worker
- (Scheduler) Remove client
- (Scheduler) steal-request
- (Scheduler) delete-worker-data
- (Scheduler) everything AMM does
- (Client/Scheduler) cancel key
- (Worker) task-finished
- (Worker) task-erred
- (Worker) add_keys (new replica)
All other state-modifying handlers should accept a stimulus ID and forward it accordingly through the transition engine.
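To make the split between generating and forwarding concrete, here is a minimal sketch. It is not the actual scheduler code: `ToyScheduler`, `new_stimulus_id`, the handler names, and the log layout are all made up for illustration. It only shows an originating handler minting a fresh ID and a non-originating handler forwarding the ID it was given.

```python
import itertools
import time

_counter = itertools.count()


def new_stimulus_id(event: str) -> str:
    """Hypothetical helper: build a unique ID for a newly observed event."""
    return f"{event}-{next(_counter)}-{time.time()}"


class ToyScheduler:
    """Toy stand-in for a scheduler; NOT the real distributed.Scheduler."""

    def __init__(self):
        self.state = {}  # key -> current state name
        self.log = []    # (key, start, finish, stimulus_id), assumed layout

    def transition(self, key, finish, *, stimulus_id):
        # Every transition records the stimulus that caused it.
        start = self.state.get(key, "released")
        self.state[key] = finish
        self.log.append((key, start, finish, stimulus_id))

    def update_graph(self, keys):
        # Originating event: a fresh ID is generated here.
        stimulus_id = new_stimulus_id("update-graph")
        for key in keys:
            self.transition(key, "waiting", stimulus_id=stimulus_id)
        return stimulus_id

    def handle_task_finished(self, key, *, stimulus_id):
        # Not an origin: the ID arrives with the message and is forwarded.
        self.transition(key, "memory", stimulus_id=stimulus_id)


scheduler = ToyScheduler()
sid_graph = scheduler.update_graph(["x", "y"])
sid_worker = new_stimulus_id("task-finished")  # would be generated on the worker
scheduler.handle_task_finished("x", stimulus_id=sid_worker)
```

Making `stimulus_id` a required keyword-only argument means a handler that forgets to forward it fails loudly instead of silently producing an untraceable transition.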
Similar to the worker, the scheduler's story should filter not only on keys but also on stimulus IDs.
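A rough sketch of what such a filter could look like, assuming a transition log of `(key, start, finish, stimulus_id)` tuples. Both the log layout and this `story` function are illustrative assumptions, not the existing worker implementation.

```python
def story(log, *keys_or_stimulus_ids):
    """Return log entries whose key or stimulus ID matches any argument.

    Assumes each entry is (key, start, finish, stimulus_id).
    """
    wanted = set(keys_or_stimulus_ids)
    return [entry for entry in log if entry[0] in wanted or entry[-1] in wanted]


# Hypothetical transition log for illustration only.
log = [
    ("x", "released", "waiting", "update-graph-1"),
    ("y", "released", "waiting", "update-graph-1"),
    ("x", "waiting", "memory", "task-finished-2"),
]

by_key = story(log, "x")                    # everything that happened to key "x"
by_stimulus = story(log, "update-graph-1")  # everything caused by one event
```

Filtering by stimulus ID answers "what did this event cause?", while filtering by key answers "what happened to this task?".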