[metrics] Add achilles_processing_duration_seconds metric for tracking e2e object processing latency #41

Merged
harveyxia merged 9 commits into main from
metrics-processing-duration
May 19, 2025

Conversation

@harveyxia
Collaborator

@harveyxia harveyxia commented May 8, 2025

Motivation

Controller-runtime has built-in duration metrics, but none of them track e2e object processing latency qualified by success or failure:

  1. workqueue_queue_duration_seconds_bucket: events can be requeued and rate limited, and the time spent waiting due to rate limits is not measured
  2. workqueue_work_duration_seconds_bucket: not qualified by processing success or failure

We want a metric that closely approximates what matters to end users or consuming systems: the amount of time it takes for a spec change to be successfully processed.

Implementation

Instruments a new histogram metric achilles_processing_duration_seconds tracking the time from when the controller receives an update to an object's spec to when that change is processed.

The processing time span for a given spec change is correlated using metadata.generation. A single reconciliation can process multiple generations, which is handled in the instrumentation.
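
To illustrate the correlation by metadata.generation, here is a minimal Go sketch. It is not the code in this PR: the names generationTracker, Seen, Processed, and stepClock are hypothetical, the Prometheus histogram is left out (the durations are simply returned), and a plain map is used for storage to keep the example short. It assumes that a successful reconcile at generation G also covers every older generation still pending for that object.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// objectKey identifies an object by namespace and name.
type objectKey struct{ Namespace, Name string }

// generationTracker remembers when each (object, generation) was first seen.
// Hypothetical type for illustration only.
type generationTracker struct {
	now     func() time.Time
	started map[objectKey]map[int64]time.Time
}

func newGenerationTracker(now func() time.Time) *generationTracker {
	return &generationTracker{now: now, started: map[objectKey]map[int64]time.Time{}}
}

// Seen records a start time the first time a generation is observed.
// Repeated events for the same generation keep the original start time.
func (t *generationTracker) Seen(key objectKey, generation int64) {
	gens, ok := t.started[key]
	if !ok {
		gens = map[int64]time.Time{}
		t.started[key] = gens
	}
	if _, ok := gens[generation]; !ok {
		gens[generation] = t.now()
	}
}

// Processed is called after a successful reconcile that acted on generation.
// It returns one duration per pending generation <= generation, ordered by
// generation, and forgets those generations.
func (t *generationTracker) Processed(key objectKey, generation int64) []time.Duration {
	end := t.now()
	gens := t.started[key]
	var done []int64
	for g := range gens {
		if g <= generation {
			done = append(done, g)
		}
	}
	sort.Slice(done, func(i, j int) bool { return done[i] < done[j] })
	durations := make([]time.Duration, 0, len(done))
	for _, g := range done {
		durations = append(durations, end.Sub(gens[g]))
		delete(gens, g)
	}
	if len(gens) == 0 {
		delete(t.started, key)
	}
	return durations
}

// stepClock returns a deterministic clock that advances by step on each call.
// It only exists to make the example reproducible.
func stepClock(step time.Duration) func() time.Time {
	current := time.Unix(0, 0)
	return func() time.Time {
		current = current.Add(step)
		return current
	}
}

func main() {
	tr := newGenerationTracker(stepClock(time.Second))
	key := objectKey{Namespace: "default", Name: "example"}
	tr.Seen(key, 1) // clock reads 1s
	tr.Seen(key, 2) // clock reads 2s
	// One reconcile at generation 2 (clock reads 3s) covers both generations.
	for _, d := range tr.Processed(key, 2) {
		fmt.Println(d.Seconds()) // prints 2, then 1
	}
}
```

In real instrumentation each returned duration would be passed to the histogram's observe call instead of being printed.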

Details

  1. To avoid double counting failed processing, the metric will only observe a failed (name, namespace, generation) a single time. Once observed, the data entry is marked with a boolean indicating that its failure has already been observed, allowing subsequent observations to skip this one.
  2. To avoid infinitely accumulating memory, data points are deleted as soon as a successful reconciliation is observed.
  3. A b-tree is used as the backing data structure, allowing ordered range queries and efficient storage and retrieval.
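
The bookkeeping in the three points above can be sketched roughly as follows. This is a simplified illustration, not the PR's code: a sorted slice with binary search stands in for the b-tree, timestamps are omitted so the example only shows which generations would be observed, and the names store, entryKey, Track, ObserveFailure, and ObserveSuccess are hypothetical. It also assumes that a generation whose failure was already observed is still observed once more when it later succeeds.

```go
package main

import (
	"fmt"
	"sort"
)

// entryKey orders entries by namespace, then name, then generation, so all
// generations of one object sit next to each other in the sorted store.
type entryKey struct {
	Namespace, Name string
	Generation      int64
}

func (a entryKey) less(b entryKey) bool {
	if a.Namespace != b.Namespace {
		return a.Namespace < b.Namespace
	}
	if a.Name != b.Name {
		return a.Name < b.Name
	}
	return a.Generation < b.Generation
}

type entry struct {
	key             entryKey
	failureObserved bool // set once a failure for this entry has been recorded
}

// store keeps entries sorted by key (stand-in for the b-tree).
type store struct{ entries []entry }

// search returns the index of the first entry whose key is >= k.
func (s *store) search(k entryKey) int {
	return sort.Search(len(s.entries), func(i int) bool { return !s.entries[i].key.less(k) })
}

// Track inserts a pending entry for k unless one already exists.
func (s *store) Track(k entryKey) {
	i := s.search(k)
	if i < len(s.entries) && s.entries[i].key == k {
		return
	}
	s.entries = append(s.entries, entry{})
	copy(s.entries[i+1:], s.entries[i:])
	s.entries[i] = entry{key: k}
}

// pendingRange returns the index range [lo, hi) holding the object's entries
// with generation <= generation. Generations are assumed to be non-negative.
func (s *store) pendingRange(ns, name string, generation int64) (int, int) {
	lo := s.search(entryKey{Namespace: ns, Name: name, Generation: 0})
	hi := s.search(entryKey{Namespace: ns, Name: name, Generation: generation + 1})
	return lo, hi
}

// ObserveFailure returns the generations whose failure should be recorded
// now, skipping any already marked. Entries stay in the store.
func (s *store) ObserveFailure(ns, name string, generation int64) []int64 {
	lo, hi := s.pendingRange(ns, name, generation)
	var observed []int64
	for i := lo; i < hi; i++ {
		if s.entries[i].failureObserved {
			continue
		}
		s.entries[i].failureObserved = true
		observed = append(observed, s.entries[i].key.Generation)
	}
	return observed
}

// ObserveSuccess returns the generations to record as successful and deletes
// them, so memory does not grow without bound.
func (s *store) ObserveSuccess(ns, name string, generation int64) []int64 {
	lo, hi := s.pendingRange(ns, name, generation)
	var observed []int64
	for i := lo; i < hi; i++ {
		observed = append(observed, s.entries[i].key.Generation)
	}
	s.entries = append(s.entries[:lo], s.entries[hi:]...)
	return observed
}

func main() {
	s := &store{}
	s.Track(entryKey{Namespace: "default", Name: "example", Generation: 1})
	s.Track(entryKey{Namespace: "default", Name: "example", Generation: 2})
	fmt.Println(s.ObserveFailure("default", "example", 2)) // [1 2]
	fmt.Println(s.ObserveFailure("default", "example", 2)) // [] (already observed)
	fmt.Println(s.ObserveSuccess("default", "example", 2)) // [1 2]
	fmt.Println(len(s.entries))                            // 0
}
```

Because the keys sort by object first and generation second, "all pending generations up to G for this object" is a single contiguous range, which is the property an ordered structure such as a b-tree provides without the linear-time inserts of a slice.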

Appendix

A more accurate metric would mark the start time at admission time. But achieving this would require that consuming controllers deploy a webhook and, furthermore, that the webhook share memory with the main controller binary.

I think this requirement is too onerous for the added accuracy. This metric doesn't measure latency incurred by the Kubernetes control plane itself (i.e. the time taken from when Kubernetes receives the object update to when it reflects that update back to WATCHers).

This latency can't be affected by the controller anyway, and we have separate metrics for understanding k8s control plane performance.

✅ Checks

  • CI tests (if present) are passing
  • Adheres to code style for repo
  • Contributor License Agreement (CLA) completed if not a Reddit employee

@harveyxia harveyxia changed the title from "Metrics processing duration" to "[metrics] Add achilles_processing_duration_seconds metric for tracking e2e object processing latency" May 8, 2025
@harveyxia harveyxia force-pushed the metrics-processing-duration branch from 89b7656 to 662eb6b May 8, 2025 15:23
@harveyxia harveyxia force-pushed the metrics-processing-duration branch from 3081845 to d52a38f May 8, 2025 16:32
@harveyxia harveyxia marked this pull request as ready for review May 13, 2025 13:41
@harveyxia harveyxia requested a review from a team as a code owner May 13, 2025 13:41
Contributor

@kdorosh kdorosh left a comment

couple suggestions; very neat stuff! stoked to have this kind of visibility in the SDK

@harveyxia harveyxia requested a review from kdorosh May 13, 2025 17:23
Contributor

@karanthukral karanthukral left a comment

great job on this! took me a bit to follow the approach but makes sense

@harveyxia harveyxia merged commit 8b04ff2 into main May 19, 2025
1 check passed
@harveyxia harveyxia deleted the metrics-processing-duration branch May 19, 2025 13:45
3 participants