Add informer lag histograms #11534

Merged

alpeb merged 6 commits into main from alex/informer-lag

Nov 8, 2023

Member

adleong commented Oct 25, 2023

To detect when the destination controller's k8s informers have fallen behind, we add a histogram for each resource type. Each histogram tracks the delta between when a resource is updated and when the destination controller processes that update. We compute this by taking the most recent timestamp in the resource's managed fields and comparing it to the current time.

The histogram metrics are of the form `{kind}_informer_lag_ms_bucket_*`.

* We record a value only for updates, not for adds or deletes. When the controller starts up, it populates its cache with an add for each resource in the cluster, and the delta between that resource's last-updated time and the current time may be large. This does not represent informer lag and should not be counted as such.
* When the informer performs resyncs, we get updates where the updated time of the old version equals the updated time of the new version. This does not represent an actual update to the resource, so we do not record a value.
* Since we compare timestamps set on the managed fields of resources to the current time from the destination controller's system clock, the accuracy of these metrics depends on clock drift being minimal across the cluster.
* We use histogram buckets that range from 500ms to about 17 minutes. In my testing, an informer lag of 500ms-1000ms is typical. However, we want enough buckets to identify cases where the informer is lagging significantly.


          Add informer lag histograms

abc5954

Signed-off-by: Alex Leong <[email protected]>

adleong requested a review from a team as a code owner

October 25, 2023 23:05

alpeb reviewed

Member

alpeb left a comment

This is so good! Tests perfectly 👍


          feedback

Signed-off-by: Alex Leong <[email protected]>

alpeb approved these changes

Member

olix0r commented Oct 30, 2023

> The histogram metrics are of the form `{kind}_informer_lag_ms_bucket_*`.

We haven't been rigorous about this previously, but Prometheus best practices encourage using seconds (as fractional float values) for all time values.

adleong added 2 commits

October 30, 2023 19:01


          Merge branch 'main' into alex/informer-lag

7e51223


          Use secs for lag histogram units

3bf21a8

Signed-off-by: Alex Leong <[email protected]>

alpeb reviewed

controller/api/destination/watcher/prometheus.go

+              	endpointsliceInformerLag = promauto.NewHistogram(
+              		prometheus.HistogramOpts{
+              			Name:    "endpointslice_informer_lag_secs",
+              			Help:    "The amount of time between when an EndpointSlice resource is updated and when an informer observes it",

Member

alpeb Oct 30, 2023

Suggested change

      
-              			Help:    "The amount of time between when an EndpointSlice resource is updated and when an informer observes it",
+              			Help:    "The amount of time between when an EndpointSlices resource is updated and when an informer observes it",

Member

olix0r Nov 8, 2023

I believe it is correct as it is:

:; k get endpointslices.discovery.k8s.io -o yaml |grep kind
  kind: EndpointSlice
kind: List


          make resource kind plural in lag histograms

9de968d

Signed-off-by: Alex Leong <[email protected]>

olix0r reviewed


          seconds

c4cb721

Signed-off-by: Alex Leong <[email protected]>

olix0r approved these changes

alpeb merged commit 1e605dd into main

alpeb deleted the alex/informer-lag branch

November 8, 2023 19:56

alpeb added a commit that referenced this pull request


          Change notes for edge-23.11.2

a0cedb7

## edge-23.11.2

This edge release contains observability improvements and bug fixes to the
Destination controller, and a refinement to the multicluster gateway resolution
logic.

* Fixed an issue where the Destination controller could stop processing service
  profile updates, if a proxy subscribed to those updates stops reading them;
  this is a followup to the issue [#11491] fixed in edge-23.10.3 ([#11546])
* In the Destination controller, added informer lag histogram metrics to track
  whenever the objects tracked are falling behind the state in the
  kube-apiserver ([#11534])
* In the multicluster service mirror, extended the target gateway resolution
  logic to take into account all the possible IPs a hostname might resolve to,
  not just the first one (thanks @MrFreezeex!) ([#11499])
* Added probes to the debug container to appease environments requiring probes
  for all containers ([#11308])

alpeb mentioned this pull request

Change notes for edge-23.11.2 #11600

Merged

alpeb added a commit that referenced this pull request


          Change notes for edge-23.11.2 (#11600)

4018b2f

## edge-23.11.2

This edge release contains observability improvements and bug fixes to the
Destination controller, and a refinement to the multicluster gateway resolution
logic.

* Fixed an issue where the Destination controller could stop processing service
  profile updates, if a proxy subscribed to those updates stops reading them;
  this is a followup to the issue [#11491] fixed in [edge-23.10.3] ([#11546])
* In the Destination controller, added informer lag histogram metrics to track
  whenever the Kubernetes objects watched by the controller are falling behind
  the state in the kube-apiserver ([#11534])
* In the multicluster service mirror, extended the target gateway resolution
  logic to take into account all the possible IPs a hostname might resolve to,
  rather than just the first one (thanks @MrFreezeex!) ([#11499])
* Added probes to the debug container to appease environments requiring probes
  for all containers ([#11308])

[edge-23.10.3]: https://github.com/linkerd/linkerd2/releases/tag/edge-23.10.3
[#11546]: #11546
[#11534]: #11534
[#11499]: #11499
[#11308]: #11308

adleong added a commit that referenced this pull request


          Add informer lag histograms (#11534)

5e7db1d

In order to detect if the destination controller's k8s informers have fallen behind, we add a histogram for each resource type. These histograms track the delta between when an update to a resource occurs and when the destination controller processes that update. We do this by looking at the timestamps on the managed fields of the resource and looking for the most recent update and comparing that to the current time.

The histogram metrics are of the form `{kind}_informer_lag_ms_bucket_*`.

* We record a value only for updates, not for adds or deletes. This is because when the controller starts up, it will populate its cache with an add for each resource in the cluster and the delta between the last updated time of that resource and the current time may be large. This does not represent informer lag and should not be counted as such.
* When the informer performs resyncs, we get updates where the updated time of the old version is equal to the updated time of the new version. This does not represent an actual update of the resource itself and so we do not record a value.
* Since we are comparing timestamps set on the managed fields of resources to the current time from the destination controller's system clock, the accuracy of these metrics depends on clock drift being minimal across the cluster.
* We use histogram buckets which range from 500ms to about 17 minutes. In my testing, an informer lag of 500ms-1000ms is typical. However, we wish to have enough buckets to identify cases where the informer is lagging significantly.

Signed-off-by: Alex Leong <[email protected]>

adleong mentioned this pull request

stable-2.14.4 #11618

Merged

adleong added a commit that referenced this pull request


          stable-2.14.4 (#11618)

41747e8

This stable release improves observability for the control plane by adding
additional logging to the destination controller and by adding histograms which
can detect Kubernetes informer lag. It also adds the ability to configure
protocol detection.

* Improved logging in the destination controller by adding the client pod's
  name to the logging context. This will improve visibility into the messages
  sent and received by the control plane from a specific proxy ([#11532])
* helm: Introduce configurable values for protocol detection ([#11536])
* Fixed an issue where the Destination controller could stop processing service
  profile updates, if a proxy subscribed to those updates stops reading them;
  this is a followup to the issue [#11491] fixed in [stable-2.14.2] ([#11546])
* In the Destination controller, added informer lag histogram metrics to track
  whenever the Kubernetes objects watched by the controller are falling behind
  the state in the kube-apiserver ([#11534])
* proxy: Fix grpc_status metric labels for inbound traffic

[stable-2.14.2]: https://github.com/linkerd/linkerd2/releases/tag/stable-2.14.2
[#11532]: #11532
[#11536]: #11536
[#11546]: #11546
[#11534]: #11534

---------

Signed-off-by: Matei David <[email protected]>
Signed-off-by: Alex Leong <[email protected]>
Co-authored-by: Matei David <[email protected]>
