Conversation
Signed-off-by: Alex Leong <[email protected]>
alpeb
reviewed
Oct 27, 2023
Member
alpeb
left a comment
This is so good! Tests perfectly 👍
Signed-off-by: Alex Leong <[email protected]>
alpeb
approved these changes
Oct 27, 2023
Member
We haven't been rigorous about this previously, but Prometheus best practices encourage using seconds (as fractional float values) for all time measurements.
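As a rough illustration of the seconds convention, here is a minimal Go sketch that converts a measured delay to fractional seconds and finds the histogram bucket it would land in. The bucket bounds and the names `lagBucketsSeconds`, `toSeconds`, `bucketFor` and `mustDuration` are assumptions made up for this example; they are not taken from this change.

```go
package main

import (
	"math"
	"time"
)

// lagBucketsSeconds holds illustrative upper bounds, expressed in seconds.
// These values are assumptions for the sketch, not the buckets in this PR.
var lagBucketsSeconds = []float64{0.5, 1, 2.5, 5, 10}

// toSeconds converts a duration to fractional seconds (750ms becomes 0.75).
func toSeconds(d time.Duration) float64 {
	return d.Seconds()
}

// bucketFor returns the smallest upper bound that contains the observation,
// or +Inf when the observation exceeds every configured bound.
func bucketFor(seconds float64) float64 {
	for _, le := range lagBucketsSeconds {
		if seconds <= le {
			return le
		}
	}
	return math.Inf(1)
}

// mustDuration parses strings such as "750ms" and panics on bad input.
// It only exists to keep the example short.
func mustDuration(s string) time.Duration {
	d, err := time.ParseDuration(s)
	if err != nil {
		panic(err)
	}
	return d
}
```

With these assumed bounds, a 750ms delay is observed as 0.75 and counted in the bucket whose upper bound is 1 second.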
Signed-off-by: Alex Leong <[email protected]>
alpeb
reviewed
Oct 30, 2023
endpointsliceInformerLag = promauto.NewHistogram(
	prometheus.HistogramOpts{
		Name: "endpointslice_informer_lag_secs",
		Help: "The amount of time between when an EndpointSlice resource is updated and when an informer observes it",
Member
Suggested change
- Help: "The amount of time between when an EndpointSlice resource is updated and when an informer observes it",
+ Help: "The amount of time between when an EndpointSlices resource is updated and when an informer observes it",
Member
I believe it is correct as it is:
:; k get endpointslices.discovery.k8s.io -o yaml |grep kind
kind: EndpointSlice
kind: List
Signed-off-by: Alex Leong <[email protected]>
olix0r
reviewed
Oct 31, 2023
Signed-off-by: Alex Leong <[email protected]>
olix0r
approved these changes
Nov 8, 2023
alpeb
added a commit
that referenced
this pull request
Nov 9, 2023
## edge-23.11.2

This edge release contains observability improvements and bug fixes to the Destination controller, and a refinement to the multicluster gateway resolution logic.

* Fixed an issue where the Destination controller could stop processing service profile updates, if a proxy subscribed to those updates stops reading them; this is a followup to the issue [#11491] fixed in edge-23.10.3 ([#11546])
* In the Destination controller, added informer lag histogram metrics to track whenever the objects tracked are falling behind the state in the kube-apiserver ([#11534])
* In the multicluster service mirror, extended the target gateway resolution logic to take into account all the possible IPs a hostname might resolve to, not just the first one (thanks @MrFreezeex!) ([#11499])
* Added probes to the debug container to appease environments requiring probes for all containers ([#11308])
alpeb
added a commit
that referenced
this pull request
Nov 9, 2023
## edge-23.11.2

This edge release contains observability improvements and bug fixes to the Destination controller, and a refinement to the multicluster gateway resolution logic.

* Fixed an issue where the Destination controller could stop processing service profile updates, if a proxy subscribed to those updates stops reading them; this is a followup to the issue [#11491] fixed in [edge-23.10.3] ([#11546])
* In the Destination controller, added informer lag histogram metrics to track whenever the Kubernetes objects watched by the controller are falling behind the state in the kube-apiserver ([#11534])
* In the multicluster service mirror, extended the target gateway resolution logic to take into account all the possible IPs a hostname might resolve to, rather than just the first one (thanks @MrFreezeex!) ([#11499])
* Added probes to the debug container to appease environments requiring probes for all containers ([#11308])

[edge-23.10.3]: https://github.com/linkerd/linkerd2/releases/tag/edge-23.10.3
[#11546]: #11546
[#11534]: #11534
[#11499]: #11499
[#11308]: #11308
adleong
added a commit
that referenced
this pull request
Nov 16, 2023
To detect whether the destination controller's k8s informers have fallen behind, we add a histogram for each resource type. These histograms track the delta between when an update to a resource occurs and when the destination controller processes that update. We compute this by finding the most recent update timestamp in the resource's managed fields and comparing it to the current time.
The histogram metrics are of the form `{kind}_informer_lag_ms_bucket_*`.
* We record a value only for updates, not for adds or deletes. This is because when the controller starts up, it will populate its cache with an add for each resource in the cluster and the delta between the last updated time of that resource and the current time may be large. This does not represent informer lag and should not be counted as such.
* When the informer performs resyncs, we get updates where the updated time of the old version is equal to the updated time of the new version. This does not represent an actual update of the resource itself and so we do not record a value.
* Since we are comparing timestamps set on the managed fields of resources to the current time from the destination controller's system clock, the accuracy of these metrics depends on clock drift being minimal across the cluster.
* We use histogram buckets which range from 500ms to about 17 minutes. In my testing, an informer lag of 500ms-1000ms is typical. However, we wish to have enough buckets to identify cases where the informer has fallen significantly behind.
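
The rules above can be sketched in Go roughly as follows. This is a minimal, standard-library-only illustration and not the code from this change: `managedField` is a simplified, hypothetical stand-in for an entry in a Kubernetes object's `metadata.managedFields`, and `latestUpdate`, `updateLag` and `at` are names invented for the example.

```go
package main

import "time"

// managedField is a hypothetical, simplified stand-in for a managedFields
// entry; only the manager name and timestamp are modelled here.
type managedField struct {
	Manager string
	Time    *time.Time // may be unset on an entry
}

// latestUpdate returns the most recent timestamp among the entries.
// The boolean is false when no entry carries a timestamp.
func latestUpdate(fields []managedField) (time.Time, bool) {
	var latest time.Time
	found := false
	for _, f := range fields {
		if f.Time == nil {
			continue
		}
		if !found || f.Time.After(latest) {
			latest = *f.Time
			found = true
		}
	}
	return latest, found
}

// updateLag returns the delay between the newest recorded update and now.
// It reports false for resyncs (old and new share the same last-update time)
// and for objects without timestamps, so that no value is recorded.
func updateLag(oldFields, newFields []managedField, now time.Time) (time.Duration, bool) {
	newTime, ok := latestUpdate(newFields)
	if !ok {
		return 0, false
	}
	if oldTime, ok := latestUpdate(oldFields); ok && oldTime.Equal(newTime) {
		return 0, false
	}
	return now.Sub(newTime), true
}

// at parses an RFC 3339 timestamp for use in examples; it panics on bad input.
func at(s string) *time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return &t
}
```

In this sketch only an update handler would call `updateLag`; add and delete handlers would not call it at all, which matches the first rule. Because `now` comes from the caller's clock, the result inherits whatever clock drift exists between the writer of the timestamp and the controller.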
Signed-off-by: Alex Leong <[email protected]>
Merged
adleong
added a commit
that referenced
this pull request
Nov 16, 2023
This stable release improves observability for the control plane by adding additional logging to the destination controller and by adding histograms which can detect Kubernetes informer lag. It also adds the ability to configure protocol detection.

* Improved logging in the destination controller by adding the client pod's name to the logging context. This will improve visibility into the messages sent and received by the control plane from a specific proxy ([#11532])
* helm: Introduce configurable values for protocol detection ([#11536])
* Fixed an issue where the Destination controller could stop processing service profile updates, if a proxy subscribed to those updates stops reading them; this is a followup to the issue [#11491] fixed in [stable-2.14.2] ([#11546])
* In the Destination controller, added informer lag histogram metrics to track whenever the Kubernetes objects watched by the controller are falling behind the state in the kube-apiserver ([#11534])
* proxy: Fix grpc_status metric labels for inbound traffic

[stable-2.14.2]: https://github.com/linkerd/linkerd2/releases/tag/stable-2.14.2
[#11532]: #11532
[#11536]: #11536
[#11546]: #11546
[#11534]: #11534

---------

Signed-off-by: Matei David <[email protected]>
Signed-off-by: Alex Leong <[email protected]>
Co-authored-by: Matei David <[email protected]>