
stable-2.14.4 #11618

Merged

adleong merged 8 commits into release/stable-2.14 from alex/stable-2.14.4 on Nov 16, 2023
Conversation


@adleong adleong commented Nov 16, 2023

This stable release improves observability for the control plane by adding
additional logging to the destination controller and by adding histograms which
can detect Kubernetes informer lag. It also adds the ability to configure
protocol detection timeouts.

  • Improved logging in the destination controller by adding the client pod's
    name to the logging context. This improves visibility into the messages the
    control plane sends to and receives from a specific proxy (#11532)
  • helm: Introduce configurable values for protocol detection (#11536)
  • Fixed an issue where the Destination controller could stop processing service
    profile updates if a proxy subscribed to those updates stops reading them;
    this is a follow-up to the issue [Add update queue to endpoint translator #11491] fixed in stable-2.14.2 (#11546)
  • In the Destination controller, added informer lag histogram metrics to detect
    when the Kubernetes objects watched by the controller fall behind the state
    in the kube-apiserver (#11534)
  • proxy: Fix grpc_status metric labels for inbound traffic

mateiidavid and others added 7 commits November 15, 2023 23:37
This change allows users to configure protocol detection timeout values
(outbound and inbound). Certain environments may find that protocol
detection inhibits debugging and makes it harder to reason about a
client's behaviour. In such cases (and others) it may be desirable to
raise the protocol detection timeout above the default of 10s.

Through this change, users may configure their timeout values either
with install-time settings or through annotations; this follows our
usual proxy configuration model. The proxy uses different timeout values
for the inbound and outbound stacks (even though they use the same
default value) and this change respects that by adding two separate
fields.
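To make the configuration model concrete, here is a minimal Go sketch of how a per-workload annotation override might resolve against an install-time Helm value and the proxy's built-in 10s default. The annotation keys, Helm plumbing, and function names below are illustrative assumptions, not the names defined by this PR.

```go
package main

import "fmt"

// Hypothetical annotation keys; the PR defines the real ones.
const (
	inboundDetectAnnotation  = "config.linkerd.io/proxy-inbound-detect-timeout"  // assumed
	outboundDetectAnnotation = "config.linkerd.io/proxy-outbound-detect-timeout" // assumed
	defaultDetectTimeout     = "10s"                                             // the proxy's default
)

// detectTimeouts resolves the inbound/outbound protocol detection timeouts
// for a pod: a workload annotation overrides the install-time (Helm) value,
// which in turn overrides the proxy's built-in default.
func detectTimeouts(annotations map[string]string, helmInbound, helmOutbound string) (inbound, outbound string) {
	resolve := func(key, helmValue string) string {
		if v, ok := annotations[key]; ok && v != "" {
			return v // per-workload override
		}
		if helmValue != "" {
			return helmValue // install-time setting
		}
		return defaultDetectTimeout
	}
	return resolve(inboundDetectAnnotation, helmInbound), resolve(outboundDetectAnnotation, helmOutbound)
}

func main() {
	in, out := detectTimeouts(map[string]string{outboundDetectAnnotation: "30s"}, "", "15s")
	fmt.Println(in, out) // 15s 30s
}
```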

Signed-off-by: Matei David <[email protected]>
Before an enormous value was used to disable protocol detection, the
field was meant to be configurable. In that refactor, the annotation name
stayed the same instead of reflecting the change in the contract (i.e.
no longer configurable, only toggled). Additionally, there were two typos
in the proxy partials.

Signed-off-by: Matei David <[email protected]>
In order to detect if the destination controller's k8s informers have fallen behind, we add a histogram for each resource type.  These histograms track the delta between when an update to a resource occurs and when the destination controller processes that update.  We do this by inspecting the timestamps on the resource's managed fields, finding the most recent update, and comparing it to the current time (a sketch of this bookkeeping follows the notes below).

The histogram metrics are of the form `{kind}_informer_lag_ms_bucket_*`.

* We record a value only for updates, not for adds or deletes.  This is because when the controller starts up, it will populate its cache with an add for each resource in the cluster and the delta between the last updated time of that resource and the current time may be large.  This does not represent informer lag and should not be counted as such.
* When the informer performs resyncs, we get updates where the updated time of the old version is equal to the updated time of the new version.  This does not represent an actual update of the resource itself and so we do not record a value.
* Since we are comparing timestamps set on the managed fields of resources to the current time from the destination controller's system clock, the accuracy of these metrics depends on clock drift being minimal across the cluster.
* We use histogram buckets which range from 500ms to about 17 minutes.  In my testing, an informer lag of 500ms-1000ms is typical.  However, we wish to have enough buckets to identify cases where the informer is lagged significantly behind.
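A minimal Go sketch of this bookkeeping for one resource kind, assuming a Prometheus histogram whose exponential buckets double from 500 ms up to roughly 17 minutes (500 * 2^11 = 1,024,000 ms). The metric and helper names are illustrative, not the controller's actual code.

```go
package watcher

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

var serviceInformerLag = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "service_informer_lag_ms",
	Help:    "Delta between a Service's last update and its processing by the informer.",
	Buckets: prometheus.ExponentialBuckets(500, 2, 12), // 500ms .. ~17min
})

func init() { prometheus.MustRegister(serviceInformerLag) }

// latestUpdate returns the most recent Update timestamp recorded in the
// object's managed fields, or the zero time if none is present.
func latestUpdate(obj metav1.Object) time.Time {
	var latest time.Time
	for _, mf := range obj.GetManagedFields() {
		if mf.Operation == metav1.ManagedFieldsOperationUpdate && mf.Time != nil && mf.Time.Time.After(latest) {
			latest = mf.Time.Time
		}
	}
	return latest
}

// recordLag is wired to the informer's UpdateFunc only; adds are skipped
// because the initial cache sync would otherwise record spurious lag.
func recordLag(oldObj, newObj metav1.Object) {
	oldT, newT := latestUpdate(oldObj), latestUpdate(newObj)
	// Resyncs replay the same object; equal timestamps mean no real update.
	if newT.IsZero() || !newT.After(oldT) {
		return
	}
	// Accuracy depends on minimal clock drift between the kube-apiserver
	// and the destination controller.
	serviceInformerLag.Observe(float64(time.Since(newT).Milliseconds()))
}
```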

Signed-off-by: Alex Leong <[email protected]>
When the destination controller logs about receiving or sending messages to a data plane proxy, there is no information in the log about which data plane pod it is communicating with.  This can make it difficult to diagnose issues which span the data plane and control plane.

We add a `pod` field to the context token that proxies include in requests to the destination controller.  We add this pod name to the logging context so that it shows up in log messages.  In order to accomplish this, we had to plumb the logging context through a few places where it previously had not been passed.  This gives us a more complete logging context and more information in each log message.

An example log message with this fuller logging context is:

```
time="2023-10-24T00:14:09Z" level=debug msg="Sending destination add: add:{addrs:{addr:{ip:{ipv4:183762990}  port:8080}  weight:10000  metric_labels:{key:\"control_plane_ns\"  value:\"linkerd\"}  metric_labels:{key:\"deployment\"  value:\"voting\"}  metric_labels:{key:\"pod\"  value:\"voting-7475cb974c-2crt5\"}  metric_labels:{key:\"pod_template_hash\"  value:\"7475cb974c\"}  metric_labels:{key:\"serviceaccount\"  value:\"voting\"}  tls_identity:{dns_like_identity:{name:\"voting.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}}  protocol_hint:{h2:{}}}  metric_labels:{key:\"namespace\"  value:\"emojivoto\"}  metric_labels:{key:\"service\"  value:\"voting-svc\"}}" addr=":8086" component=endpoint-translator context-ns=emojivoto context-pod=web-767f4484fd-wmpvf remote="10.244.0.65:52786" service="voting-svc.emojivoto.svc.cluster.local:8080"
```

Note the `context-pod` field.

Additionally, we have tested this when no pod field is included in the context token (e.g. when handling requests from a pod which does not yet add this field) and confirmed that the `context-pod` log field is empty, but no errors occur.
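As an illustration of that best-effort behaviour, here is a minimal Go sketch of how a parsed context token could feed a logrus logging context. The `contextToken` type and helper are assumptions for the sketch, not the controller's actual types; an old proxy that omits `pod` simply produces an empty `context-pod` field.

```go
package destination

import (
	"encoding/json"

	logging "github.com/sirupsen/logrus"
)

// contextToken mirrors the JSON token proxies send on destination requests;
// the `pod` field is the addition described above.
type contextToken struct {
	Ns  string `json:"ns"`
	Pod string `json:"pod"`
}

// loggerForToken builds a logging context carrying the client pod's identity.
// An unparseable or absent token yields empty fields rather than an error.
func loggerForToken(base *logging.Entry, token string) *logging.Entry {
	var t contextToken
	_ = json.Unmarshal([]byte(token), &t) // best-effort: old proxies omit `pod`
	return base.WithFields(logging.Fields{
		"context-ns":  t.Ns,
		"context-pod": t.Pod,
	})
}
```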

Signed-off-by: Alex Leong <[email protected]>
#11491 changed the EndpointTranslator to use a queue to avoid calling `Send` on a gRPC stream directly from an informer callback goroutine.  This change updates the ProfileTranslator in the same way, adding a queue to ensure we do not block the informer thread.
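For illustration, a minimal Go sketch of this queueing pattern: the informer callback enqueues without blocking, and a dedicated goroutine performs the potentially slow `Send`. Type names, queue capacity, and overflow handling here are assumptions rather than the actual ProfileTranslator code.

```go
package destination

import (
	"sync"

	logging "github.com/sirupsen/logrus"
)

// profileUpdate stands in for the message sent on the gRPC stream.
type profileUpdate struct{ name string }

// profileSender abstracts the gRPC stream's Send method.
type profileSender interface{ Send(*profileUpdate) error }

const updateQueueCapacity = 100 // assumed; the real capacity may differ

// profileTranslator decouples informer callbacks from stream writes.
type profileTranslator struct {
	stream   profileSender
	updates  chan *profileUpdate
	done     chan struct{}
	doneOnce sync.Once
	log      *logging.Entry
}

func newProfileTranslator(stream profileSender, log *logging.Entry) *profileTranslator {
	t := &profileTranslator{
		stream:  stream,
		updates: make(chan *profileUpdate, updateQueueCapacity),
		done:    make(chan struct{}),
		log:     log,
	}
	go t.drain()
	return t
}

// Update runs on the informer goroutine. If the subscriber has stopped
// reading and the queue fills up, the stream is abandoned instead of
// blocking the shared informer thread.
func (t *profileTranslator) Update(u *profileUpdate) {
	select {
	case t.updates <- u:
	default:
		t.doneOnce.Do(func() {
			t.log.Error("profile update queue full; aborting stream")
			close(t.done)
		})
	}
}

// drain forwards queued updates onto the gRPC stream.
func (t *profileTranslator) drain() {
	for {
		select {
		case u := <-t.updates:
			if err := t.stream.Send(u); err != nil {
				t.log.WithError(err).Error("failed to send profile update")
				return
			}
		case <-t.done:
			return
		}
	}
}
```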

Signed-off-by: Alex Leong <[email protected]>
Signed-off-by: Alex Leong <[email protected]>
Signed-off-by: Alex Leong <[email protected]>
@adleong adleong requested a review from a team as a code owner November 16, 2023 00:08
Signed-off-by: Alex Leong <[email protected]>
@adleong adleong merged commit 41747e8 into release/stable-2.14 Nov 16, 2023
@adleong adleong deleted the alex/stable-2.14.4 branch November 16, 2023 23:14