
stable-2.14.4 #11618

Merged

adleong merged 8 commits into release/stable-2.14 from alex/stable-2.14.4 on Nov 16, 2023
Conversation


@adleong adleong commented Nov 16, 2023

This stable release improves observability for the control plane by adding
additional logging to the destination controller and by adding histograms which
can detect Kubernetes informer lag. It also adds the ability to configure
protocol detection timeouts.

  • Improved logging in the destination controller by adding the client pod's
    name to the logging context. This improves visibility into the messages the
    control plane sends to and receives from a specific proxy (#11532)
  • helm: Introduce configurable values for protocol detection (#11536)
  • Fixed an issue where the Destination controller could stop processing service
    profile updates if a proxy subscribed to those updates stops reading them;
    this is a follow-up to the issue [Add update queue to endpoint translator #11491] fixed in stable-2.14.2 (#11546)
  • In the Destination controller, added informer lag histogram metrics to detect
    when the Kubernetes objects watched by the controller fall behind the state
    in the kube-apiserver (#11534)
  • proxy: Fix grpc_status metric labels for inbound traffic

mateiidavid and others added 7 commits November 15, 2023 23:37
This change allows users to configure protocol detection timeout values
(outbound and inbound). Certain environments may find that protocol
detection inhibits debugging and makes it harder to reason about a
client's behaviour. In such cases (and others) it may be desirable to
raise the protocol detection timeout above the default of 10s.

Through this change, users may configure their timeout values either
with install-time settings or through annotations; this follows our
usual proxy configuration model. The proxy uses different timeout values
for the inbound and outbound stacks (even though they use the same
default value) and this change respects that by adding two separate
fields.
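To make the configuration model concrete, here is a minimal Go sketch of how a per-workload annotation override might resolve against an install-time Helm value and the proxy's built-in 10s default. The annotation keys, Helm plumbing, and function names below are illustrative assumptions, not the names defined by this PR.

```go
package main

import "fmt"

// Hypothetical annotation keys; the PR defines the real ones.
const (
	inboundDetectAnnotation  = "config.linkerd.io/proxy-inbound-detect-timeout"  // assumed
	outboundDetectAnnotation = "config.linkerd.io/proxy-outbound-detect-timeout" // assumed
	defaultDetectTimeout     = "10s"                                             // the proxy's default
)

// detectTimeouts resolves the inbound/outbound protocol detection timeouts
// for a pod: a workload annotation overrides the install-time (Helm) value,
// which in turn overrides the proxy's built-in default.
func detectTimeouts(annotations map[string]string, helmInbound, helmOutbound string) (inbound, outbound string) {
	resolve := func(key, helmValue string) string {
		if v, ok := annotations[key]; ok && v != "" {
			return v // per-workload override
		}
		if helmValue != "" {
			return helmValue // install-time setting
		}
		return defaultDetectTimeout
	}
	return resolve(inboundDetectAnnotation, helmInbound), resolve(outboundDetectAnnotation, helmOutbound)
}

func main() {
	in, out := detectTimeouts(map[string]string{outboundDetectAnnotation: "30s"}, "", "15s")
	fmt.Println(in, out) // 15s 30s
}
```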

Signed-off-by: Matei David <[email protected]>
Before an enormous value was used to disable protocol detection, the
field was meant to be configurable. In that refactor, the annotation name
stayed the same instead of reflecting the change in the contract (i.e.
no longer configurable, only toggled). Additionally, there were two typos
in the proxy partials.

Signed-off-by: Matei David <[email protected]>
In order to detect if the destination controller's k8s informers have fallen behind, we add a histogram for each resource type.  These histograms track the delta between when an update to a resource occurs and when the destination controller processes that update.  We do this by inspecting the timestamps on the resource's managed fields, finding the most recent update, and comparing it to the current time (a sketch of this bookkeeping follows the notes below).

The histogram metrics are of the form `{kind}_informer_lag_ms_bucket_*`.

* We record a value only for updates, not for adds or deletes.  This is because when the controller starts up, it will populate its cache with an add for each resource in the cluster and the delta between the last updated time of that resource and the current time may be large.  This does not represent informer lag and should not be counted as such.
* When the informer performs resyncs, we get updates where the updated time of the old version is equal to the updated time of the new version.  This does not represent an actual update of the resource itself and so we do not record a value.
* Since we are comparing timestamps set on the managed fields of resources to the current time from the destination controller's system clock, the accuracy of these metrics depends on clock drift being minimal across the cluster.
* We use histogram buckets which range from 500ms to about 17 minutes.  In my testing, an informer lag of 500ms-1000ms is typical.  However, we wish to have enough buckets to identify cases where the informer is lagged significantly behind.
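A minimal Go sketch of this bookkeeping for one resource kind, assuming a Prometheus histogram whose exponential buckets double from 500 ms up to roughly 17 minutes (500 * 2^11 = 1,024,000 ms). The metric and helper names are illustrative, not the controller's actual code.

```go
package watcher

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

var serviceInformerLag = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "service_informer_lag_ms",
	Help:    "Delta between a Service's last update and its processing by the informer.",
	Buckets: prometheus.ExponentialBuckets(500, 2, 12), // 500ms .. ~17min
})

func init() { prometheus.MustRegister(serviceInformerLag) }

// latestUpdate returns the most recent Update timestamp recorded in the
// object's managed fields, or the zero time if none is present.
func latestUpdate(obj metav1.Object) time.Time {
	var latest time.Time
	for _, mf := range obj.GetManagedFields() {
		if mf.Operation == metav1.ManagedFieldsOperationUpdate && mf.Time != nil && mf.Time.Time.After(latest) {
			latest = mf.Time.Time
		}
	}
	return latest
}

// recordLag is wired to the informer's UpdateFunc only; adds are skipped
// because the initial cache sync would otherwise record spurious lag.
func recordLag(oldObj, newObj metav1.Object) {
	oldT, newT := latestUpdate(oldObj), latestUpdate(newObj)
	// Resyncs replay the same object; equal timestamps mean no real update.
	if newT.IsZero() || !newT.After(oldT) {
		return
	}
	// Accuracy depends on minimal clock drift between the kube-apiserver
	// and the destination controller.
	serviceInformerLag.Observe(float64(time.Since(newT).Milliseconds()))
}
```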

Signed-off-by: Alex Leong <[email protected]>
When the destination controller logs about receiving or sending messages to a data plane proxy, there is no information in the log about which data plane pod it is communicating with.  This can make it difficult to diagnose issues which span the data plane and control plane.

We add a `pod` field to the context token that proxies include in requests to the destination controller.  We add this pod name to the logging context so that it shows up in log messages.  In order to accomplish this, we had to plumb the logging context through a few places where it previously had not been passed.  This gives us a more complete logging context and more information in each log message.

An example log message with this fuller logging context is:

```
time="2023-10-24T00:14:09Z" level=debug msg="Sending destination add: add:{addrs:{addr:{ip:{ipv4:183762990}  port:8080}  weight:10000  metric_labels:{key:\"control_plane_ns\"  value:\"linkerd\"}  metric_labels:{key:\"deployment\"  value:\"voting\"}  metric_labels:{key:\"pod\"  value:\"voting-7475cb974c-2crt5\"}  metric_labels:{key:\"pod_template_hash\"  value:\"7475cb974c\"}  metric_labels:{key:\"serviceaccount\"  value:\"voting\"}  tls_identity:{dns_like_identity:{name:\"voting.emojivoto.serviceaccount.identity.linkerd.cluster.local\"}}  protocol_hint:{h2:{}}}  metric_labels:{key:\"namespace\"  value:\"emojivoto\"}  metric_labels:{key:\"service\"  value:\"voting-svc\"}}" addr=":8086" component=endpoint-translator context-ns=emojivoto context-pod=web-767f4484fd-wmpvf remote="10.244.0.65:52786" service="voting-svc.emojivoto.svc.cluster.local:8080"
```

Note the `context-pod` field.

Additionally, we have tested this when no pod field is included in the context token (e.g. when handling requests from a pod which does not yet add this field) and confirmed that the `context-pod` log field is empty, but no errors occur.
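As an illustration of that best-effort behaviour, here is a minimal Go sketch of how a parsed context token could feed a logrus logging context. The `contextToken` type and helper are assumptions for the sketch, not the controller's actual types; an old proxy that omits `pod` simply produces an empty `context-pod` field.

```go
package destination

import (
	"encoding/json"

	logging "github.com/sirupsen/logrus"
)

// contextToken mirrors the JSON token proxies send on destination requests;
// the `pod` field is the addition described above.
type contextToken struct {
	Ns  string `json:"ns"`
	Pod string `json:"pod"`
}

// loggerForToken builds a logging context carrying the client pod's identity.
// An unparseable or absent token yields empty fields rather than an error.
func loggerForToken(base *logging.Entry, token string) *logging.Entry {
	var t contextToken
	_ = json.Unmarshal([]byte(token), &t) // best-effort: old proxies omit `pod`
	return base.WithFields(logging.Fields{
		"context-ns":  t.Ns,
		"context-pod": t.Pod,
	})
}
```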

Signed-off-by: Alex Leong <[email protected]>
#11491 changed the EndpointTranslator to use a queue to avoid calling `Send` on a gRPC stream directly from an informer callback goroutine.  This change updates the ProfileTranslator in the same way, adding a queue to ensure we do not block the informer thread.
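For illustration, a minimal Go sketch of this queueing pattern: the informer callback enqueues without blocking, and a dedicated goroutine performs the potentially slow `Send`. Type names, queue capacity, and overflow handling here are assumptions rather than the actual ProfileTranslator code.

```go
package destination

import (
	"sync"

	logging "github.com/sirupsen/logrus"
)

// profileUpdate stands in for the message sent on the gRPC stream.
type profileUpdate struct{ name string }

// profileSender abstracts the gRPC stream's Send method.
type profileSender interface{ Send(*profileUpdate) error }

const updateQueueCapacity = 100 // assumed; the real capacity may differ

// profileTranslator decouples informer callbacks from stream writes.
type profileTranslator struct {
	stream   profileSender
	updates  chan *profileUpdate
	done     chan struct{}
	doneOnce sync.Once
	log      *logging.Entry
}

func newProfileTranslator(stream profileSender, log *logging.Entry) *profileTranslator {
	t := &profileTranslator{
		stream:  stream,
		updates: make(chan *profileUpdate, updateQueueCapacity),
		done:    make(chan struct{}),
		log:     log,
	}
	go t.drain()
	return t
}

// Update runs on the informer goroutine. If the subscriber has stopped
// reading and the queue fills up, the stream is abandoned instead of
// blocking the shared informer thread.
func (t *profileTranslator) Update(u *profileUpdate) {
	select {
	case t.updates <- u:
	default:
		t.doneOnce.Do(func() {
			t.log.Error("profile update queue full; aborting stream")
			close(t.done)
		})
	}
}

// drain forwards queued updates onto the gRPC stream.
func (t *profileTranslator) drain() {
	for {
		select {
		case u := <-t.updates:
			if err := t.stream.Send(u); err != nil {
				t.log.WithError(err).Error("failed to send profile update")
				return
			}
		case <-t.done:
			return
		}
	}
}
```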

Signed-off-by: Alex Leong <[email protected]>
Signed-off-by: Alex Leong <[email protected]>
Signed-off-by: Alex Leong <[email protected]>
@adleong adleong requested a review from a team as a code owner November 16, 2023 00:08
Signed-off-by: Alex Leong <[email protected]>
@adleong adleong merged commit 41747e8 into release/stable-2.14 Nov 16, 2023
@adleong adleong deleted the alex/stable-2.14.4 branch November 16, 2023 23:14