Add update queue to endpoint translator#11491

Merged

adleong merged 7 commits into main

Oct 18, 2023

Member

adleong commented Oct 16, 2023 (edited)

When a gRPC client of the destination.Get API initiates a request but then doesn't read from that stream, the HTTP/2 stream flow control window fills up and eventually exerts backpressure on the destination controller. This manifests as calls to Send on the stream blocking. Since Send is called synchronously from the client-go informer callback (by way of the endpoint translator), this blocks the informer callback and prevents all further informer callbacks from firing. This causes the destination controller to stop sending updates to any of its clients.
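The failure mode can be illustrated with a minimal sketch. It assumes a simplified model in which one goroutine delivers each event to every listener in turn; `listener`, `dispatch`, `receivedWithin`, and `shortWait` are hypothetical names used only for illustration and are not taken from the controller or from client-go.

```go
package main

import (
	"fmt"
	"time"
)

// shortWait is how long the demo waits before concluding nothing arrived.
const shortWait = 50 * time.Millisecond

// listener stands in for an endpoint translator receiving updates (hypothetical).
type listener func(event string)

// dispatch delivers every event to every listener, one call at a time, the
// way a single callback goroutine would.
func dispatch(events []string, listeners []listener) {
	for _, ev := range events {
		for _, l := range listeners {
			l(ev) // a listener that blocks here stalls everyone after it
		}
	}
}

// receivedWithin reports whether a value arrives on ch before d elapses.
func receivedWithin(ch <-chan string, d time.Duration) bool {
	select {
	case <-ch:
		return true
	case <-time.After(d):
		return false
	}
}

func main() {
	release := make(chan struct{})
	healthy := make(chan string, 1)

	stalled := func(string) { <-release }        // client that stopped reading
	reading := func(ev string) { healthy <- ev } // client that keeps reading

	go dispatch([]string{"endpoints changed"}, []listener{stalled, reading})

	fmt.Println("healthy client updated while the other is stalled:",
		receivedWithin(healthy, shortWait))

	close(release) // the stalled call finally returns
	fmt.Println("healthy client updated after release:",
		receivedWithin(healthy, time.Second))
}
```

The healthy listener is only reached once the stalled call returns, which mirrors how one unread stream can hold up updates for every other client.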

We add a queue in the endpoint translator so that when it gets an update from the informer callback, that update is queued and we avoid potentially blocking the informer callback. Each endpoint translator spawns a goroutine to process this queue and call Send. If there is not capacity in this queue (e.g. because a client has stopped reading and we are experiencing backpressure) then we terminate the stream.
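The approach described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: `update`, `sender`, `queuedTranslator`, `enqueue`, `run`, and `stalledStream` are hypothetical names, and the overflow signal is modeled with a one-slot channel rather than whatever mechanism the real endpoint translator uses.

```go
package main

import "fmt"

// update stands in for the message sent on the stream (hypothetical type).
type update struct{ addr string }

// sender is the only part of the gRPC server stream this sketch relies on.
type sender interface {
	Send(update) error
}

// queuedTranslator buffers updates between the informer callback and Send.
type queuedTranslator struct {
	stream    sender
	updates   chan update   // bounded queue filled by the informer callback
	endStream chan struct{} // receives one signal when the queue overflows
	stop      chan struct{} // closed to shut down the sending goroutine
}

func newQueuedTranslator(s sender, capacity int) *queuedTranslator {
	return &queuedTranslator{
		stream:    s,
		updates:   make(chan update, capacity),
		endStream: make(chan struct{}, 1),
		stop:      make(chan struct{}),
	}
}

// enqueue is what the informer callback would call. It never blocks: if the
// queue is full it requests that the stream be ended and drops the update.
func (t *queuedTranslator) enqueue(u update) bool {
	select {
	case t.updates <- u:
		return true
	default:
		select {
		case t.endStream <- struct{}{}:
		default: // termination was already requested
		}
		return false
	}
}

// run drains the queue on its own goroutine, so only this goroutine can be
// blocked by a client that has stopped reading.
func (t *queuedTranslator) run() {
	for {
		select {
		case u := <-t.updates:
			if err := t.stream.Send(u); err != nil {
				return
			}
		case <-t.stop:
			return
		}
	}
}

// stalledStream simulates a client that never reads: Send blocks until
// release is closed.
type stalledStream struct{ release chan struct{} }

func (s stalledStream) Send(update) error {
	<-s.release
	return nil
}

func main() {
	stream := stalledStream{release: make(chan struct{})}
	t := newQueuedTranslator(stream, 2)
	go t.run()

	// The "informer callback" keeps returning promptly even though Send is stuck.
	for i := 0; i < 10; i++ {
		t.enqueue(update{addr: fmt.Sprintf("10.0.0.%d", i)})
	}

	select {
	case <-t.endStream:
		fmt.Println("queue full; ending stream")
	default:
		fmt.Println("queue still has capacity")
	}
	close(t.stop)
	close(stream.release)
}
```

The non-blocking `select` with a `default` branch is what keeps the callback from ever waiting on a slow client; the cost is that a client that falls too far behind has its stream ended instead of being served indefinitely.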

adleong added 2 commits

October 16, 2023 21:14


          Add update queue to endpoint translator

7ce5395

Signed-off-by: Alex Leong <[email protected]>


          Update tests

c51b0db

Signed-off-by: Alex Leong <[email protected]>

alpeb reviewed

controller/api/destination/endpoint_translator.go (resolved)


          remove unused mutex and fix tests

f273074

Signed-off-by: Alex Leong <[email protected]>

adleong marked this pull request as ready for review

October 17, 2023 20:49

adleong requested a review from a team as a code owner

October 17, 2023 20:49

adleong changed the title ~~[DNM] Add update queue to endpoint translator~~ Add update queue to endpoint translator

DavidMcLaughlin reviewed

controller/api/destination/endpoint_translator.go Outdated

+              			// The endStream channel has already been closed so no action is
+              			// necessary.
+              		default:
+              			et.log.Error("endpoint update queue full; ending stream")

Contributor

DavidMcLaughlin Oct 18, 2023

What information do we have about the stream here? Can we log additional diagnostic information? E.g. the address of the proxy.

Can we also explain what we expect to happen when the stream is ended? E.g. is this something the user needs to come to us with an issue on, or do they just need to know that the proxy should reconnect organically?

DavidMcLaughlin reviewed

controller/api/destination/endpoint_translator.go

+              	et.log.Debugf("Sending destination no endpoints: %+v", u)
+              	if err := et.stream.Send(u); err != nil {
+              		et.log.Debugf("Failed to send address update: %s", err)

Contributor

DavidMcLaughlin Oct 18, 2023

I know this is previous code that we've moved, but why would an error be logged at Debug level?

Member

olix0r Oct 18, 2023

We expect to see this every time a client proxy restarts/closes a stream.

mateiidavid reviewed

controller/api/destination/endpoint_translator.go (outdated, resolved)

controller/api/destination/server.go (outdated, resolved)

controller/api/destination/endpoint_translator.go (resolved)


          Fix typos and words

8f012a0

Signed-off-by: Alex Leong <[email protected]>

mateiidavid approved these changes

adleong added 3 commits

October 18, 2023 18:20


          Fix race condition in test

654643b

Signed-off-by: Alex Leong <[email protected]>


          Remove unused test helper

d63cef5

Signed-off-by: Alex Leong <[email protected]>


          More test fixes

421e333

Signed-off-by: Alex Leong <[email protected]>

olix0r approved these changes

adleong merged commit 357a1d3 into main

adleong deleted the alex/queuwu branch

October 18, 2023 19:34

mateiidavid pushed a commit that referenced this pull request


          Add update queue to endpoint translator (#11491)

12e7b86

When a gRPC client of the destination.Get API initiates a request but then doesn't read from that stream, the HTTP/2 stream flow control window fills up and eventually exerts backpressure on the destination controller.  This manifests as calls to `Send` on the stream blocking.  Since `Send` is called synchronously from the client-go informer callback (by way of the endpoint translator), this blocks the informer callback and prevents all further informer callbacks from firing.  This causes the destination controller to stop sending updates to any of its clients.

We add a queue in the endpoint translator so that when it gets an update from the informer callback, that update is queued and we avoid potentially blocking the informer callback.  Each endpoint translator spawns a goroutine to process this queue and call `Send`.  If there is not capacity in this queue (e.g. because a client has stopped reading and we are experiencing backpressure) then we terminate the stream.

Signed-off-by: Alex Leong <[email protected]>

mateiidavid mentioned this pull request

stable-2.14.2 #11539

Merged

adleong mentioned this pull request

Add queuing to profile translator #11546

Merged

alpeb pushed a commit that referenced this pull request


          Add queuing to profile translator (#11546)

71635cb

#11491 changed the EndpointTranslator to use a queue to avoid calling `Send` on a gRPC stream directly from an informer callback goroutine.  This change updates the ProfileTranslator in the same way, adding a queue to ensure we do not block the informer thread.

Signed-off-by: Alex Leong <[email protected]>

alpeb added a commit that referenced this pull request


          Change notes for edge-23.11.2

a0cedb7

## edge-23.11.2

This edge release contains observability improvements and bug fixes to the
Destination controller, and a refinement to the multicluster gateway resolution
logic.

* Fixed an issue where the Destination controller could stop processing service
  profile updates, if a proxy subscribed to those updates stops reading them;
  this is a followup to the issue [#11491] fixed in edge-23.10.3 ([#11546])
* In the Destination controller, added informer lag histogram metrics to track
  whenever the objects tracked are falling behind the state in the
  kube-apiserver ([#11534])
* In the multicluster service mirror, extended the target gateway resolution
  logic to take into account all the possible IPs a hostname might resolve to,
  not just the first one (thanks @MrFreezeex!) ([#11499])
* Added probes to the debug container to appease environments requiring probes
  for all containers ([#11308])

alpeb mentioned this pull request

Change notes for edge-23.11.2 #11600

Merged

alpeb added a commit that referenced this pull request


          Change notes for edge-23.11.2 (#11600)

4018b2f

## edge-23.11.2

This edge release contains observability improvements and bug fixes to the
Destination controller, and a refinement to the multicluster gateway resolution
logic.

* Fixed an issue where the Destination controller could stop processing service
  profile updates, if a proxy subscribed to those updates stops reading them;
  this is a followup to the issue [#11491] fixed in [edge-23.10.3] ([#11546])
* In the Destination controller, added informer lag histogram metrics to track
  whenever the Kubernetes objects watched by the controller are falling behind
  the state in the kube-apiserver ([#11534])
* In the multicluster service mirror, extended the target gateway resolution
  logic to take into account all the possible IPs a hostname might resolve to,
  rather than just the first one (thanks @MrFreezeex!) ([#11499])
* Added probes to the debug container to appease environments requiring probes
  for all containers ([#11308])

[edge-23.10.3]: https://github.com/linkerd/linkerd2/releases/tag/edge-23.10.3
[#11546]: #11546
[#11534]: #11534
[#11499]: #11499
[#11308]: #11308

adleong added a commit that referenced this pull request


          Add queuing to profile translator (#11546)

1d32bd9

#11491 changed the EndpointTranslator to use a queue to avoid calling `Send` on a gRPC stream directly from an informer callback goroutine.  This change updates the ProfileTranslator in the same way, adding a queue to ensure we do not block the informer thread.

Signed-off-by: Alex Leong <[email protected]>

adleong mentioned this pull request

stable-2.14.4 #11618

Merged

adleong added a commit that referenced this pull request


          stable-2.14.4 (#11618)

41747e8

This stable release improves observability for the control plane by adding
additional logging to the destination controller and by adding histograms which
can detect Kubernetes informer lag. It also adds the ability to configure
protocol detection.

* Improved logging in the destination controller by adding the client pod's
  name to the logging context. This will improve visibility into the messages
  sent and received by the control plane from a specific proxy ([#11532])
* helm: Introduce configurable values for protocol detection ([#11536])
* Fixed an issue where the Destination controller could stop processing service
  profile updates, if a proxy subscribed to those updates stops reading them;
  this is a followup to the issue [#11491] fixed in [stable-2.14.2] ([#11546])
* In the Destination controller, added informer lag histogram metrics to track
  whenever the Kubernetes objects watched by the controller are falling behind
  the state in the kube-apiserver ([#11534])
* proxy: Fix grpc_status metric labels for inbound traffic

[stable-2.14.2]: https://github.com/linkerd/linkerd2/releases/tag/stable-2.14.2
[#11532]: #11532
[#11536]: #11536
[#11546]: #11546
[#11534]: #11534

---------

Signed-off-by: Matei David <[email protected]>
Signed-off-by: Alex Leong <[email protected]>
Co-authored-by: Matei David <[email protected]>

alpeb mentioned this pull request

Seeing errors/panics when trying to upgrade past 2.14.1 #12010

Closed
