Conversation
The gate middleware controls a service's readiness so that it can exert backpressure. This is used, for instance, by the circuit breaker module so that an endpoint can go into an unavailable state after the breaker has been tripped and be marked available again as it recovers. This change fixes a bug in that recovery scenario: when the gate is in a Limited state (i.e. when the circuit breaker puts an endpoint into Probation to test its availability), and caller (i.e. the balancer) is waiting for the endpoint to leave probation, the balancer may never be notified that the endpoint has left its probation state. To fix this, we update the gate controller to definitively close its inner Semaphore when transitioning out of a limited state -- dropping the semaphore in the sender doesn't close it when it's being held by a receiver. This issue is somewhat masked by the balancer's polling behavior, where endpoint states are only advanced as requests are processed. It seems likely, however, that this scenario could be encountered in the wild when circuit breaking is enabled on a service.
This comment was marked as resolved.
This comment was marked as resolved.
hawkw
approved these changes
Oct 30, 2023
Contributor
hawkw
left a comment
There was a problem hiding this comment.
This looks correct to me, and the change is pretty straightforward! New tests look good, too. It might be nice to document the behavior around closing the semaphore, but no blockers from me.
hawkw
approved these changes
Nov 1, 2023
olix0r
added a commit
to linkerd/linkerd2
that referenced
this pull request
Nov 2, 2023
This release includes several bugfixes. Notably, inbound proxies would not properly reflect grpc-status in metrics by default. Furthermore, proxies now long warnings when they receive unexpected error responses from the control plane. --- * chore: change `rust-toolchain` file to toml format (linkerd/linkerd2-proxy#2487) * gate: Detect disconnected inner services in readiness (linkerd/linkerd2-proxy#2491) * Bump ahash to v0.8.5 (linkerd/linkerd2-proxy#2498) * gate: Fix readiness deadlock (linkerd/linkerd2-proxy#2493) * Log a warning when the controller clients receive an error (linkerd/linkerd2-proxy#2499) * inbound: Fix gRPC response classification (linkerd/linkerd2-proxy#2496) Signed-off-by: Oliver Gould <[email protected]>
olix0r
added a commit
to linkerd/linkerd2
that referenced
this pull request
Nov 2, 2023
This release includes several bugfixes. Notably, inbound proxies would not properly reflect grpc-status in metrics by default. Furthermore, proxies now long warnings when they receive unexpected error responses from the control plane. --- * chore: change `rust-toolchain` file to toml format (linkerd/linkerd2-proxy#2487) * gate: Detect disconnected inner services in readiness (linkerd/linkerd2-proxy#2491) * Bump ahash to v0.8.5 (linkerd/linkerd2-proxy#2498) * gate: Fix readiness deadlock (linkerd/linkerd2-proxy#2493) * Log a warning when the controller clients receive an error (linkerd/linkerd2-proxy#2499) * inbound: Fix gRPC response classification (linkerd/linkerd2-proxy#2496) Signed-off-by: Oliver Gould <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The gate middleware controls a service's readiness so that it can exert backpressure. This is used, for instance, by the circuit breaker module so that an endpoint can go into an unavailable state after the breaker has been tripped and be marked available again as it recovers.
This change fixes a bug in that recovery scenario: when the gate is in a Limited state (i.e. when the circuit breaker puts an endpoint into Probation to test its availability), and caller (i.e. the balancer) is waiting for the endpoint to leave probation, the balancer may never be notified that the endpoint has left its probation state.
To fix this, we update the gate controller to definitively close its inner Semaphore when transitioning out of a limited state -- dropping the semaphore in the sender doesn't close it when it's being held by a receiver. This change motivates an update to the gate::Tx API so that it always produces a Semaphore when transitioning to the limited state (rather than take a shared semaphore as a caller).
This issue is somewhat masked by the balancer's polling behavior, where endpoint states are only advanced as requests are processed. It seems likely, however, that this scenario could be encountered in the wild when circuit breaking is enabled on a service.