Skip to content

gate: Fix readiness deadlock#2493

Merged
olix0r merged 3 commits intomainfrom
ver/gatecrash
Nov 1, 2023
Merged

gate: Fix readiness deadlock#2493
olix0r merged 3 commits intomainfrom
ver/gatecrash

Conversation

@olix0r
Copy link
Member

@olix0r olix0r commented Oct 26, 2023

The gate middleware controls a service's readiness so that it can exert backpressure. This is used, for instance, by the circuit breaker module so that an endpoint can go into an unavailable state after the breaker has been tripped and be marked available again as it recovers.

This change fixes a bug in that recovery scenario: when the gate is in a Limited state (i.e. when the circuit breaker puts an endpoint into Probation to test its availability), and caller (i.e. the balancer) is waiting for the endpoint to leave probation, the balancer may never be notified that the endpoint has left its probation state.

To fix this, we update the gate controller to definitively close its inner Semaphore when transitioning out of a limited state -- dropping the semaphore in the sender doesn't close it when it's being held by a receiver. This change motivates an update to the gate::Tx API so that it always produces a Semaphore when transitioning to the limited state (rather than take a shared semaphore as a caller).

This issue is somewhat masked by the balancer's polling behavior, where endpoint states are only advanced as requests are processed. It seems likely, however, that this scenario could be encountered in the wild when circuit breaking is enabled on a service.

The gate middleware controls a service's readiness so that it can exert
backpressure. This is used, for instance, by the circuit breaker module
so that an endpoint can go into an unavailable state after the breaker
has been tripped and be marked available again as it recovers.

This change fixes a bug in that recovery scenario: when the gate is in a
Limited state (i.e. when the circuit breaker puts an endpoint into
Probation to test its availability), and caller (i.e. the balancer) is
waiting for the endpoint to leave probation, the balancer may never be
notified that the endpoint has left its probation state.

To fix this, we update the gate controller to definitively close its
inner Semaphore when transitioning out of a limited state -- dropping
the semaphore in the sender doesn't close it when it's being held by a
receiver.

This issue is somewhat masked by the balancer's polling behavior, where
endpoint states are only advanced as requests are processed. It seems
likely, however, that this scenario could be encountered in the wild
when circuit breaking is enabled on a service.
@olix0r olix0r requested a review from a team as a code owner October 26, 2023 15:11
@olix0r olix0r marked this pull request as draft October 26, 2023 15:22
@hawkw

This comment was marked as resolved.

@olix0r olix0r marked this pull request as ready for review October 26, 2023 18:47
@hawkw hawkw self-requested a review October 26, 2023 19:04
Copy link
Contributor

@hawkw hawkw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks correct to me, and the change is pretty straightforward! New tests look good, too. It might be nice to document the behavior around closing the semaphore, but no blockers from me.

@olix0r olix0r merged commit 6999dac into main Nov 1, 2023
@olix0r olix0r deleted the ver/gatecrash branch November 1, 2023 19:30
olix0r added a commit to linkerd/linkerd2 that referenced this pull request Nov 2, 2023
This release includes several bugfixes. Notably, inbound proxies would
not properly reflect grpc-status in metrics by default.

Furthermore, proxies now long warnings when they receive unexpected
error responses from the control plane.

---

* chore: change `rust-toolchain` file to toml format (linkerd/linkerd2-proxy#2487)
* gate: Detect disconnected inner services in readiness (linkerd/linkerd2-proxy#2491)
* Bump ahash to v0.8.5 (linkerd/linkerd2-proxy#2498)
* gate: Fix readiness deadlock (linkerd/linkerd2-proxy#2493)
* Log a warning when the controller clients receive an error (linkerd/linkerd2-proxy#2499)
* inbound: Fix gRPC response classification (linkerd/linkerd2-proxy#2496)

Signed-off-by: Oliver Gould <[email protected]>
olix0r added a commit to linkerd/linkerd2 that referenced this pull request Nov 2, 2023
This release includes several bugfixes. Notably, inbound proxies would
not properly reflect grpc-status in metrics by default.

Furthermore, proxies now long warnings when they receive unexpected
error responses from the control plane.

---

* chore: change `rust-toolchain` file to toml format (linkerd/linkerd2-proxy#2487)
* gate: Detect disconnected inner services in readiness (linkerd/linkerd2-proxy#2491)
* Bump ahash to v0.8.5 (linkerd/linkerd2-proxy#2498)
* gate: Fix readiness deadlock (linkerd/linkerd2-proxy#2493)
* Log a warning when the controller clients receive an error (linkerd/linkerd2-proxy#2499)
* inbound: Fix gRPC response classification (linkerd/linkerd2-proxy#2496)

Signed-off-by: Oliver Gould <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants