
buffer: drive inner service to readiness when receiving a request #556

Merged
olix0r merged 5 commits into master-tokio-0.2 from eliza/fix-buffer on Jun 11, 2020

Conversation

@hawkw (Contributor) commented Jun 10, 2020

When `linkerd2-buffer` was updated to `std::future` in PR #505, the
behaviour of the buffer changed subtly. The previous implementation
of the buffer's `Dispatch` task was _poll-based_: its logic lived in an
implementation of `Future::poll` with the following behavior:

1. Call `poll_ready` on the underlying service, returning `NotReady` if
   it is not ready.
2. Broadcast readiness to senders.
3. Call `poll_next` on the channel of requests. If a request is
   received, dispatch it to the service. If no request is ready, return
   `NotReady` (yield).

Since this was an implementation of the `poll` function, if we yield
because the request channel is empty, then when we are woken again by
the next request, we resume _at the beginning of the `poll` function_.
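The old poll-based shape can be sketched with a minimal std-only model (the inner service and request channel are stubbed out as plain fields, and all names here are hypothetical; the real `Dispatch` drives a tower service over a tokio channel):

```rust
use std::task::Poll;

// Hypothetical, simplified model of the old poll-based Dispatch task.
struct Dispatch {
    service_ready: bool,
    queued: Vec<&'static str>,
}

impl Dispatch {
    // Every wakeup re-enters at the top, so readiness is re-checked
    // before any request is dispatched.
    fn poll(&mut self) -> Poll<&'static str> {
        // 1. Drive the inner service to readiness.
        if !self.service_ready {
            return Poll::Pending; // NotReady
        }
        // 2. (Readiness would be broadcast to senders here.)
        // 3. Pull the next request from the channel, if any.
        match self.queued.pop() {
            Some(req) => Poll::Ready(req), // dispatch to the service
            None => Poll::Pending,         // yield; next wakeup restarts at step 1
        }
    }
}
```

The key property is that yielding in step 3 and yielding in step 1 are indistinguishable to the caller: either way, the next `poll` starts over from the readiness check.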

The new implementation, however, was written using async/await syntax.
Async/await generates a state machine which, when woken after yielding
at an await point, resumes _from the same await point it yielded at_.
This means that if the new implementation yields because the request
channel is empty, then when it is woken by a request, it will **not**
drive the service to readiness before sending that request. Instead, the
readiness acquired before the task yielded is consumed by that request.
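A hand-rolled analogue of the state machine that async/await generates makes the difference visible (again a std-only sketch with hypothetical names; the service and channel are stubbed). The crucial detail is that after yielding while awaiting a request, the next wakeup resumes *in that state*, skipping the readiness check:

```rust
use std::task::Poll;

// Which await point the generated state machine is parked at.
#[derive(Clone, Copy)]
enum State {
    CheckReady,
    AwaitingRequest,
}

struct BuggyDispatch {
    state: State,
    service_ready: bool,
    queued: Vec<&'static str>,
}

impl BuggyDispatch {
    fn poll(&mut self) -> Poll<&'static str> {
        loop {
            match self.state {
                State::CheckReady => {
                    if !self.service_ready {
                        return Poll::Pending;
                    }
                    self.state = State::AwaitingRequest;
                }
                State::AwaitingRequest => match self.queued.pop() {
                    // Readiness acquired before the task parked is
                    // consumed here, even if it has since gone stale.
                    Some(req) => {
                        self.state = State::CheckReady;
                        return Poll::Ready(req);
                    }
                    // Yield here: the next wakeup re-enters THIS arm,
                    // not State::CheckReady.
                    None => return Poll::Pending,
                },
            }
        }
    }
}
```

Parking the task while ready, then invalidating readiness (e.g. a discovery update) before the next request arrives, shows the stale readiness being consumed anyway.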

This behavior is totally fine with regard to the `tower-service`
readiness contract. All the contract requires is that a call to
`poll_ready` return `Ready` before each call to `call`. It doesn't
matter how much time passes between `poll_ready` and `call`, as long as
the readiness was not consumed by another `call`.
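The ready-before-call rule can be rendered as a simplified, synchronous trait (the real tower `Service` trait is async and takes a `Context`; `SimpleService`, `Echo`, and `dispatch` are hypothetical names for this sketch):

```rust
use std::task::Poll;

// Simplified rendering of the tower readiness contract: a service may
// only be called after poll_ready has returned Ready.
trait SimpleService<Req> {
    type Resp;
    fn poll_ready(&mut self) -> Poll<()>;
    fn call(&mut self, req: Req) -> Self::Resp;
}

struct Echo;

impl SimpleService<&'static str> for Echo {
    type Resp = String;
    fn poll_ready(&mut self) -> Poll<()> {
        Poll::Ready(())
    }
    fn call(&mut self, req: &'static str) -> String {
        req.to_string()
    }
}

// The contract says nothing about how much time may pass between the
// Ready result and the call, only that the readiness is not reused.
fn dispatch<S: SimpleService<&'static str>>(svc: &mut S, req: &'static str) -> Option<S::Resp> {
    match svc.poll_ready() {
        Poll::Ready(()) => Some(svc.call(req)),
        Poll::Pending => None,
    }
}
```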

However, it is **not** fine from the perspective of the load balancer.
The load balancer relies on `poll_ready` to drive updates from service
discovery. This means that if a long period of time passes between when
the balancer becomes ready and when it is called, it may have stale
service discovery state. Therefore, this change in behavior broke a
large number of the proxy's integration tests that expect changes to
service discovery state to be reflected in a timely manner.

This commit fixes the issue by updating the new `dispatch::run`
implementation to drive the service to readiness immediately before
dispatching a request. Once the service has initially been driven to
readiness, we advertise that it is ready and call `try_recv` on the
request channel. If a request is already in the channel, we consume the
existing readiness. Otherwise, if no request is immediately available
and we have to wait on the channel, we drive the service to readiness
again before calling it.

This ensures that service discovery changes are reflected for the next
request after they occur, rather than for the request _after_ that
request.
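The fixed shape can be sketched using `std::sync::mpsc` as a stand-in for the tokio request channel and a closure for "drive the inner service to readiness" (both are simplifications; `dispatch_next` and `drive_to_ready` are hypothetical names):

```rust
use std::sync::mpsc;

// One iteration of the fixed dispatch loop: fresh readiness is only
// consumed by a request that is already queued; otherwise readiness is
// re-driven after waiting, so discovery updates that arrived while the
// task was parked are applied before dispatching.
fn dispatch_next(
    rx: &mpsc::Receiver<&'static str>,
    drive_to_ready: &mut dyn FnMut(),
) -> Option<&'static str> {
    // Drive the service to readiness before looking for a request.
    drive_to_ready();
    match rx.try_recv() {
        // Fast path: a request was already queued, so the readiness we
        // just acquired is fresh enough to consume.
        Ok(req) => Some(req),
        // Slow path: wait for a request, then drive the service to
        // readiness AGAIN before dispatching.
        Err(mpsc::TryRecvError::Empty) => match rx.recv() {
            Ok(req) => {
                drive_to_ready();
                Some(req)
            }
            Err(_) => None, // all senders dropped
        },
        Err(mpsc::TryRecvError::Disconnected) => None,
    }
}
```

The `try_recv` fast path is what preserves throughput: when requests are already buffered, the loop does not pay for a second readiness pass per request.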

Additionally, I've re-enabled the integration tests that were broken due
to this bug.

Signed-off-by: Eliza Weisman [email protected]

hawkw added 2 commits June 10, 2020 14:04
@hawkw hawkw requested review from a team and olix0r June 10, 2020 21:34
@hawkw hawkw self-assigned this Jun 10, 2020
@olix0r (Member) left a comment


summarizing conversation we just had

@hawkw hawkw requested a review from olix0r June 10, 2020 22:51
@hawkw hawkw requested a review from a team June 10, 2020 22:56
@olix0r olix0r merged commit 959b7df into master-tokio-0.2 Jun 11, 2020
@olix0r olix0r deleted the eliza/fix-buffer branch June 11, 2020 18:02
