Skip to content

retry policy: make retry policy pluggable#10099

Closed
yxue wants to merge 31 commits intoenvoyproxy:masterfrom
yxue:retry1
Closed

retry policy: make retry policy pluggable#10099
yxue wants to merge 31 commits intoenvoyproxy:masterfrom
yxue:retry1

Conversation

@yxue
Copy link
Copy Markdown
Member

@yxue yxue commented Feb 19, 2020

Description: Extract functions from retry_state_impl into pluggable class to make retry policy configurable and support composite retry policy.
Risk Level: Med
Testing: Unit test
Docs Changes: N/A
Release Notes: added :ref:typed_config <envoy_api_field_route.RetryPolicy.typed_config> to support pluggable retry policy.
Fixes #Issue: #9946

@repokitteh-read-only
Copy link
Copy Markdown

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/.

🐱

Caused by: #10099 was opened by yxue.

see: more, trace.

@snowp
Copy link
Copy Markdown
Contributor

snowp commented Feb 20, 2020

I think I was imagining the scope of this extension to be a bit bigger, eventually making it able to express the entire current RetryPolicy, including managing retry attempts if desirable (e.g. someone might want certain failures to not count against the retry count).

How about something that looks like this:

class RetryPolicyFactory {
  RetryPolicySharedPtr createRetryPolicy(const Http::HeaderMap& response_headers, <config>);
};

class RetryPolicy {
  // Called when an upstream request has been completed with headers.
  void recordResponseHeaders(const Http::HeaderMap& response_headers);
  // Called when an upstream request failed due to a reset.
  void recordReset(ResetReason reset);

  bool shouldRetry() const; 
  
  // Might need something like this to support the existing hedging behavior, can probably 
  // skip for now and not call this during the hedging path.
  bool wouldRetry(const Http::HeaderMap& response_headers) const;
};

With this kind of interface its now capable of tracking the lifetime of multiple requests, so retry policies that might want to do dynamic retry count based on what happened during the attempts could be implemented.

The key thing to notice is that the router might call it multiple times to determine whether something should or would have been be retried (in order to accept one of multiple hedged requests), so splitting up recording a response with the decision makes it possible to reason about the individual attempts.

Then we can probably update the shouldRetryX in RetryStateImpl to something like

RetryStatus RetryStateImpl::shouldRetryHeaders(const Http::HeaderMap& response_headers,
                                               DoRetryCallback callback) {
  retry_policy_->recordResponseHeaders(response_headers);
  return shouldRetry(retry_policy_.shouldRetry() || wouldRetryFromHeaders(response_headers), callback);
}

My envisioned end goal is basically to be able to express the entire current RetryPolicy as a combination of such extensions, reducing the complexity of the default retry handling while making it more expressive for power users. Hopefully this makes sense!

@mattklein123 @alyssawilk might have thoughts as well

@yxue
Copy link
Copy Markdown
Member Author

yxue commented Feb 20, 2020

I think I was imagining the scope of this extension to be a bit bigger, eventually making it able to express the entire current RetryPolicy, including managing retry attempts if desirable (e.g. someone might want certain failures to not count against the retry count).

That totally makes sense. I misunderstood and thought that we'd like to have another retry plugin. I agree with the envision to make the whole retry policy pluggable.

How about something that looks like this:

class RetryPolicyFactory {
  RetryPolicySharedPtr createRetryPolicy(const Http::HeaderMap& response_headers, <config>);
};

class RetryPolicy {
  // Called when an upstream request has been completed with headers.
  void recordResponseHeaders(const Http::HeaderMap& response_headers);
  // Called when an upstream request failed due to a reset.
  void recordReset(ResetReason reset);

  bool shouldRetry() const; 
  
  // Might need something like this to support the existing hedging behavior, can probably 
  // skip for now and not call this during the hedging path.
  bool wouldRetry(const Http::HeaderMap& response_headers) const;
};

With this kind of interface its now capable of tracking the lifetime of multiple requests, so retry policies that might want to do dynamic retry count based on what happened during the attempts could be implemented.

I didn't get the meaning of tracking the lifetime of multiple requests. My understanding is the pluggable retry policy will be initialized in the RetryStateImpl which is bound to one request. Do you mean the multiple request the first request + multiple retried requests?

Should the create factory interface be:

class RetryPolicyFactory {
  RetryPolicySharedPtr createRetryPolicy(const Http::HeaderMap& request_headers, <config>);
};

since the retry policy will consume response_header in the member function?

@snowp
Copy link
Copy Markdown
Contributor

snowp commented Feb 20, 2020

By tracking requests I mean upstream requests, i.e. each retry attempt. Right now the retry state logic is immutable, so calling wouldRetryHeaders multiple times return a consistent answer. If we want to let the retry policy modify its behavior as it sees more attempts come back, we'll want some kind of explicit API contract for that.

re the request headers yeah my thinking was that all the state that is known before the upstream requests are made should be passed to the factory, as it makes it possible to do some neat optimizations (eg return a noop RetryPolicy when the request headers imply that the request should not be retried)

@yxue
Copy link
Copy Markdown
Member Author

yxue commented Feb 21, 2020

I made some change based on the discussion. I didn't added

  // Might need something like this to support the existing hedging behavior, can probably 
  // skip for now and not call this during the hedging path.
  bool wouldRetry(const Http::HeaderMap& response_headers) const;

since I haven't fully understand what it's used for. If it looks good to you, I am going to work on test and doc. @snowp

Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. A few high level questions. Thank you!

/wait

oneof config_type {
google.protobuf.Struct config = 12 [deprecated = true];

google.protobuf.Any typed_config = 13;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this replace the entire policy above? As such should it really live as a oneof next to the policy definition where that gets defined? Or does this use part of the above policy?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry policy extension doesn't use above fields in the existing retry policy. I was thinking to use oneof to include existing retry policy and pluggable retry policy, like this:

message RetryPolicy{
// RetryPolicy definition
}
message PluggableRetryPolicy{
// PluggableRetryPolicy definition
}
message RouteAction{
  ...
  oneof retry_policy_speficier{
    RetryPolicy retry_policy = x;
    PluggableRetryPolicy pluggable_retry_policy = y;
  }
}

I thought we were going to replace existing retry policy with an extension which means we need to deprecate current retry policy definition in the future. Will using the oneof make the deprecation complicate? If putting them in one message, we could just deprecate fields without deprecating the whole message.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we separate the extension message from the above message, another problem is the retry plugin. Currently, the extension retry policy interface doesn't include the retry priority and host selection, which means that user cannot integrate the retry policy extension with retry policy extensions. In order to replace the whole message, I think we should make the extension interface the same as the core retry policy interface.

class RetryPolicy {
public:
// clang-format off
static const uint32_t RETRY_ON_5XX = 0x1;
static const uint32_t RETRY_ON_GATEWAY_ERROR = 0x2;
static const uint32_t RETRY_ON_CONNECT_FAILURE = 0x4;
static const uint32_t RETRY_ON_RETRIABLE_4XX = 0x8;
static const uint32_t RETRY_ON_REFUSED_STREAM = 0x10;
static const uint32_t RETRY_ON_GRPC_CANCELLED = 0x20;
static const uint32_t RETRY_ON_GRPC_DEADLINE_EXCEEDED = 0x40;
static const uint32_t RETRY_ON_GRPC_RESOURCE_EXHAUSTED = 0x80;
static const uint32_t RETRY_ON_GRPC_UNAVAILABLE = 0x100;
static const uint32_t RETRY_ON_GRPC_INTERNAL = 0x200;
static const uint32_t RETRY_ON_RETRIABLE_STATUS_CODES = 0x400;
static const uint32_t RETRY_ON_RESET = 0x800;
static const uint32_t RETRY_ON_RETRIABLE_HEADERS = 0x1000;
// clang-format on
virtual ~RetryPolicy() = default;
/**
* @return std::chrono::milliseconds timeout per retry attempt.
*/
virtual std::chrono::milliseconds perTryTimeout() const PURE;
/**
* @return uint32_t the number of retries to allow against the route.
*/
virtual uint32_t numRetries() const PURE;
/**
* @return uint32_t a local OR of RETRY_ON values above.
*/
virtual uint32_t retryOn() const PURE;
/**
* Initializes a new set of RetryHostPredicates to be used when retrying with this retry policy.
* @return list of RetryHostPredicates to use
*/
virtual std::vector<Upstream::RetryHostPredicateSharedPtr> retryHostPredicates() const PURE;
/**
* Initializes a RetryPriority to be used when retrying with this retry policy.
* @return the RetryPriority to use when determining priority load for retries, or nullptr
* if none should be used.
*/
virtual Upstream::RetryPrioritySharedPtr retryPriority() const PURE;
/**
* Number of times host selection should be reattempted when selecting a host
* for a retry attempt.
*/
virtual uint32_t hostSelectionMaxAttempts() const PURE;
/**
* List of status codes that should trigger a retry when the retriable-status-codes retry
* policy is enabled.
*/
virtual const std::vector<uint32_t>& retriableStatusCodes() const PURE;
/**
* @return std::vector<Http::HeaderMatcherSharedPtr>& list of response header matchers that
* will be checked when the 'retriable-headers' retry policy is enabled.
*/
virtual const std::vector<Http::HeaderMatcherSharedPtr>& retriableHeaders() const PURE;
/**
* @return std::vector<Http::HeaderMatcherSharedPt>& list of request header
* matchers that will be checked before enabling retries.
*/
virtual const std::vector<Http::HeaderMatcherSharedPtr>& retriableRequestHeaders() const PURE;
/**
* @return absl::optional<std::chrono::milliseconds> base retry interval
*/
virtual absl::optional<std::chrono::milliseconds> baseInterval() const PURE;
/**
* @return absl::optional<std::chrono::milliseconds> maximum retry interval
*/
virtual absl::optional<std::chrono::milliseconds> maxInterval() const PURE;
};

At least

  virtual std::chrono::milliseconds perTryTimeout() const PURE; 
  virtual uint32_t numRetries() const PURE; 
  virtual std::vector<Upstream::RetryHostPredicateSharedPtr> retryHostPredicates() const PURE; 
  virtual Upstream::RetryPrioritySharedPtr retryPriority() const PURE; 
  virtual uint32_t hostSelectionMaxAttempts() const PURE; 
  virtual absl::optional<std::chrono::milliseconds> baseInterval() const PURE; 
  virtual absl::optional<std::chrono::milliseconds> maxInterval() const PURE; 

from my perspective. WDYT @mattklein123 @snowp

Signed-off-by: Yan Xue <[email protected]>
yxue and others added 4 commits March 2, 2020 19:54
Signed-off-by: Yan Xue <[email protected]>
Signed-off-by: Yan Xue <[email protected]>
Signed-off-by: Yan Xue <[email protected]>
@yxue yxue marked this pull request as ready for review March 3, 2020 17:32
yxue added 2 commits March 9, 2020 20:59
Signed-off-by: Yan Xue <[email protected]>
Signed-off-by: Yan Xue <[email protected]>
Signed-off-by: Yan Xue <[email protected]>
@yxue
Copy link
Copy Markdown
Member Author

yxue commented Mar 9, 2020

I remove the retry policy extension implementation from the PR and keep the API only for review. I can send another PR to implement the API after this one gets merged.

@yxue yxue changed the title retry policy: make retry policy configurable retry policy: make retry policy pluggable Mar 10, 2020
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a ton for adding the docs. This is great. I've added a bunch of docs/API questions. I would definitely like @snowp to take a look as I'm still unsure about how the plugin and the core retry logic should interact. I'm concerned it's going to be confusing if we let them both coexist.

/wait


// Implementation specific configuration which depends on the implementation being instantiated.
// See the supported retry policy implementations for further documentation.
google.protobuf.Any typed_config = 11;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please ref link to the new docs you added from the field docs.

api.v2.route.HeaderMatcher http_request_header_match = 5;

// HTTP request trailer match configuration.
api.v2.route.HeaderMatcher http_request_trailer_match = 6;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity what would this actually be used for?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I don't need the trailer match now. The initial goal is to match the retry policy by request method.

//
// The above configuration shows how to configure different retry policies for
// different HTTP request methods.
repeated ConditionalRetryPolicy retry_policies = 1;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify that this is an or match? Also, given that we have an or match within the predicate, do we really need this so be repeated? Could this be a required singleton message?


Envoy allows :ref:`pluggable retry policy <envoy_api_field_route.RetryPolicy.typed_config>` to be
configured in the :ref:`route configuration <envoy_api_field_route.RouteAction.retry_policy>`. The
extenison applies additional logic to the retry decision logic when the downstream Envoy receives
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo extenison. Same below.

:ref:`retriable_status_codes <envoy_api_field_route.RetryPolicy.retriable_status_codes>`,
:ref:`retriable_headers<envoy_api_field_route.RetryPolicy.retriable_headers>` or
:ref:`retriable_request_headers<envoy_api_field_route.RetryPolicy.retriable_request_headers>`
is specific in :ref:`retry policy <envoy_api_field_route.RouteAction.retry_policy>`, the retry
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/specific/specified

:ref:`retriable_headers<envoy_api_field_route.RetryPolicy.retriable_headers>` or
:ref:`retriable_request_headers<envoy_api_field_route.RetryPolicy.retriable_request_headers>`
is specific in :ref:`retry policy <envoy_api_field_route.RouteAction.retry_policy>`, the retry
decision could be overridden by the core retry policy implementation.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part doesn't thrill me and I think could be pretty confusing. Should we just block setting these fields if a retry extension is provided? This is related to my comment/question about where the retry config should live. WDYT?

Comment on lines +148 to +149
The final retry decision is the **OR** value of extension retry policy decision and core retry
policy decision.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This contradicts the above statement around overriding, or is that because it's an or? Hmm. I'm pretty inclined to just have a retry plugin take over the whole thing? @snowp WDYT?

Comment on lines +153 to +156
The retry policy extension can count the retry number inside the extension. There is a global
:ref:`configuration<envoy_api_field_route.RetryPolicy.num_retries>` for controlling the retry
count. If the retry number inside the extension is greater than the global retry number, the
downstream Envoy won't retry after exceeding the gobal retry number.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit unclear to me. Can you clarify?

Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long term I'd like to see that we pull all the existing retry policies into extensions so that we avoid having to deal with both the extension and the core retry policy stuff. My thinking to begin with was that we could incrementally support both so that we didn't need to migrate all the existing matchers, but if that makes for a very confusing API then perhaps we should just make them exclusive and fully flesh out the extension?

MatchPredicate not_match = 3;

// The match configuration will always match.
bool any_match = 4 [(validate.rules).bool = {const: true}];
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make more sense to pull this out of the oneof and say that in the absence of any rules (or the absence of a MatchPredicate?) every request will be matched? Not sure if having it be part of the expression tree is helpful

bool any_match = 4 [(validate.rules).bool = {const: true}];

// HTTP request header match configuration.
api.v2.route.HeaderMatcher http_request_header_match = 5;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe skip the http_ prefix?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove it.

// conditions documented for
// :ref:`config_http_filters_router_x-envoy-retry-on` and
// :ref:`config_http_filters_router_x-envoy-retry-grpc-on`.
string retry_on = 2;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this make more sense as a repeated field? or is there a reason why we want to stick to the comma separated format?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think making this field an array makes sense to me.

MatchPredicate match_config = 1;

// Specifies the conditions under which retry takes place. These are the
// same
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: reflow comments

@yxue
Copy link
Copy Markdown
Member Author

yxue commented Mar 10, 2020

I agree on the point that allowing extension retry policy and core retry policy cause a confusing API for users. Let me refactor the PR

  1. Make the core retry policy and extension exclusive in the API
message RetryPolicy{
// RetryPolicy definition
}
message RetryPolicyExtension{
// RetryPolicyExtension definition
}
message RouteAction{
  ...
  oneof retry_policy_speficier{
    RetryPolicy retry_policy = x;
    RetryPolicyExtension retry_policy_extension = y;
  }
}
  1. Move functions to retry policy extension to enable replacement for the core retry policy.
class RetryPolicyExtension{
  /*
   * Record headers and reset during the request lifecycle.
   */
  virtual void recordHeader(const Http::HeaderMap& response_headers);
  virtual void recordReset(const Http::StreamResetReason reset_reason);
  
  /*
   * Functions for querying the retry decision.
   */
  virtual bool wouldRetryFromHeaders() PURE;
  virtual bool wouldRetryFromReset() PURE;

  /* 
   * Functions for retry plugins.
   */
  virtual void onHostAttempted(Upstream::HostDescriptionConstSharedPtr host) PURE;
  virtual bool shouldSelectAnotherHost(const Upstream::Host& host) PURE;
  virtual const Upstream::HealthyAndDegradedLoad& priorityLoadForRetry
        (const Upstream::PrioritySet& priority_set, const Upstream::HealthyAndDegradedLoad& original_priority_load) PURE;
  virtual uint32_t hostSelectionMaxAttempts() const PURE;

  /*
   * Retry remainings.
   */
  virtual uint32_t retriesRemaining() PURE;

  /*
   * Retry backoff.
   */
   virtual std::unique_ptr<BackOffStrategy> backoffStrategy() PURE;

  /*
   * Per try timeout.
   */
  virtual std::chrono::milliseconds perTryTimeout() const PURE;
};

WDYT @mattklein123 @snowp

@mattklein123
Copy link
Copy Markdown
Member

Yes ^ looks great and is I think what I would expect from an API/user perspective.

yxue added 6 commits March 10, 2020 18:46
Signed-off-by: Yan Xue <[email protected]>
Signed-off-by: Yan Xue <[email protected]>
Signed-off-by: Yan Xue <[email protected]>
Signed-off-by: Yan Xue <[email protected]>
@yxue
Copy link
Copy Markdown
Member Author

yxue commented Mar 11, 2020

I just push the commit to refactor the RetryPolicy code and I haven't changed the test. The change seems to be huge. I am not sure if it's on the right track. I think splitting the PR into several small PRs would be helpful for review. Here is what I thought:

  1. Add RetryPolicy interface and change current core retry policy to implement the RetryPolicy. The core retry policy (currently RetryPolicy, maybe renamed to CoreRetryPolicy) and retry policy extension will both implement the same interface.
  2. Add RetryPolicyExtension API (this could be merged with the first step)
  3. Integrate retry policy extension with the retry_state, router and config_impl, etc.
  4. Add retry policy extension example

WDYT? @mattklein123 @snowp

Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yxue! In general this looks awesome and I think is what I had in mind. I agree with your plan on splitting the PR up into multiple PRs. Sounds awesome! @snowp do you have any high level comments before we get started with that? Thank you!

/wait

// route level entry will take precedence over this config and it'll be treated
// independently (e.g.: values are not inherited).
RetryPolicy retry_policy = 16;
oneof retry_policy_specifier {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A oneof upgrade is a breaking compile change so we can't do it. I would just add the new field and have a check in code that both can't be specified. There is a oneof upgrade annotation that can be applied so it gets upgraded in the next major version.

// Implementation specific configuration which depends on the implementation being instantiated.
// Documentation for :ref:`retry policy extension <arch_overview_http_routing_retry_policy>`. See
// the supported retry policy implementations for further documentation.
google.protobuf.Any typed_config = 1;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we think this is going to be the only field (it probably will be) I would just inline this directly in the message above?

Copy link
Copy Markdown
Contributor

@snowp snowp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this aligns with my thinking as well, thanks for iterating!

@stale
Copy link
Copy Markdown

stale bot commented Mar 19, 2020

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Mar 19, 2020
@stale
Copy link
Copy Markdown

stale bot commented Mar 26, 2020

This pull request has been automatically closed because it has not had activity in the last 14 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot closed this Mar 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api stale stalebot believes this issue/PR has not been touched recently waiting

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants