feat: implement exponential random retry strategy #225
Conversation
backoff_value = self.backoff_factor * (self.exponential_base ** consecutive_errors_len) + self._jitter_delay()
# Full Jitter strategy
backoff_value = self.backoff_factor * (self.exponential_base ** consecutive_errors_len) * self._random()
I propose to change the implementation to compute the next retry delay this way:
def nextDelay(attempt, options):  # attempt == 1 means called for the first time
    delay_range = options.first_retry_range
    i = 1
    while i < attempt:
        i += 1
        delay_range = delay_range * options.exponential_base
        if delay_range > options.max_retry_delay:
            break
    delay = options.min_retry_delay + (delay_range - options.min_retry_delay) * random()  # at least min_retry_delay
    delay = min(options.max_retry_delay, delay)  # at most max_retry_delay
    return delay

Additionally, the implementation must ensure that the request is not scheduled for retries after options.max_retry_time has elapsed (max_request_time if possible).
options.max_retry_time may be the only meaningful configurable value from the library user's POV. Setting it to 0 disables retry.
These could be the defaults:
options.first_retry_range = 5 seconds
options.exponential_base = 2
options.max_retry_delay = 125 seconds
options.min_retry_delay = 1 second
options.max_retry_time = 180 seconds
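A minimal Python sketch of the proposed nextDelay algorithm with these defaults (the logic follows the comment above, but the constant and function names are mine; this is an illustration, not the PR's actual implementation):

```python
import random

# Suggested defaults from the comment above (seconds).
FIRST_RETRY_RANGE = 5.0
EXPONENTIAL_BASE = 2
MAX_RETRY_DELAY = 125.0
MIN_RETRY_DELAY = 1.0

def next_delay(attempt):
    """Delay before retry number `attempt` (1 means called for the first time)."""
    delay_range = FIRST_RETRY_RANGE
    i = 1
    while i < attempt:
        i += 1
        delay_range *= EXPONENTIAL_BASE
        if delay_range > MAX_RETRY_DELAY:
            break
    # Random value that is at least min_retry_delay ...
    delay = MIN_RETRY_DELAY + (delay_range - MIN_RETRY_DELAY) * random.random()
    # ... and at most max_retry_delay.
    return min(MAX_RETRY_DELAY, delay)
```

With these defaults the delay for attempt 1 falls in [1, 5) seconds, attempt 2 in [1, 10), and later attempts are clamped to at most 125 seconds.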
This delay function does not guarantee that the delay is increasing. If the generated random value is much smaller than in the previous attempt, the resulting delay is also smaller. I have no smart proposal for how to fix this at the moment though :(
I propose this simple modification to the above algorithm to ensure that delay values are increasing and increasing enough.
+ def randomArbitrary(min, max) {
+ return random() * (max - min) + min;
+ }
...
- delay = options.min_retry_delay + (range - options.min_retry_delay) * random() /* at least min_retry_delay */
+ delay = options.min_retry_delay + (range - options.min_retry_delay) * options.random /* at least min_retry_delay */
...
+ options.random = randomArbitrary(0.5, 1.0)

Or similarly in the PR, like:
+ self.random = randomArbitrary(0.5, 1.0)
...
- backoff_value = self.backoff_factor * (self.exponential_base ** consecutive_errors_len) * self._random()
+ backoff_value = self.backoff_factor * (self.exponential_base ** consecutive_errors_len) * self.random
The Full Jitter algorithm is described in https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/ and https://github.com/aws-samples/aws-arch-backoff-simulator/blob/66cb169277051eea207dbef8c7f71767fe6af144/src/backoff_simulator.py#L40
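For reference, the Full Jitter strategy from the linked AWS article can be sketched as follows (parameter names follow the AWS simulator; the default values here are illustrative):

```python
import random

def full_jitter(attempt, base=0.5, cap=125.0):
    """Full Jitter: uniform random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Because the lower bound is always 0, Full Jitter spreads load well across clients, but an individual client's delays are not monotonically increasing.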
Anyway, either in the algorithm in this PR or in the one proposed by sranka, I think the random value needs to be "controlled" somehow, to avoid the delay dropping when the generated random value is much smaller than in the previous retry. See #225 (comment)
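The "controlled random" idea can be illustrated with a short Python sketch (names are hypothetical): drawing the jitter factor from [0.5, 1.0) instead of [0.0, 1.0) guarantees increasing delays when the base is 2, because the smallest possible delay of attempt n+1 equals the largest possible delay of attempt n.

```python
import random

def random_arbitrary(lo, hi):
    """Uniform random value in [lo, hi)."""
    return random.random() * (hi - lo) + lo

def backoff(attempt, backoff_factor=5.0, exponential_base=2):
    # The jitter factor is confined to [0.5, 1.0), so with base 2 the minimum
    # of attempt n+1 (factor 0.5) equals the maximum of attempt n (factor 1.0):
    # consecutive delays can never decrease.
    return backoff_factor * exponential_base ** attempt * random_arbitrary(0.5, 1.0)
```

For any other base b, the same guarantee holds as long as the jitter interval is [1/b, 1.0).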
def increment(self, method=None, url=None, response=None, error=None, _pool=None, _stacktrace=None):
    """Return a new Retry object with incremented retry counters."""
    if self.retry_timeout < datetime.now():
Does it also react the same way when retry is disabled? (max_retry_time is 0)
Yes, max_retry_time=0 means retry is disabled; here is the test:
Codecov Report
@@ Coverage Diff @@
## master #225 +/- ##
==========================================
+ Coverage 89.96% 91.06% +1.10%
==========================================
Files 26 26
Lines 2003 2038 +35
==========================================
+ Hits 1802 1856 +54
+ Misses 201 182 -19
Continue to review full report at Codecov.
while i <= consecutive_errors_len:
    i += 1
    delay_range = delay_range * self.exponential_base
    if delay_range > self.max_retry_delay:
@alespour had a good point that the delays should be increasing (on average). This condition makes that hard to achieve, since the delay range stays the same after a fixed number of attempts; the delays then oscillate randomly around self.max_retry_delay/2. This can be fixed by restricting the delay range with a large number instead:
- if delay_range > self.max_retry_delay:
+ if delay_range > 100_000_000:
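The effect of the suggested change can be illustrated with a standalone sketch of the delay computation (hypothetical code, not the PR's implementation). With the range capped at max_delay, delay_range stops growing after a few attempts and delays stay spread out below the cap; with a very large cap, min(max_delay, delay) almost surely clamps later delays to max_delay, so delays keep increasing on average.

```python
import random

def delay_for(attempt, range_cap, first=5_000, base=2,
              min_delay=1_000, max_delay=125_000):
    """Delay (ms) before retry `attempt` (1-based), with the range growth
    stopped once delay_range exceeds range_cap."""
    delay_range = first
    for _ in range(attempt - 1):
        delay_range *= base
        if delay_range > range_cap:
            break
    # Random delay in [min_delay, delay_range), clamped to max_delay.
    delay = min_delay + (delay_range - min_delay) * random.random()
    return min(max_delay, delay)
```

For example, delay_for(20, range_cap=125_000) is roughly uniform over [1_000, 125_000], while delay_for(20, range_cap=100_000_000) is almost always exactly 125_000.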
Codecov Report
@@ Coverage Diff @@
## master #225 +/- ##
==========================================
+ Coverage 89.96% 91.06% +1.10%
==========================================
Files 26 26
Lines 2003 2037 +34
==========================================
+ Hits 1802 1855 +53
+ Misses 201 182 -19
Continue to review full report at Codecov.
* - **max_retries**
  - the number of max retries when write fails
  - ``3``
  - ``10``
* - **exponential_base**
  - the base for the exponential retry delay, the next delay is computed as ``retry_interval * exponential_base^(attempts-1) + random(jitter_interval)``
  - ``5``
  - the base for the exponential retry delay, the next delay is computed using random exponential backoff. Example for ``retry_interval=5_000, exponential_base=2, max_retry_delay=125_000, total=5``: retry delays are randomly distributed values within the ranges of ``[5_000-10_000, 10_000-20_000, 20_000-40_000, 40_000-80_000, 80_000-125_000]``
Please add a note describing the formula used to compute the delay.
Proposed Changes
This PR changes the default retry strategy to Full Jitter.
Related discussion is influxdata/influxdb#19722 (comment)
Original retry formula:
retry_interval * exponential_base^(attempts-1) + random(jitter_interval)

Proposed exponential random retry formula:
Retry delay is calculated as a random value within the interval
[retry_interval * exponential_base^(attempts-1), retry_interval * exponential_base^(attempts)]

Example for retry_interval=5_000, exponential_base=2, max_retry_delay=125_000:
retry delays are randomly distributed values within the ranges of
[5_000-10_000, 10_000-20_000, 20_000-40_000, 40_000-80_000, 80_000-125_000]

Checklist
pytest tests completes successfully
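The example ranges in the description above can be reproduced with a short calculation (variable names mirror the description; this is illustrative, not the PR's code):

```python
# Delay interval for each attempt per the proposed formula:
# [retry_interval * base^(attempts-1), retry_interval * base^attempts],
# with the upper bound clamped to max_retry_delay.
retry_interval = 5_000
exponential_base = 2
max_retry_delay = 125_000

ranges = []
for attempts in range(1, 6):
    lo = retry_interval * exponential_base ** (attempts - 1)
    hi = min(retry_interval * exponential_base ** attempts, max_retry_delay)
    ranges.append((lo, hi))

# ranges == [(5000, 10000), (10000, 20000), (20000, 40000),
#            (40000, 80000), (80000, 125000)]
```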