Conversation

@PGijsbers (Collaborator) commented Apr 28, 2021

This PR implements a non-linear retry policy that is used when downloading from the server fails.

Users of the AutoML benchmark regularly see the OpenML server fail when running parallel instances of the benchmark.
Each parallel instance downloads a task and dataset (and their meta-data), which leads to many server requests.
These requests typically arrive at roughly the same time because the job deployment is fully automated.
While these issues can be tackled in the benchmark code, I think they should be addressed by OpenML.
After all, we want to support a large number of concurrent users (automated or otherwise).

The previous policy was linear, which meant that even if the first few connections failed, the retry time stayed relatively low.
This results in a short total wait, which might not give the server enough time to handle its previous requests (e.g. large data downloads).
The previous policy was also deterministic, which means the load of requests is not distributed well.

In this PR I propose a retry policy based on the sigmoid function: it allows quick retries at first (to keep the experience good for interactive users) but slows down quickly afterwards.
It also introduces random fluctuations to avoid creating more artificial peaks.

While I try to provide a good experience for both interactive users and automated workloads, perhaps we should consider a retry policy setting in our configuration file.
We could distinguish a policy for interactive users (for whom we prioritize a quick response and randomization is not necessary) from a policy for robots, which is more liberal with waiting times and can have more extreme randomization.
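For illustration, such a setting could live in the existing configuration file; the key names and values below are a hypothetical sketch of what that could look like, not a decided API:

```ini
; ~/.openml/config (hypothetical keys, for illustration only)
retry_policy = robot        ; or: human, for interactive use
connection_n_retries = 50   ; robots can afford more retries with longer waits
```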

Wait times for 20 retries (noise excluded):

0 1.0  # minimum wait time
1 1.133857018485697
2 2.359724199241831
3 3.948517463551336
4 6.384058440442351
5 10.378828427399903
6 16.0
7 21.621171572600097
8 25.615941559557648
9 28.051482536448667
10 29.640275800758168
11 30.866142981514304
12 31.950547536867308
13 32.98177897611199
14 33.99329299739067
15 34.99753210848027
16 35.99909204262595
17 36.99966597156304
18 37.99987711650796
19 38.99995479351404
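The numbers above are consistent with a linear term plus a logistic step of height 20 centred at retry 6, clipped to a minimum of 1 second. The sketch below reproduces the table exactly; the constants are inferred from the listed values, not taken from the final implementation:

```python
import math

def sigmoid_wait(retry: int) -> float:
    """Base wait time in seconds before the given retry, noise excluded.

    Inferred form: `retry` seconds plus a sigmoid of height 20 with its
    inflection point at retry 6, with a minimum wait of 1 second.
    """
    return max(1.0, retry + 20 / (1 + math.exp(6 - retry)))

if __name__ == "__main__":
    for n in range(20):
        print(n, sigmoid_wait(n))
```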

Edit: This is a draft because it's not yet tested and I wanted to talk through the suggestions first.

@PGijsbers PGijsbers requested a review from mfeurer April 28, 2021 14:30
@mfeurer (Collaborator) commented Apr 29, 2021

This is a great idea. I think I'd move the inflection point to the middle of the number of retries and try to grow less fast, but that's a matter of fine-tuning. How much is the noise that's added actually?

@PGijsbers (Collaborator, Author) commented:

I think I'd move the inflection point to the middle of the number of retries

Not sure. I think after a certain number of retries the retry time should just be high, no matter how many retries you allow. The number of retries is also kind of indicative of how long you are prepared to wait. Moving the inflection point decouples that slightly: if a user is prepared to wait for e.g. an hour and sets a really high retry count (assuming we remove the cap at 20 retries; I'm not sure if we should keep it and/or raise it), it would also proportionally ddos the server at the start (e.g. 120 retries would then mean 60 tries in quick succession before the inflection point kicks in). But I would be okay with moving the inflection point to be a few retries later.

and try to grow less fast

Maybe slightly, but I'm not convinced we should spam the server if there are >= 4 failed tries already. But here, first and foremost, I think the distinction between interactive and automated users is the most pressing issue.

How much is the noise that's added actually?

I currently added N(0, 0.1 * wait_time) noise, so that retry points can diverge more over time. One random run:

base: 1.00,  noise:  0.00
base: 1.13,  noise:  0.05
base: 2.36,  noise:  0.23
base: 3.95,  noise: -0.21
base: 6.38,  noise:  1.15
base: 10.38, noise: -0.38
base: 16.00, noise:  1.39
base: 21.62, noise:  0.17
base: 25.62, noise:  1.02
base: 28.05, noise:  1.20
base: 29.64, noise:  4.42
base: 30.87, noise: -5.66
base: 31.95, noise: -2.16
base: 32.98, noise: -3.72
base: 33.99, noise: -6.80
base: 35.00, noise:  2.10
base: 36.00, noise:  1.71
base: 37.00, noise: -2.01
base: 38.00, noise: -3.38
base: 39.00, noise:  1.93
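As a sketch, adding that N(0, 0.1 * wait_time) noise on top of a base wait could look like the following (`noisy_wait` and the clamp at zero are illustrative choices, not the merged implementation):

```python
import random

def noisy_wait(base: float, rel_std: float = 0.1) -> float:
    """Add Gaussian noise with standard deviation proportional to the
    base wait, so parallel clients drift apart over time instead of
    retrying in lockstep. Clamped at zero to avoid negative waits."""
    return max(0.0, base + random.gauss(0, rel_std * base))
```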

@mfeurer (Collaborator) commented Apr 30, 2021

I'm not convinced we should spam the server if there are >= 4 failed tries already

That's a good point, convinced!

But here, first and foremost, I think the distinction between interactive and automated users is the most pressing issue.

Got it after thinking more about this. Yes, it'll make sense if you add this distinction and increase the retry times for unit tests.

I currently added N(0, 0.1 * wait_time) noise, so that retry points can diverge more over time. One random run:

Looks fine at first sight; maybe the noise can even be increased further?

PGijsbers added 3 commits May 3, 2021 10:21
Retry_policy is not yet used to affect retry frequency.
Max_retries is removed because it's mostly useless: server overload is
caused by many users (or parallel scripts), not by a single user
reconnecting. The retry policies will have acceptable times between
tries either way.
PGijsbers added 2 commits May 6, 2021 11:32
Some unit tests needed to be updated.
In particular, the hash-exception unit test should not use the ``robot``
policy: since we expect a hash exception there, that policy would make it
query and wait unnecessarily.
@PGijsbers PGijsbers marked this pull request as ready for review May 6, 2021 13:05
max_retries = int(_defaults["max_retries"])


def set_retry_policy(value: str, n_retries: Optional[int] = None) -> None:
@PGijsbers (Collaborator, Author) commented:

Because connection_n_retries and retry_policy are linked, I added a function to change them together.
I think we'll want to rework all configuration properties at some point anyway so they are only set through functions.
Such a function could then also optionally write the new setting directly to the configuration file, and should check that the value is valid (this would in turn simplify the cli code).

@PGijsbers PGijsbers merged commit a505162 into develop May 10, 2021
@PGijsbers PGijsbers deleted the retry_policy branch May 10, 2021 07:33
PGijsbers added a commit to Mirkazemi/openml-python that referenced this pull request Feb 23, 2023
Create a second configurable retry policy. The configuration now allows for a `human` and `robot` retry policy, intended for interactive use and scripts, respectively.