Conversation

@PGijsbers (Collaborator) commented Apr 28, 2021

This PR implements a non-linear retry policy that is used when downloading from the server fails.

Users of the AutoML benchmark regularly see the OpenML server fail when running parallel instances of the benchmark.
Each parallel instance downloads a task and dataset (and their meta-data), which leads to many server requests.
These requests typically arrive at roughly the same time because the job deployment is fully automated.
While these issues can be tackled in the benchmark code, I think they should be addressed by OpenML.
After all, we want to support a large number of concurrent users (automated or otherwise).

The previous policy was linear, which meant that even if the first few connections failed, the retry time stayed relatively low.
This results in a short total wait, which might not give the server enough time to handle its previous requests (e.g. large data downloads).
The previous policy was also deterministic, which means the load of requests is not distributed well.

In this PR I propose a retry policy based on the sigmoid function: it allows quick retries at first (to keep the experience good for interactive users) but slows down quickly afterwards.
It also introduces random fluctuations to avoid creating more artificial peaks.

While I try to provide a good experience for both interactive users and automated workloads, perhaps we should consider a retry policy setting in our configuration file.
We could distinguish a policy for interactive users (for whom we prioritize a quick response and randomization is not necessary) from a policy for robots, which is more liberal with waiting times and can have more extreme randomization.
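For illustration, such a setting could live in the existing configuration file; the key names and values below are a hypothetical sketch of what that could look like, not a decided API:

```ini
; ~/.openml/config (hypothetical keys, for illustration only)
retry_policy = robot        ; or: human, for interactive use
connection_n_retries = 50   ; robots can afford more retries with longer waits
```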

Wait times for 20 retries (noise excluded):

0 1.0  # minimum wait time
1 1.133857018485697
2 2.359724199241831
3 3.948517463551336
4 6.384058440442351
5 10.378828427399903
6 16.0
7 21.621171572600097
8 25.615941559557648
9 28.051482536448667
10 29.640275800758168
11 30.866142981514304
12 31.950547536867308
13 32.98177897611199
14 33.99329299739067
15 34.99753210848027
16 35.99909204262595
17 36.99966597156304
18 37.99987711650796
19 38.99995479351404
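The numbers above are consistent with a linear term plus a logistic step of height 20 centred at retry 6, clipped to a minimum of 1 second. The sketch below reproduces the table exactly; the constants are inferred from the listed values, not taken from the final implementation:

```python
import math

def sigmoid_wait(retry: int) -> float:
    """Base wait time in seconds before the given retry, noise excluded.

    Inferred form: `retry` seconds plus a sigmoid of height 20 with its
    inflection point at retry 6, with a minimum wait of 1 second.
    """
    return max(1.0, retry + 20 / (1 + math.exp(6 - retry)))

if __name__ == "__main__":
    for n in range(20):
        print(n, sigmoid_wait(n))
```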

Edit: This is a draft because it's not yet tested and I wanted to talk through the suggestions first.

@PGijsbers PGijsbers requested a review from mfeurer April 28, 2021 14:30
@mfeurer (Collaborator) commented Apr 29, 2021

This is a great idea. I think I'd move the inflection point to the middle of the number of retries and try to grow less fast, but that's a matter of fine-tuning. How much is the noise that's added actually?

@PGijsbers (Collaborator, Author) commented:

I think I'd move the inflection point to the middle of the number of retries

Not sure. I think after a certain number of retries the retry time should just be high, no matter how many retries you allow. The number of retries is also kind of indicative of how long you are prepared to wait. Moving the inflection point decouples that slightly: if a user is prepared to wait for e.g. an hour and sets a really high retry count (assuming we remove the cap at 20 retries; I'm not sure if we should keep it and/or raise it), it would also proportionally ddos the server at the start (e.g. 120 retries would then mean 60 tries in quick succession before the inflection point kicks in). But I would be okay with moving the inflection point to be a few retries later.

and try to grow less fast

Maybe slightly, but I'm not convinced we should spam the server if there are >= 4 failed tries already. But here, first and foremost, I think the distinction between interactive and automated users is the most pressing issue.

How much is the noise that's added actually?

I currently added N(0, 0.1 * wait_time) noise, so that retry points can diverge more over time. One random run:

base: 1.00,  noise:  0.00
base: 1.13,  noise:  0.05
base: 2.36,  noise:  0.23
base: 3.95,  noise: -0.21
base: 6.38,  noise:  1.15
base: 10.38, noise: -0.38
base: 16.00, noise:  1.39
base: 21.62, noise:  0.17
base: 25.62, noise:  1.02
base: 28.05, noise:  1.20
base: 29.64, noise:  4.42
base: 30.87, noise: -5.66
base: 31.95, noise: -2.16
base: 32.98, noise: -3.72
base: 33.99, noise: -6.80
base: 35.00, noise:  2.10
base: 36.00, noise:  1.71
base: 37.00, noise: -2.01
base: 38.00, noise: -3.38
base: 39.00, noise:  1.93
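As a sketch, adding that N(0, 0.1 * wait_time) noise on top of a base wait could look like the following (`noisy_wait` and the clamp at zero are illustrative choices, not the merged implementation):

```python
import random

def noisy_wait(base: float, rel_std: float = 0.1) -> float:
    """Add Gaussian noise with standard deviation proportional to the
    base wait, so parallel clients drift apart over time instead of
    retrying in lockstep. Clamped at zero to avoid negative waits."""
    return max(0.0, base + random.gauss(0, rel_std * base))
```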

@mfeurer (Collaborator) commented Apr 30, 2021

I'm not convinced we should spam the server if there are >= 4 failed tries already

That's a good point, convinced!

But here, first and foremost, I think the distinction between interactive and automated users is the most pressing issue.

Got it after thinking more about this. Yes, it'll make sense if you add this distinction and increase the retry times for unit tests.

I currently added N(0, 0.1 * wait_time) noise, so that retry points can diverge more over time. One random run:

Looks fine at first sight; maybe the noise can even be increased further?

PGijsbers added 3 commits May 3, 2021 10:21
Retry_policy is not yet used to affect retry frequency.
Max_retries is removed because it's mostly useless: server overload is
caused by many users (or parallel scripts), not by a single user
reconnecting. The retry policies will have acceptable times between
tries either way.
PGijsbers added 2 commits May 6, 2021 11:32
Some unit tests needed to be updated.
In particular, the hash-exception unit test should not use the ``robot``
policy: since we expect a hash exception there, that policy would make it
query and wait unnecessarily.
@PGijsbers PGijsbers marked this pull request as ready for review May 6, 2021 13:05
max_retries = int(_defaults["max_retries"])


def set_retry_policy(value: str, n_retries: Optional[int] = None) -> None:
@PGijsbers (Collaborator, Author) commented:

Because connection_n_retries and retry_policy are linked, I added a function to change them together.
I think we'll want to rework all configuration properties at some point anyway so they are only set through functions.
Such a function could then also optionally write the new setting directly to the configuration file, and should check that the value is valid (this would in turn simplify the cli code).

@PGijsbers PGijsbers merged commit a505162 into develop May 10, 2021
@PGijsbers PGijsbers deleted the retry_policy branch May 10, 2021 07:33
PGijsbers added a commit to Mirkazemi/openml-python that referenced this pull request Feb 23, 2023
Create a second configurable retry policy. The configuration now allows for a `human` and `robot` retry policy, intended for interactive use and scripts, respectively.