Create a non-linear retry policy. #1065
Conversation
This is a great idea. I think I'd move the inflection point to the middle of the number of retries and grow less fast, but that's a matter of fine-tuning. How much noise is actually added?
Not sure. I think after a certain number of retries the retry time should just be high, no matter how many retries you allow. The number of retries is also somewhat indicative of how long you are prepared to wait. In this case you decouple that slightly, meaning that if a user is prepared to wait for e.g. an hour and sets a really high
Maybe slightly, but I'm not convinced we should spam the server if there are >= 4 failed tries already. But here, first and foremost, I think the distinction between an interactive and an automated user is the most painful point.
I currently added N(0, 0.1 *
That's a good point, convinced!
Got it after thinking more about this. Yes, it makes sense if you add this distinction and increase the retry times for unit tests.
Looks fine at first sight; maybe the noise can even be increased further?
`retry_policy` is not yet used to affect the retry frequency. `max_retries` is removed because it is mostly useless: server overload is caused by many users (or parallel scripts), not by a single user reconnecting. The retry policies will have acceptable times between tries either way.
Some unit tests needed to be updated. In particular, for the hash exception unit test we don't want to use the `robot` policy: since we expect a hash exception, that policy would only make the test query and wait unnecessarily.
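A minimal sketch of what such a test adjustment could look like, assuming the `set_retry_policy` helper discussed below lives in `openml.config`; the test class and its setup/teardown are hypothetical:

```python
import unittest

import openml


class TestDatasetHash(unittest.TestCase):
    """Hypothetical test case: a hash mismatch is expected, so retrying is pointless."""

    def setUp(self):
        # Switch to the quick 'human' policy with a single try, so the expected
        # hash exception surfaces immediately instead of triggering the long
        # 'robot' waiting times.
        openml.config.set_retry_policy("human", n_retries=1)

    def tearDown(self):
        # Restore the default 'human' policy so other tests are unaffected.
        openml.config.set_retry_policy("human")
```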
The changed lines under review:

`max_retries = int(_defaults["max_retries"])`

`def set_retry_policy(value: str, n_retries: Optional[int] = None) -> None:`
Because `connection_n_retries` and `retry_policy` are linked, I added a function to change them together.
I think we'll want to rework all configuration properties at some point anyway so that they can only be set through functions.
The function could then also optionally write the new setting directly to the configuration file, and should check that the value is valid (this would in turn simplify the CLI code).
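As a rough sketch of what such a helper could look like (the signature matches the line in the diff above; the module-level globals, default retry counts, and validation details are illustrative assumptions):

```python
from typing import Optional

# Illustrative defaults; the real numbers and module layout are assumptions.
_POLICY_DEFAULT_RETRIES = {"human": 5, "robot": 50}

retry_policy = "human"
connection_n_retries = _POLICY_DEFAULT_RETRIES[retry_policy]


def set_retry_policy(value: str, n_retries: Optional[int] = None) -> None:
    """Set the retry policy and its linked number of retries together."""
    global retry_policy, connection_n_retries
    if value not in _POLICY_DEFAULT_RETRIES:
        raise ValueError(f"Invalid retry policy: {value!r}")
    if n_retries is not None and n_retries < 1:
        raise ValueError("n_retries must be a positive integer.")
    retry_policy = value
    connection_n_retries = (
        n_retries if n_retries is not None else _POLICY_DEFAULT_RETRIES[value]
    )
```

Interactive sessions could then call something like `set_retry_policy("human", n_retries=3)`, while automated scripts use `set_retry_policy("robot")`; whether the helper should also persist the value to the configuration file is left open, as noted above.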
Create a second configurable retry policy. The configuration now allows for a `human` and `robot` retry policy, intended for interactive use and scripts, respectively.
This PR implements a non-linear retry policy that is used when downloading from the server fails.
Users of the AutoML benchmark find the OpenML server failing regularly when using parallel instances of the benchmark.
Each parallel instance downloads a task and dataset (and its meta-data), which leads to many server requests.
These requests typically arrive at roughly the same time because the job deployment is fully automated.
While these issues could be tackled in the benchmark code, I think they should be addressed by OpenML.
After all, we want to support a large number of concurrent users (automated or otherwise).
The previous policy was linear, which meant that even if the first few connections failed, the retry time was still relatively low.
This led to a low total wait time, which might not be enough for the server to finish handling its earlier requests (e.g. large data downloads).
The previous policy was also deterministic, which means the load of requests is not distributed well over time.
In this PR I propose a retry policy based on the sigmoid function: it allows quick retries at first (to keep the experience good for interactive users) but slows down quickly after that.
It also introduces random fluctuations to avoid creating more artificial peaks.
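To make this concrete, here is a minimal sketch of a sigmoid-based wait time with Gaussian noise, compared against a linear policy; the constants (maximum wait, inflection point, steepness, noise fraction) are illustrative assumptions, not the exact values used in this PR:

```python
import math
import random


def linear_wait(retry: int, base: float = 1.0) -> float:
    """Old-style policy: wait time grows linearly with the retry count."""
    return base * retry


def sigmoid_wait(
    retry: int,
    max_wait: float = 60.0,
    midpoint: float = 4.0,
    steepness: float = 1.0,
    noise_fraction: float = 0.1,
) -> float:
    """Sigmoid-shaped policy: quick first retries, then a rapid ramp-up
    towards ``max_wait``. Gaussian noise spreads out parallel clients.
    All constants here are illustrative assumptions.
    """
    wait = max_wait / (1.0 + math.exp(-steepness * (retry - midpoint)))
    noise = random.gauss(0, noise_fraction * wait)
    return max(0.0, wait + noise)


if __name__ == "__main__":
    for retry in range(1, 21):
        print(
            f"retry {retry:2d}: linear {linear_wait(retry):5.1f}s, "
            f"sigmoid {sigmoid_wait(retry):5.1f}s"
        )
```

With these assumed constants the first couple of retries stay in the low seconds, later retries saturate near the maximum wait, and the noise term spreads simultaneous clients apart.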
While I try to provide a good experience for both interactive users and automated workloads, perhaps we should consider a `retry policy` setting in our configuration file. We could make a distinction for `interactive` users (for which we prioritize a quick response and randomization is not necessary), and a policy for `robots` which is more liberal with waiting times and can have more extreme randomization.

Wait times for 20 retries (minus the noise):
Edit: this is a draft because it is not yet tested and I wanted to talk through the suggestions first.