Add some jitter to pkg/retry #2444
Conversation
Reviewer's Guide: This PR augments the existing retry mechanism by injecting a random jitter of ±5% into each retry delay calculation, preventing synchronized retry storms without altering the overall expected delay duration.
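For context, a symmetric jitter like the one described in the guide could look roughly like this Go sketch; the `addJitter` helper and its placement are illustrative assumptions, not the actual pkg/retry code.

```go
package retry

import (
	"math/rand"
	"time"
)

// addJitter scales delay by a random factor in [0.95, 1.05), so that
// concurrent clients retry at slightly different times while the expected
// value of the delay stays unchanged.
func addJitter(delay time.Duration) time.Duration {
	factor := 0.95 + 0.1*rand.Float64() // uniform in [0.95, 1.05)
	return time.Duration(float64(delay) * factor)
}
```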
Hey @mtrmac - I've reviewed your changes and they look great!
Here's what I looked at during the review
- 🟢 General issues: all looks good
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟢 Documentation: all looks good
Force-pushed from d833c2e to 0edf0bc.
Luap99 left a comment:
Why exactly is this needed?
This option is exposed to end users directly via --retry-delay in the podman CLI and containers.conf; magically changing the value to add jitter seems unexpected. At the very least I would expect the documentation to reflect it, and I would also rather have it be the minimum delay, so that we never wait less than the user asked for.
And I guess this will break the podman tests that match the log message as well.
libimage/manifests/manifests_test.go (outdated)
```diff
-	// Must show warning where libimage is going to retry 5 times with 1s delay
-	assert.Contains(t, logString, "Failed, retrying in 1s ... (1/5)", "warning not matched")
+	// Must show warning where libimage is going to retry 5 times with ~1s delay
+	assert.Regexp(t, `Failed, retrying in (1\.[0-9]*s|[0-9].*ms) ... \(1/5\)`, logString, "warning not matched")
```
This feels aesthetically unappealing to me at the very least. If I set the delay to 1s and it logs some ms I will be somewhat confused.
Yes … OTOH being inaccurate about the delay could be very confusing when analyzing logs; and the API lets users specify the timeout to an arbitrary time.Duration precision.
I think “if the user-specified or default timeout is specified in whole seconds, round to 10 ms” could work well here — almost all users are specifying whole seconds, or using the default ((2^attempt) whole seconds); and the hypothetical extreme users would not lose precision.
Compare the retries / jitter discussion in https://sre.google/sre-book/addressing-cascading-failures/ . I think being considerate to servers would be nice.
I agree that this is unexpected. At the API level, it would be trivial to add an opt-in, but this is mostly controlled directly from the CLI, and I don’t think it makes much sense to make this a user-visible option; I’d recommend opting all 3 callers in. So, upsides / downsides. I don’t feel that strongly about this, and I wouldn’t mind closing this PR. Let’s make this decision, and then I can fine-tune.
That’s fair.
I don’t really have a preference.
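As an illustration of the “round to 10 ms” idea suggested above, a sketch could look like the following; `roundedDelay` is a hypothetical helper name, not something the PR actually adds.

```go
package retry

import "time"

// roundedDelay applies the rounding rule to a jittered delay: if the caller
// asked for a whole number of seconds (the common case, including the default
// 2^attempt seconds), round to 10 ms so the logged value stays readable;
// otherwise keep the full precision the caller asked for.
func roundedDelay(requested, jittered time.Duration) time.Duration {
	if requested%time.Second == 0 {
		return jittered.Round(10 * time.Millisecond)
	}
	return jittered
}
```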
Thanks, that is good to know.
I agree that an option for this doesn't add much value, and exposing it to users would just add extra docs that nobody is going to care about. I think most sleep APIs only guarantee that they sleep for at least the requested time anyway. So maybe if we add jitter of 100–105% of the requested time, that would be good enough? And then maybe we could "lie" in the log line and just log what the user requested? Looking at podman tests I see at least 4 checks for the exact timestamp. Sure, we can update them to use a regex too, but it would not look nice. So maybe lying about the exact timestamp might be a good option to avoid end users seeing the jitter? Anyhow, I don't feel strongly for or against this either. Maybe we need some more opinions, @containers/podman-maintainers?
Good point about the lower-bound guarantee. It’s not typical for a log message to say something other than what the code actually does, but in terms of actual promises, it would be perfectly legitimate.
... so that a short failure does not trigger a cascade of synchronized retries at exactly the same time. Generally, delays are always a lower bound (the OS doesn't promise to run non-realtime code at _exactly_ the requested times), so technically we are allowed to add this delay. Signed-off-by: Miloslav Trmač <[email protected]>
Updated to only add to the delay, to not log it in the user output, and to add a debug log line.
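For illustration, the updated behavior might look roughly like the sketch below; the function name, the exact jitter range, the use of logrus, and the debug message wording are assumptions, not the merged code.

```go
package retry

import (
	"math/rand"
	"time"

	"github.com/sirupsen/logrus"
)

// sleepWithJitter waits for at least delay, plus up to roughly 5% extra.
// The user-visible warning still shows the requested delay; the actual
// sleep duration is only recorded at debug level.
func sleepWithJitter(delay time.Duration, attempt, retries int) {
	extra := time.Duration(rand.Int63n(int64(delay)/20 + 1)) // 0 .. 5% of delay
	logrus.Warnf("Failed, retrying in %s ... (%d/%d)", delay, attempt, retries)
	logrus.Debugf("Sleeping for %s (%s requested plus jitter)", delay+extra, delay)
	time.Sleep(delay + extra)
}
```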
giuseppe left a comment:
LGTM
Luap99 left a comment:
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: giuseppe, Luap99, mtrmac, sourcery-ai[bot]. The full list of commands accepted by this bot can be found here; the pull request process is described here.
... so that a short failure does not trigger a cascade of synchronized retries at exactly the same time.
Keep the expected duration of the retry unchanged; vary it 5% up or down.