Add some jitter to pkg/retry #2444
Conversation
Reviewer's Guide: This PR augments the existing retry mechanism by injecting a random jitter of ±5% into each retry delay calculation, preventing synchronized retry storms without altering the overall expected delay duration.
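For context, a symmetric jitter like the one described in the guide could look roughly like this Go sketch; the `addJitter` helper and its placement are illustrative assumptions, not the actual pkg/retry code.

```go
package retry

import (
	"math/rand"
	"time"
)

// addJitter scales delay by a random factor in [0.95, 1.05), so that
// concurrent clients retry at slightly different times while the expected
// value of the delay stays unchanged.
func addJitter(delay time.Duration) time.Duration {
	factor := 0.95 + 0.1*rand.Float64() // uniform in [0.95, 1.05)
	return time.Duration(float64(delay) * factor)
}
```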
Hey @mtrmac - I've reviewed your changes and they look great!
Here's what I looked at during the review
- 🟢 General issues: all looks good
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟢 Documentation: all looks good
Force-pushed from d833c2e to 0edf0bc.
Luap99 left a comment:
Why exactly is this needed?
This option is exposed to end users directly via --retry-delay in the podman CLI and containers.conf; magically changing the value to add jitter seems unexpected. At the very least I would expect the documentation to reflect it, and I would also rather have it be the minimum delay, so that we never wait less than the user asked for.
And I guess this will break the podman tests that match the log message as well.
libimage/manifests/manifests_test.go (outdated)
```diff
-	// Must show warning where libimage is going to retry 5 times with 1s delay
-	assert.Contains(t, logString, "Failed, retrying in 1s ... (1/5)", "warning not matched")
+	// Must show warning where libimage is going to retry 5 times with ~1s delay
+	assert.Regexp(t, `Failed, retrying in (1\.[0-9]*s|[0-9].*ms) ... \(1/5\)`, logString, "warning not matched")
```
This feels aesthetically unappealing to me at the very least. If I set the delay to 1s and it logs some ms I will be somewhat confused.
Yes … OTOH being inaccurate about the delay could be very confusing when analyzing logs; and the API lets users specify the timeout to an arbitrary time.Duration precision.
I think “if the user-specified or default timeout is specified in whole seconds, round to 10 ms” could work well here — almost all users are specifying whole seconds, or using the default ((2^attempt) whole seconds); and the hypothetical extreme users would not lose precision.
Compare the retries / jitter discussion in https://sre.google/sre-book/addressing-cascading-failures/ . I think being considerate to servers would be nice.
I agree that this is unexpected. At the API level, it would be trivial to add an opt-in, but this is mostly controlled directly from the CLI, and I don’t think it makes much sense to make this a user-visible option; I’d recommend opting all 3 callers in. So, upsides / downsides. I don’t feel that strongly about this, and I wouldn’t mind closing this PR. Let’s make this decision, and then I can fine-tune.
That’s fair.
I don’t really have a preference.
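As an illustration of the “round to 10 ms” idea suggested above, a sketch could look like the following; `roundedDelay` is a hypothetical helper name, not something the PR actually adds.

```go
package retry

import "time"

// roundedDelay applies the rounding rule to a jittered delay: if the caller
// asked for a whole number of seconds (the common case, including the default
// 2^attempt seconds), round to 10 ms so the logged value stays readable;
// otherwise keep the full precision the caller asked for.
func roundedDelay(requested, jittered time.Duration) time.Duration {
	if requested%time.Second == 0 {
		return jittered.Round(10 * time.Millisecond)
	}
	return jittered
}
```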
Thanks, that is good to know.
I agree that an option for this doesn't add much value, and exposing it to users would just add extra docs that nobody is going to care about. I think most sleep APIs only guarantee that they sleep for at least the requested time anyway. So maybe if we add jitter of 100–105% of the requested time, that would be good enough? And then maybe we could "lie" in the log line and just log what the user requested? Looking at podman tests I see at least 4 checks for the exact timestamp. Sure, we can update them to use a regex too, but it would not look nice. So maybe lying about the exact timestamp might be a good option to avoid end users seeing the jitter? Anyhow, I don't feel strongly for or against this either. Maybe we need some more opinions, @containers/podman-maintainers?
Good point about the lower-bound guarantee. It’s not typical for a log message to say something other than what the code actually does, but in terms of actual promises, it would be perfectly legitimate.
... so that a short failure does not trigger a cascade of synchronized retries at exactly the same time. Generally, delays are always a lower bound (the OS doesn't promise to run non-realtime code at _exactly_ the requested times), so technically we are allowed to add this delay. Signed-off-by: Miloslav Trmač <[email protected]>
Updated to only add to the delay, to not log it in the user output, and to add a debug log line.
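For illustration, the updated behavior might look roughly like the sketch below; the function name, the exact jitter range, the use of logrus, and the debug message wording are assumptions, not the merged code.

```go
package retry

import (
	"math/rand"
	"time"

	"github.com/sirupsen/logrus"
)

// sleepWithJitter waits for at least delay, plus up to roughly 5% extra.
// The user-visible warning still shows the requested delay; the actual
// sleep duration is only recorded at debug level.
func sleepWithJitter(delay time.Duration, attempt, retries int) {
	extra := time.Duration(rand.Int63n(int64(delay)/20 + 1)) // 0 .. 5% of delay
	logrus.Warnf("Failed, retrying in %s ... (%d/%d)", delay, attempt, retries)
	logrus.Debugf("Sleeping for %s (%s requested plus jitter)", delay+extra, delay)
	time.Sleep(delay + extra)
}
```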
giuseppe left a comment:
LGTM
Luap99 left a comment:
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: giuseppe, Luap99, mtrmac, sourcery-ai[bot]. The full list of commands accepted by this bot can be found here; the pull request process is described here.
... so that a short failure does not trigger a cascade of synchronized retries at exactly the same time.
Keep the expected duration of the retry unchanged; vary it 5% up or down.