
Conversation

Contributor

@mtrmac mtrmac commented May 26, 2025

... so that a short failure does not trigger a cascade of synchronized retries at exactly the same time.

Keep the expected duration of the retry delay unchanged; vary it 5% up or down.

Summary by Sourcery

Enhancements:

  • Introduce ±5% random jitter to calculated retry delays in pkg/retry


sourcery-ai bot commented May 26, 2025

Reviewer's Guide

This PR augments the existing retry mechanism by injecting a random jitter of ±5% into each retry delay calculation, preventing synchronized retry storms without altering the overall expected delay duration.

File-Level Changes

Change: Introduce jitter into retry delay computation
Details:
  • Import math/rand/v2 package
  • Compute jitter as rand.N(delay/10) - delay/20
  • Add jitter to the base delay while preserving its expected value
Files: pkg/retry/retry.go
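
For illustration, the jitter computation described above might look roughly like this in Go (a sketch with a hypothetical helper name; the actual pkg/retry code may differ):

    import (
        "math/rand/v2"
        "time"
    )

    // addJitter spreads the delay by roughly ±5% while keeping its expected
    // value unchanged: rand.N(delay/10) is uniform in [0, delay/10), and
    // subtracting delay/20 centers that range around zero.
    func addJitter(delay time.Duration) time.Duration {
        if delay/10 <= 0 {
            return delay // too small to jitter; rand.N panics on a non-positive bound
        }
        return delay + rand.N(delay/10) - delay/20
    }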


@sourcery-ai sourcery-ai bot left a comment

Hey @mtrmac - I've reviewed your changes and they look great!

Here's what I looked at during the review
  • 🟢 General issues: all looks good
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good


@mtrmac mtrmac force-pushed the jitter branch 2 times, most recently from d833c2e to 0edf0bc on May 26, 2025 20:53
Member

@Luap99 Luap99 left a comment

Why exactly is this needed?

This option is exposed to end users directly via --retry-delay in the podman CLI and containers.conf; I feel like magically changing the value to add jitter seems unexpected. At the very least I would expect the documentation to reflect it, and I would also rather have it be the minimum delay, so that we never wait less than the user asked for.

And I guess this will break the podman tests that match the log message as well:

// Must show warning where libimage is going to retry 5 times with 1s delay
assert.Contains(t, logString, "Failed, retrying in 1s ... (1/5)", "warning not matched")
// Must show warning where libimage is going to retry 5 times with ~1s delay
assert.Regexp(t, `Failed, retrying in (1\.[0-9]*s|[0-9].*ms) ... \(1/5\)`, logString, "warning not matched")
Member

This feels aesthetically unappealing to me, at the very least. If I set the delay to 1s and it logs some value in milliseconds, I will be somewhat confused.

Contributor Author

Yes … OTOH being inaccurate about the delay could be very confusing when analyzing logs; and the API lets users specify the timeout to an arbitrary time.Duration precision.

I think “if the user-specified or default timeout is specified in whole seconds, round to 10 ms” could work well here: almost all users specify whole seconds, or use the default (2^attempt whole seconds), and the hypothetical extreme users would not lose precision.
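
For illustration, that rounding might look roughly like this in Go (a sketch; the variable names and the placement are illustrative, not actual pkg/retry code):

    // If the caller asked for a whole number of seconds, keep the jittered
    // delay readable in logs by rounding it to 10 ms granularity.
    if requested%time.Second == 0 {
        delay = delay.Round(10 * time.Millisecond)
    }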

Contributor Author

mtrmac commented May 27, 2025

Why exactly is this needed?

Compare the retries / jitter discussion in https://sre.google/sre-book/addressing-cascading-failures/ .

I think being considerate to servers would be nice.

This option is exposed to end users directly via --retry-delay in the podman CLI and containers.conf; I feel like magically changing the value to add jitter seems unexpected.

I agree that this is unexpected. At the API level it would be trivial to add an opt-in, but this is mostly controlled directly from the CLI, and I don’t think it makes much sense to make this a user-visible option; I’d recommend opting all 3 callers in.

So, there are upsides and downsides either way. I don’t feel that strongly about this, and I wouldn’t mind closing this PR.

Let’s make this decision, and then I can fine-tune.


At the very least I would expect the documentation to reflect it

That’s fair.

and also I would rather have it be the minimum delay, so that we never wait less than the user asked for.

I don’t really have a preference.

@Luap99
Member

Luap99 commented May 27, 2025

Compare the retries / jitter discussion in https://sre.google/sre-book/addressing-cascading-failures/ .

Thanks, that is good to know.

I agree that this is unexpected. At the API level it would be trivial to add an opt-in, but this is mostly controlled directly from the CLI, and I don’t think it makes much sense to make this a user-visible option; I’d recommend opting all 3 callers in.

I agree that an option for this doesn't add much value and exposing that to users would just add extra docs that nobody is going to care about.

I think most sleep APIs only guarantee that they sleep for at least that time anyway. So maybe if we add jitter of, say, 100-105% of the requested delay, that would be good enough? And then maybe we could "lie" in the log line and just log what the user requested? Looking at the podman tests I see at least 4 that check for the exact delay value. Sure, we can update them to use a regex too, but it would not look nice. So maybe lying about the exact value might be a good option to avoid end users seeing the jitter?

Anyhow I don't feel strongly for or against this either. Maybe we could use some more opinions, @containers/podman-maintainers?

@mtrmac
Contributor Author

mtrmac commented May 27, 2025

I think most sleep APIs always guarantee that it sleeps for a minimum of that time only anyway. So maybe if we add jitter … And then maybe we could "lie" about the log line and just log what the user requested?

Good point about the lower-bound guarantee. It’s not typical for a log message to describe something other than exactly what the code does, but in terms of actual promises it would be perfectly legitimate.

... so that a short failure does not trigger a cascade
of synchronized retries at exactly the same time.

Generally, delays are always a lower bound (the OS doesn't promise
to run non-realtime code at _exactly_ the requested times), so technically
we are allowed to add this delay.

Signed-off-by: Miloslav Trmač <[email protected]>
@mtrmac
Contributor Author

mtrmac commented May 27, 2025

Updated to only add a delay (never subtract one), to not log the jitter in the user-visible output, and to add a debug log line.
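
Roughly, the updated behavior might look like this (a sketch under stated assumptions; the variable names, the +5% bound, and the use of logrus are illustrative, not the exact merged code):

    // The jitter is now purely additive, so the actual wait is never shorter
    // than what the caller asked for, and the user-visible message still
    // reports the requested delay.
    logrus.Warnf("Failed, retrying in %v ... (%d/%d)", delay, attempt, maxRetries)
    extra := time.Duration(0)
    if delay/20 > 0 {
        extra = rand.N(delay / 20) // up to +5% on top of the requested delay
    }
    logrus.Debugf("Adding %v of jitter to the retry delay", extra)
    time.Sleep(delay + extra)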

Member

@giuseppe giuseppe left a comment

LGTM

Member

@Luap99 Luap99 left a comment

/lgtm

@openshift-ci
Contributor

openshift-ci bot commented May 28, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: giuseppe, Luap99, mtrmac, sourcery-ai[bot]

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:
  • OWNERS [Luap99,giuseppe,mtrmac]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 1a3b5ec into containers:main May 28, 2025
14 checks passed
@mtrmac mtrmac deleted the jitter branch May 28, 2025 16:17