feat(bundler): retry sign.Bundle on transient Sigstore failures #1251

Merged: mchmarny merged 1 commit into main from feat/1249-sigstore-retry-budget on Jun 9, 2026

Conversation

@mchmarny

@mchmarny mchmarny commented Jun 9, 2026

Summary

Add bounded exponential-backoff retry around sign.Bundle in the
AICR-side Sigstore signing wrapper, and wrap goreleaser's
cosign attest-blob post-hook in a bash 3-attempt retry loop with
exponential backoff. Absorbs transient Sigstore Rekor flakes
that have been turning ordinary upstream latency into CI failures.

Fixes: #1249
Refs: #1244 (first observed instance), #1245 (second; in review)

Type of Change

  • Bug fix (transient-flake mitigation)

Component(s) Affected

  • pkg/bundler/attestation — retry wrapper
  • pkg/defaults — new retry-budget constants + invariant test
  • .goreleaser.yaml: bash retry loop around cosign attest-blob

Implementation Notes

Failure being absorbed

Two PRs hit identical traces within 24h:

[TIMEOUT] sigstore signing timed out:
  Post "https://rekor.sigstore.dev/api/v1/log/entries":
  giving up after 1 attempt(s): context deadline exceeded

The 1 attempt(s) came from pkg/bundler/attestation/signing.go's
single-pass sign.Bundle() invocation. The wrapper had no retry, so
slow Rekor responses on the only attempt produced terminal failure.

Constants (pkg/defaults/timeouts.go)

| Constant | Value | Rationale |
|---|---|---|
| SigstoreAttemptTimeout | 35s | Bounds one sign.Bundle call so a slow Rekor on one attempt doesn't eat the whole budget |
| SigstoreRetryBudget | 3 | Worst-case budget that absorbs the typical flake window |
| SigstoreRetryInitialBackoff | 1s | Initial gap between attempts 1 and 2 |
| SigstoreRetryBackoffFactor | 5 | Backoffs: 1s, 5s |

Worst-case wall-clock: 3 × 35s + 1s + 5s = 111s — fits inside the
existing SigstoreSignTimeout = 2m ceiling.
TestSigstoreRetryBudgetInvariant guards the math against future
tuning that would overflow.

Retry semantics (signWithRetry helper)

Extracted the sign.Bundle invocation into a helper that wraps any
sign-attempt closure with bounded exponential-backoff retry:

| Class | Behavior |
|---|---|
| Outer ctx DeadlineExceeded | ErrCodeTimeout, no further retries; the whole signing budget is gone |
| Outer ctx Canceled | ErrCodeUnavailable, no retries; caller signaled don't-wait |
| Per-attempt failure with outer ctx alive | Retry until budget is exhausted, then ErrCodeUnavailable wrapping the last error |

Backoff sleep is interruptible by the outer ctx — a Rekor recovering
10s later doesn't waste the remaining budget. The retry treats
transient failures uniformly without trying to parse error text or
HTTP status (which would be brittle); the bounded retry budget caps
the wasted time on a permanent failure (e.g., expired OIDC token) at
~111s.

.goreleaser.yaml

cosign attest-blob is now wrapped in a bash 3-attempt retry loop
with 5s / 10s backoffs. The first revision of this PR tried
cosign attest-blob --retry 5, but that flag does not exist on cosign
v3.0.2 (Rekor retry behavior is internal to the rekor-go client and
not exposed via the cosign CLI), so the first CI run failed with
Error: unknown flag: --retry. The shell-level retry loop is the
equivalent mitigation for the binary-attestation step; the AICR-side
signWithRetry covers the catalog-signing step on the same
post-build hook chain.

Docs

pkg/bundler/attestation/doc.go gains a "Retry Contract" section
documenting the per-attempt / outer-ceiling split, the three
retry-class branches, and the invariant test pointer.

Testing

go test -race -count=1 ./pkg/defaults/ ./pkg/bundler/...
golangci-lint run -c .golangci.yaml ./pkg/defaults/... ./pkg/bundler/...

Five new retry tests in pkg/bundler/attestation/signing_retry_test.go:

| Test | Scenario | Asserts |
|---|---|---|
| _SuccessOnFirstAttempt | Attempt returns success immediately | 1 attempt, no backoff |
| _SuccessAfterTransient | Attempt fails once, succeeds on 2nd | 2 attempts, ~1s elapsed (one backoff honored) |
| _BudgetExhaustion | Attempt always fails | SigstoreRetryBudget attempts, ErrCodeUnavailable, sentinel wrapped in chain |
| _OuterDeadlineExceeded | Outer ctx deadline shorter than first backoff | ErrCodeTimeout, fewer than budget attempts |
| _OuterCanceled | Caller cancels during first backoff | ErrCodeUnavailable, cancel takes precedence over deadline |

Plus TestSigstoreRetryBudgetInvariant in pkg/defaults/.

All tests use t.Parallel(). Full-exhaustion test runs ~6s wall-clock
(real backoffs); others are sub-second.

Risk Assessment

  • Low — isolated to one helper; existing single-pass behavior
    is the new "success on attempt 1" path; outer ceiling unchanged.

Rollout notes: No user-facing CLI/API behavior change. Healthy
Rekor paths see identical latency. Transient Rekor failures that
previously failed the run now succeed after a few seconds of
backoff (visible as slog.Warn lines naming attempt + backoff +
error). A permanent failure (e.g., expired OIDC token, Fulcio 4xx)
takes up to ~111s instead of failing fast — acceptable trade-off
since the error message is unchanged.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (changed paths)
  • I did not skip/disable tests to make CI green
  • I added new tests for the new functionality
  • Docs updated (pkg/bundler/attestation/doc.go)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@mchmarny mchmarny requested review from a team as code owners June 9, 2026 12:03
@mchmarny mchmarny self-assigned this Jun 9, 2026
@github-actions github-actions Bot added size/L and removed area/ci labels Jun 9, 2026
@github-actions

github-actions Bot commented Jun 9, 2026

Coverage Report ✅

| Metric | Value |
|---|---|
| Coverage | 76.3% |
| Threshold | 75% |
| Status | Pass |

Merging this branch will increase overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/NVIDIA/aicr/pkg/bundler/attestation | 78.69% (+0.39%) | 👍 |
| github.com/NVIDIA/aicr/pkg/defaults | 100.00% (ø) | |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/NVIDIA/aicr/pkg/bundler/attestation/doc.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/aicr/pkg/bundler/attestation/signing.go | 87.25% (-1.06%) | 102 (+25) | 89 (+21) | 13 (+4) | 👎 |
| github.com/NVIDIA/aicr/pkg/defaults/timeouts.go | 0.00% (ø) | 0 | 0 | 0 | |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@mchmarny mchmarny force-pushed the feat/1249-sigstore-retry-budget branch 3 times, most recently from 053dbdc to 7e0f35c Compare June 9, 2026 12:20

Closes #1249.

PR #1244's `aicr recipe sign-catalog` post-hook and PR #1245's
`cli-bundle-attestation-ci` chainsaw test both failed with the same
trace:

  [TIMEOUT] sigstore signing timed out:
    Post "https://rekor.sigstore.dev/api/v1/log/entries":
    giving up after 1 attempt(s): context deadline exceeded

The "1 attempt(s)" came from `pkg/bundler/attestation/signing.go`'s
single-pass `sign.Bundle()` call — the wrapper had no retry, so a
slow Rekor response on the only attempt turned ordinary upstream
latency into a CI failure.

Changes:

- pkg/defaults/timeouts.go:
    SigstoreAttemptTimeout (35s) bounds a single sign.Bundle call.
    SigstoreRetryBudget (3) caps total attempts.
    SigstoreRetryInitialBackoff (1s) + SigstoreRetryBackoffFactor (5)
    produce backoffs of 1s, 5s. Worst-case wall-clock (3 × 35s +
    1s + 5s = 111s) fits inside the existing SigstoreSignTimeout
    (2m) ceiling.

- pkg/defaults/timeouts_test.go:
    TestSigstoreRetryBudgetInvariant guards the math against future
    tuning that would overflow SigstoreSignTimeout.

- pkg/bundler/attestation/signing.go:
    Extracted the sign.Bundle invocation into signWithRetry, a
    bounded exponential-backoff retry helper. Retry semantics:
    - outer ctx DeadlineExceeded → ErrCodeTimeout, no retry.
    - outer ctx Canceled → ErrCodeUnavailable, no retry.
    - per-attempt failure with outer ctx alive → retry until budget
      exhausted, then ErrCodeUnavailable wrapping the last error.
    Backoff sleep is interruptible by the outer ctx — a slow Rekor
    recovering 10s later doesn't waste the remaining budget.

- pkg/bundler/attestation/signing_retry_test.go:
    Five tests: success-on-first, success-after-transient (verifies
    one backoff is honored), budget-exhaustion (counts attempts +
    asserts ErrCodeUnavailable + wrapped sentinel), outer-deadline
    (asserts ErrCodeTimeout + retry short-circuits), outer-cancel
    (asserts ErrCodeUnavailable). Uses real timing; full-exhaustion
    test runs ~6s. All run in parallel.

- .goreleaser.yaml:
    cosign attest-blob is wrapped in a bash 3-attempt retry loop
    with 5s / 10s backoffs (cosign v3.0.2 has no --retry flag).
    Costs nothing on a healthy Rekor; saves an entire release run
    when Rekor is slow.

- pkg/bundler/attestation/doc.go:
    New "Retry Contract" section documents the per-attempt /
    outer-ceiling split, the three retry-class branches, and the
    invariant test pointer.

Refs #1244 (first observed instance), #1245 (second instance, in
review at time of merge).
@mchmarny mchmarny force-pushed the feat/1249-sigstore-retry-budget branch from 746e407 to 5d04d41 Compare June 9, 2026 12:35
@mchmarny mchmarny merged commit b847f96 into main Jun 9, 2026
31 of 32 checks passed
@mchmarny mchmarny deleted the feat/1249-sigstore-retry-budget branch June 9, 2026 12:44

Development

Successfully merging this pull request may close these issues.

Bump cosign / sign.Bundle retry budget to absorb transient Rekor flakes

2 participants