Skip to content

feat(recipes): deepen 21 chainsaw health checks; close epic #660#1245

Merged
mchmarny merged 2 commits into
mainfrom
feat/1222-enhance-existing-checks
Jun 9, 2026
Merged

feat(recipes): deepen 21 chainsaw health checks; close epic #660#1245
mchmarny merged 2 commits into
mainfrom
feat/1222-enhance-existing-checks

Conversation

@mchmarny

@mchmarny mchmarny commented Jun 9, 2026

Copy link
Copy Markdown
Member

Summary

Closes #1222. With this merged, epic #660 is fully done.

Audit + deepen 21 chainsaw health checks that landed before the epic
hit "every check is deep" depth. The PR-time runtime contract from
#1219 / #1220 / #1223 is in place; this PR closes the
depth-of-coverage gap on the actual assertions inside each
recipes/checks/*/health-check.yaml.

Fixes: #1222
Closes: #1248 (NFD numberUnavailable omitempty bug — the core fix was
landed pre-PR via #1235's d7e68a24 patch; this PR's broader NFD
touch reinforces with the projection-form container-state fix
below).
Closes epic: #660

Post-merge review fixes (@ArangoGutierrez)

After the initial 21-file enhancement landed, a follow-up review
surfaced a load-bearing correctness gap and several smaller items.
All three are now addressed in this PR:

Multi-container projection fix (was: single-element silently passes sidecars)

The pre-fix container-state error blocks used single-element list
match:

status:
  containerStatuses:
    - state: { waiting: { reason: CrashLoopBackOff } }

kyverno-json's sliceNode.Assert
(vendor/github.com/kyverno/kyverno-json/pkg/core/assertion/assertion.go:43)
gates on exact slice length BEFORE comparing elements, so this only
matches pods with exactly one container. Multi-container pods
(sidecar / proxy-injected / init+main) silently passed.

Fixed via JMESPath projection across all 22 files (21 in this PR

  • NFD's pre-existing container-state blocks + grove's):
status:
  (containerStatuses[?state.waiting.reason == 'CrashLoopBackOff' || ... ] | length(@) > `0`): true

Now fires on any container at any index. Each file also gains a
parallel initContainerStatuses block — init-container
failures (e.g., nvidia-operator-validator's 4 init containers)
are invisible to containerStatuses and were silently missed by
the prior pattern. Inline rationale comment added to every block.

NFD cross-reference clarification

The aws-efa and nvidia-dra-driver-gpu DaemonSet-fix headers cite
NFD as the exemplar for the numberReady == desiredNumberScheduled
pattern. ArangoGutierrez's review note flagged this as misleading
because NFD's pre-#1235 state used numberUnavailable == 0 (the
omitempty trap that #1248 calls out). NFD on this branch (post-#1235's
d7e68a24 patch from @njhensley) uses the correct
numberReady == desiredNumberScheduled pattern — verified in
recipes/checks/nfd/health-check.yaml. The cross-reference is
accurate as of the current head. #1248's core ask is therefore
already resolved in main; this PR additionally fixes NFD's
container-state blocks via the projection form above.

Empty-prefix DataProvider consistency (follow-up)

validators/chainsaw/allowlist_test.go now uses "" (matching the
canonical defaultEmbeddedProvider); pkg/recipe/provider_test.go
still uses "." at multiple sites — same behavior (filepath.Join
cleans both) but mixed style. Out of scope for this PR; tracked
separately if it becomes a maintenance issue.

Audit

Component Pre-PR depth Action
agentgateway-crds Full 10 CRDs Established=True skip (#1243)
agentgateway Deploy + pod-phase +container-state
aws-ebs-csi-driver Deploy + label pod-phase +container-state (label-scoped)
aws-efa DS numberReady > 0 PARTIAL + label pod-phase fix to full rollout + container-state
cert-manager 1 Deploy + pod-phase + 2 missing Deploys + 3 CRDs + container-state
dynamo-platform Deploy + pod-phase + DCD CRD + container-state
gatekeeper 2 Deploys + pod-phase +container-state
gke-nccl-tcpxo Full DS rollout skip (#1243)
gpu-operator Deploy + pod-phase + ClusterPolicy CRD + cluster-policy CR state=ready + container-state
grove Deploy Available + 1 container-state +3 missing container-state
k8s-ephemeral-storage-metrics Deploy + label pod-phase +container-state (label-scoped)
k8s-nim-operator Deploy + pod-phase +container-state
kai-scheduler Deploy + pod-phase +container-state
kube-prometheus-stack Deploy + label pod-phase +container-state (label-scoped)
kubeflow-trainer Deploy + pod-phase +container-state
kueue Deploy + pod-phase +container-state
network-operator Deploy + pod-phase +container-state
nfd Full skip (#1234)
nodewright-customizations Skyhook CR status=complete skip (deep CR state is the right signal)
nodewright-operator Deploy + pod-phase + Skyhook CRD + container-state
nvidia-dra-driver-gpu DS numberReady > 0 PARTIAL + pod-phase fix to full rollout + container-state
nvsentinel Deploy + pod-phase +container-state
prometheus-adapter Deploy + label pod-phase +container-state (label-scoped)
prometheus-operator-crds Full 8 CRDs Established=True skip (#1243)
slinky-slurm 4 CRs + 3 Deploys + StatefulSet + NodeSet + pod-phase +container-state
slinky-slurm-operator 2 Deploys + pod-phase +container-state
slinky-slurm-operator-crds Full 6 CRDs Established=True skip (#1243)

Files touched: 21 (6 untouched: 3 CRD-only that are already deep,

  • NFD/gke-nccl-tcpxo/nodewright-customizations that already have the
    right signal).

Headline Enhancements

1. DaemonSet partial → full rollout (2 files)

aws-efa and nvidia-dra-driver-gpu were gating on numberReady > 0,
which passed on partial failures. Switched to numberReady == desiredNumberScheduled with desiredNumberScheduled > 0 as the
vacuous-pass guard — same pattern as the NFD fix in #1234 and the new
gke-nccl-tcpxo in #1243. (We don't use numberUnavailable == 0
because that field is omitempty and disappears when zero — null == 0 evaluates false on a fully-healthy DaemonSet.)

2. Owned-CRD Established=True assertions (4 components)

Component CRDs asserted
gpu-operator clusterpolicies.nvidia.com
dynamo-platform dynamocomponentdeployments.nvidia.com
nodewright-operator skyhooks.skyhook.nvidia.com
cert-manager certificates.cert-manager.io, issuers.cert-manager.io, clusterissuers.cert-manager.io

CRD names verified against
tests/chainsaw/ai-conformance/cluster/assert-crds.yaml and the Go
verifyClusterPolicyReady check.

3. gpu-operator: ClusterPolicy state=ready migration

gpu-operator/health-check.yaml now asserts the singleton
cluster-policy CR has status.state == "ready". This migrates the
verifyClusterPolicyReady Go check that PR #1235 task 7 deferred for
assertion-equivalence proof.
The static Chainsaw assertion is now
the canonical readiness signal for the full GPU stack (driver +
toolkit + cuda + plugin DaemonSets + validators all healthy →
operator reconciles ClusterPolicy.status.state to "ready").

4. cert-manager scope fix (real bug)

The chart deploys three Deployments: cert-manager (controller),
cert-manager-cainjector, cert-manager-webhook. The previous
check asserted only the controller. The webhook in particular is
load-bearing for admission: a healthy controller with a dead webhook
silently passes the check while the cluster rejects every
Certificate / Issuer CR. Now asserts all three.

5. Universal container-state error coverage (21 files)

Every check that previously only gated on pod phase (Pending /
Failed / Unknown) now also rejects pods stuck in Running phase with
a known-bad container-waiting reason:

  • CrashLoopBackOff
  • ImagePullBackOff
  • ErrImagePull
  • CreateContainerConfigError

The phase-only filter missed CrashLoopBackOff entirely because such
pods typically remain in Running phase with unhealthy containers.
Same pattern NFD adopted via #1234. Label-scoped in shared
namespaces (kube-system, monitoring) to avoid sibling false-matches;
namespace-only otherwise.

Also Addresses Post-Merge Review on PR #1244

validators/chainsaw/allowlist_test.go:

  • Provider prefix changed from "." to "" to match the canonical
    defaultEmbeddedProvider used at runtime. No behavior change
    (filepath.Join cleans both forms), but matches the runtime path
    exactly without the head-scratch.
  • Added a CROSS-TEST DEPENDENCY comment block noting that
    empty-assertFile and unreadable-path failure modes are owned by
    pkg/recipe.TestComponentRegistry_RequiresHealthCheck. If that
    test is ever removed or weakened, this test must be strengthened
    in lockstep so a broken registry can't silently green-light
    here.

Out of Scope

Deferred to a possible follow-up:

  • CRD Established=True for 6 more operators that own CRDs:
    gatekeeper, kueue, kubeflow-trainer, network-operator,
    k8s-nim-operator, slinky-slurm-operator. Each needs
    verification against its pinned chart version; bundling them
    blindly would risk introducing typo'd CRD names that fail on live
    clusters.
  • StatefulSet readiness for kube-prometheus-stack (Prometheus /
    Alertmanager StatefulSets shipped by the stack): chart's exact
    StatefulSet names and label conventions need verification.

Testing

go test -race -count=1 ./pkg/recipe/ ./validators/chainsaw/ ./validators/deployment/ ./pkg/defaults/ ./pkg/validator/catalog/
golangci-lint run -c .golangci.yaml ./pkg/recipe/... ./validators/...
  • TestComponentRegistry_RequiresHealthCheck (pkg/recipe): 27/27
    components pass — every entry declares assertFile + path resolves.
  • TestValidateTestReadOnly_RegistryContent (validators/chainsaw):
    walks every registry-declared assertFile and validates against the
    read-only allowlist. 27/27 pass.
  • All in-tree tests still pass.
  • Lint clean.

Live-cluster validation deferred to @mchmarny per the same
agreement as #1235 / #1243.
Priority verification targets for
this PR specifically:

  1. gpu-operator's ClusterPolicy.status.state == "ready" — the
    load-bearing assertion that migrates the Go check. If it
    diverges from the Go check's behavior on real clusters, the
    migration needs adjusting.
  2. cert-manager's 3 Deployments — confirm cert-manager-cainjector
    and cert-manager-webhook are the actual names produced by the
    chart's fullname template (vs. release-prefixed variants).
  3. The 4 owned-CRD assertions — confirm each CRD name exists
    exactly as asserted on a live cluster.

Risk Assessment

  • Medium — 21 file changes; semantically the depth bump is a
    real correctness improvement; isolated per-file revert is trivial.

Rollout notes: After merge, the deployment-phase chainsaw runner
exercises the deeper assertions on every aicr validate --phase deployment run. A component that previously passed the existence
check but had a partially-healthy DaemonSet (or a CRD that isn't
Established yet, or a webhook Deployment down) now correctly
reports the failure. The fail-loud direction is the goal — see
"Live-cluster validation" above for the verification targets.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (changed paths)
  • I did not skip/disable tests to make CI green
  • No new tests added — the existing two lint guards
    (TestComponentRegistry_RequiresHealthCheck +
    TestValidateTestReadOnly_RegistryContent) already cover the
    validation; this PR is the registry content they validate
  • No user-facing CLI/API behavior change
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@coderabbitai

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Coverage Report ✅

Metric Value
Coverage 76.3%
Threshold 75%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-76.3%25-green)

No Go source files changed in this PR.

@mchmarny mchmarny force-pushed the feat/1222-enhance-existing-checks branch from f8e9ca3 to 7e13be4 Compare June 9, 2026 11:12
coderabbitai[bot]

This comment was marked as resolved.

ArangoGutierrez

This comment was marked as resolved.

Closes #1222. With this merged, epic #660 is fully done.

The PR-time runtime contract from #1219/#1220/#1223 is in place; this
PR closes the depth-of-coverage gap on the 21 health checks that
landed before the epic and were shallow by today's standards. Per the
audit in the PR body, three classes of enhancement:

1. DaemonSet partial-readiness fix (2 files):
   aws-efa, nvidia-dra-driver-gpu were gating on `numberReady > 0`,
   which passed on partial failures. Switched to
   `numberReady == desiredNumberScheduled` with
   `desiredNumberScheduled > 0` as the vacuous-pass guard — same
   pattern that fixed NFD in #1234 and the new gke-nccl-tcpxo in
   #1243.

2. Owned-CRD Established=True assertions (4 components):
   - gpu-operator: clusterpolicies.nvidia.com + the singleton
     `cluster-policy` CR with status.state == "ready". This
     MIGRATES the verifyClusterPolicyReady Go check that PR #1235
     task 7 deferred for assertion-equivalence proof; the static
     Chainsaw assertion is now the canonical readiness signal.
   - dynamo-platform: dynamocomponentdeployments.nvidia.com.
   - nodewright-operator: skyhooks.skyhook.nvidia.com.
   - cert-manager: certificates / issuers / clusterissuers
     .cert-manager.io.
   CRD names verified against
   tests/chainsaw/ai-conformance/cluster/assert-crds.yaml and the
   Go verifyClusterPolicyReady check.

3. cert-manager scope fix:
   Previously asserted only the controller Deployment, missing
   cainjector + webhook. The webhook in particular is load-bearing
   for admission of any Certificate/Issuer CR; a healthy controller
   with a dead webhook would silently pass while the cluster rejects
   every cert-manager CR. Now asserts all three Deployments.

4. Universal container-state error coverage (21 files):
   Every check that previously only gated on pod phase (Pending /
   Failed / Unknown) now also rejects pods stuck in Running phase
   with a known-bad container-waiting reason (CrashLoopBackOff,
   ImagePullBackOff, ErrImagePull, CreateContainerConfigError). The
   phase-only filter missed CrashLoopBackOff entirely because such
   pods typically remain in Running phase with unhealthy
   containers. Same pattern NFD adopted via #1234. Label-scoped in
   shared namespaces (kube-system, monitoring) to avoid sibling
   false-matches; namespace-only otherwise.

Also addresses two post-merge review comments on PR #1244:

- validators/chainsaw/allowlist_test.go: provider prefix changed
  from "." to "" to match the canonical defaultEmbeddedProvider
  used at runtime (head-scratch elimination, no behavior change —
  filepath.Join cleans both forms equivalently).
- Same file: added an explicit CROSS-TEST DEPENDENCY comment block
  noting that empty-assertFile and unreadable-path failure modes
  are owned by pkg/recipe.TestComponentRegistry_RequiresHealthCheck.
  If that test is ever removed or weakened, this test must be
  strengthened in lockstep so a broken registry can't silently
  green-light here.

Out of scope (deferred to follow-up work):

- CRD Established=True for the remaining 6 operators that own CRDs
  but where I lack high-confidence enumeration without pulling the
  charts: gatekeeper, kueue, kubeflow-trainer, network-operator,
  k8s-nim-operator, slinky-slurm-operator. Each needs verification
  against its pinned chart version; bundling them blindly would
  risk introducing typo'd names that fail on live clusters.
  Tracking as a possible follow-up after this PR proves the
  audit-and-enhance pattern works end-to-end.
- StatefulSet readiness for kube-prometheus-stack's Prometheus /
  Alertmanager: similar reasoning — the chart's exact StatefulSet
  names and label conventions need verification.

Allowlist sweep (TestValidateTestReadOnly_RegistryContent) +
registry lint guard (TestComponentRegistry_RequiresHealthCheck)
both pass: 27/27 components valid. Live-cluster validation
deferred to mchmarny per the same agreement as #1235 / #1243.

Refs #660 (epic), #1219, #1220, #1221, #1223, #1234.
coderabbitai[bot]

This comment was marked as resolved.

mchmarny added a commit that referenced this pull request Jun 9, 2026
Closes #1249.

PR #1244's `aicr recipe sign-catalog` post-hook and PR #1245's
`cli-bundle-attestation-ci` chainsaw test both failed with the same
trace:

  [TIMEOUT] sigstore signing timed out:
    Post "https://rekor.sigstore.dev/api/v1/log/entries":
    giving up after 1 attempt(s): context deadline exceeded

The "1 attempt(s)" came from `pkg/bundler/attestation/signing.go`'s
single-pass `sign.Bundle()` call — the wrapper had no retry, so a
slow Rekor response on the only attempt turned ordinary upstream
latency into a CI failure.

Changes:

- pkg/defaults/timeouts.go:
    SigstoreAttemptTimeout (35s) bounds a single sign.Bundle call.
    SigstoreRetryBudget (3) caps total attempts.
    SigstoreRetryInitialBackoff (1s) + SigstoreRetryBackoffFactor (5)
    produce backoffs of 1s, 5s. Worst-case wall-clock (3 × 35s +
    1s + 5s = 111s) fits inside the existing SigstoreSignTimeout
    (2m) ceiling.

- pkg/defaults/timeouts_test.go:
    TestSigstoreRetryBudgetInvariant guards the math against future
    tuning that would overflow SigstoreSignTimeout.

- pkg/bundler/attestation/signing.go:
    Extracted the sign.Bundle invocation into signWithRetry, a
    bounded exponential-backoff retry helper. Retry semantics:
    - outer ctx DeadlineExceeded → ErrCodeTimeout, no retry.
    - outer ctx Canceled → ErrCodeUnavailable, no retry.
    - per-attempt failure with outer ctx alive → retry until budget
      exhausted, then ErrCodeUnavailable wrapping the last error.
    Backoff sleep is interruptible by the outer ctx — a slow Rekor
    recovering 10s later doesn't waste the remaining budget.

- pkg/bundler/attestation/signing_retry_test.go:
    Five tests: success-on-first, success-after-transient (verifies
    one backoff is honored), budget-exhaustion (counts attempts +
    asserts ErrCodeUnavailable + wrapped sentinel), outer-deadline
    (asserts ErrCodeTimeout + retry short-circuits), outer-cancel
    (asserts ErrCodeUnavailable). Uses real timing; full-exhaustion
    test runs ~6s. All run in parallel.

- .goreleaser.yaml:
    cosign attest-blob now passes --retry 5 (matches cosign's
    documented default backoff). Costs nothing on a healthy Rekor;
    absorbs an entire release run when Rekor is slow.

- pkg/bundler/attestation/doc.go:
    New "Retry Contract" section documents the per-attempt /
    outer-ceiling split, the three retry-class branches, and the
    invariant test pointer.

Refs #1244 (first observed instance), #1245 (second instance, in
review at time of merge).
mchmarny added a commit that referenced this pull request Jun 9, 2026
Closes #1249.

PR #1244's `aicr recipe sign-catalog` post-hook and PR #1245's
`cli-bundle-attestation-ci` chainsaw test both failed with the same
trace:

  [TIMEOUT] sigstore signing timed out:
    Post "https://rekor.sigstore.dev/api/v1/log/entries":
    giving up after 1 attempt(s): context deadline exceeded

The "1 attempt(s)" came from `pkg/bundler/attestation/signing.go`'s
single-pass `sign.Bundle()` call — the wrapper had no retry, so a
slow Rekor response on the only attempt turned ordinary upstream
latency into a CI failure.

Changes:

- pkg/defaults/timeouts.go:
    SigstoreAttemptTimeout (35s) bounds a single sign.Bundle call.
    SigstoreRetryBudget (3) caps total attempts.
    SigstoreRetryInitialBackoff (1s) + SigstoreRetryBackoffFactor (5)
    produce backoffs of 1s, 5s. Worst-case wall-clock (3 × 35s +
    1s + 5s = 111s) fits inside the existing SigstoreSignTimeout
    (2m) ceiling.

- pkg/defaults/timeouts_test.go:
    TestSigstoreRetryBudgetInvariant guards the math against future
    tuning that would overflow SigstoreSignTimeout.

- pkg/bundler/attestation/signing.go:
    Extracted the sign.Bundle invocation into signWithRetry, a
    bounded exponential-backoff retry helper. Retry semantics:
    - outer ctx DeadlineExceeded → ErrCodeTimeout, no retry.
    - outer ctx Canceled → ErrCodeUnavailable, no retry.
    - per-attempt failure with outer ctx alive → retry until budget
      exhausted, then ErrCodeUnavailable wrapping the last error.
    Backoff sleep is interruptible by the outer ctx — a slow Rekor
    recovering 10s later doesn't waste the remaining budget.

- pkg/bundler/attestation/signing_retry_test.go:
    Five tests: success-on-first, success-after-transient (verifies
    one backoff is honored), budget-exhaustion (counts attempts +
    asserts ErrCodeUnavailable + wrapped sentinel), outer-deadline
    (asserts ErrCodeTimeout + retry short-circuits), outer-cancel
    (asserts ErrCodeUnavailable). Uses real timing; full-exhaustion
    test runs ~6s. All run in parallel.

- .goreleaser.yaml:
    cosign attest-blob now passes --retry 5 (matches cosign's
    documented default backoff). Costs nothing on a healthy Rekor;
    absorbs an entire release run when Rekor is slow.

- pkg/bundler/attestation/doc.go:
    New "Retry Contract" section documents the per-attempt /
    outer-ceiling split, the three retry-class branches, and the
    invariant test pointer.

Refs #1244 (first observed instance), #1245 (second instance, in
review at time of merge).
mchmarny added a commit that referenced this pull request Jun 9, 2026
Closes #1249.

PR #1244's `aicr recipe sign-catalog` post-hook and PR #1245's
`cli-bundle-attestation-ci` chainsaw test both failed with the same
trace:

  [TIMEOUT] sigstore signing timed out:
    Post "https://rekor.sigstore.dev/api/v1/log/entries":
    giving up after 1 attempt(s): context deadline exceeded

The "1 attempt(s)" came from `pkg/bundler/attestation/signing.go`'s
single-pass `sign.Bundle()` call — the wrapper had no retry, so a
slow Rekor response on the only attempt turned ordinary upstream
latency into a CI failure.

Changes:

- pkg/defaults/timeouts.go:
    SigstoreAttemptTimeout (35s) bounds a single sign.Bundle call.
    SigstoreRetryBudget (3) caps total attempts.
    SigstoreRetryInitialBackoff (1s) + SigstoreRetryBackoffFactor (5)
    produce backoffs of 1s, 5s. Worst-case wall-clock (3 × 35s +
    1s + 5s = 111s) fits inside the existing SigstoreSignTimeout
    (2m) ceiling.

- pkg/defaults/timeouts_test.go:
    TestSigstoreRetryBudgetInvariant guards the math against future
    tuning that would overflow SigstoreSignTimeout.

- pkg/bundler/attestation/signing.go:
    Extracted the sign.Bundle invocation into signWithRetry, a
    bounded exponential-backoff retry helper. Retry semantics:
    - outer ctx DeadlineExceeded → ErrCodeTimeout, no retry.
    - outer ctx Canceled → ErrCodeUnavailable, no retry.
    - per-attempt failure with outer ctx alive → retry until budget
      exhausted, then ErrCodeUnavailable wrapping the last error.
    Backoff sleep is interruptible by the outer ctx — a slow Rekor
    recovering 10s later doesn't waste the remaining budget.

- pkg/bundler/attestation/signing_retry_test.go:
    Five tests: success-on-first, success-after-transient (verifies
    one backoff is honored), budget-exhaustion (counts attempts +
    asserts ErrCodeUnavailable + wrapped sentinel), outer-deadline
    (asserts ErrCodeTimeout + retry short-circuits), outer-cancel
    (asserts ErrCodeUnavailable). Uses real timing; full-exhaustion
    test runs ~6s. All run in parallel.

- .goreleaser.yaml:
    cosign attest-blob now passes --retry 5 (matches cosign's
    documented default backoff). Costs nothing on a healthy Rekor;
    absorbs an entire release run when Rekor is slow.

- pkg/bundler/attestation/doc.go:
    New "Retry Contract" section documents the per-attempt /
    outer-ceiling split, the three retry-class branches, and the
    invariant test pointer.

Refs #1244 (first observed instance), #1245 (second instance, in
review at time of merge).
mchmarny added a commit that referenced this pull request Jun 9, 2026
Closes #1249.

PR #1244's `aicr recipe sign-catalog` post-hook and PR #1245's
`cli-bundle-attestation-ci` chainsaw test both failed with the same
trace:

  [TIMEOUT] sigstore signing timed out:
    Post "https://rekor.sigstore.dev/api/v1/log/entries":
    giving up after 1 attempt(s): context deadline exceeded

The "1 attempt(s)" came from `pkg/bundler/attestation/signing.go`'s
single-pass `sign.Bundle()` call — the wrapper had no retry, so a
slow Rekor response on the only attempt turned ordinary upstream
latency into a CI failure.

Changes:

- pkg/defaults/timeouts.go:
    SigstoreAttemptTimeout (35s) bounds a single sign.Bundle call.
    SigstoreRetryBudget (3) caps total attempts.
    SigstoreRetryInitialBackoff (1s) + SigstoreRetryBackoffFactor (5)
    produce backoffs of 1s, 5s. Worst-case wall-clock (3 × 35s +
    1s + 5s = 111s) fits inside the existing SigstoreSignTimeout
    (2m) ceiling.

- pkg/defaults/timeouts_test.go:
    TestSigstoreRetryBudgetInvariant guards the math against future
    tuning that would overflow SigstoreSignTimeout.

- pkg/bundler/attestation/signing.go:
    Extracted the sign.Bundle invocation into signWithRetry, a
    bounded exponential-backoff retry helper. Retry semantics:
    - outer ctx DeadlineExceeded → ErrCodeTimeout, no retry.
    - outer ctx Canceled → ErrCodeUnavailable, no retry.
    - per-attempt failure with outer ctx alive → retry until budget
      exhausted, then ErrCodeUnavailable wrapping the last error.
    Backoff sleep is interruptible by the outer ctx — a slow Rekor
    recovering 10s later doesn't waste the remaining budget.

- pkg/bundler/attestation/signing_retry_test.go:
    Five tests: success-on-first, success-after-transient (verifies
    one backoff is honored), budget-exhaustion (counts attempts +
    asserts ErrCodeUnavailable + wrapped sentinel), outer-deadline
    (asserts ErrCodeTimeout + retry short-circuits), outer-cancel
    (asserts ErrCodeUnavailable). Uses real timing; full-exhaustion
    test runs ~6s. All run in parallel.

- .goreleaser.yaml:
    cosign attest-blob now passes --retry 5 (matches cosign's
    documented default backoff). Costs nothing on a healthy Rekor;
    absorbs an entire release run when Rekor is slow.

- pkg/bundler/attestation/doc.go:
    New "Retry Contract" section documents the per-attempt /
    outer-ceiling split, the three retry-class branches, and the
    invariant test pointer.

Refs #1244 (first observed instance), #1245 (second instance, in
review at time of merge).
@mchmarny mchmarny merged commit 81daab3 into main Jun 9, 2026
117 checks passed
@mchmarny mchmarny deleted the feat/1222-enhance-existing-checks branch June 9, 2026 14:08
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 9, 2026
Deployment-phase health-check validation regressed in the recent
in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252):

1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY
   non-nil error, so a permanent JMESPath evaluation error was retried
   for the entire assert window (6m) instead of failing fast like the
   old chainsaw binary did. Classify assertion-engine errors as
   ErrCodeInvalidRequest (terminal) and short-circuit the retry loops
   on them. Adds a regression test asserting terminal eval errors
   fail fast (< AssertRetryInterval) rather than retrying to deadline.

2. recipes/checks/*: the (init)containerStatuses[?...] | length(@)
   projection throws 'invalid type for: <nil>' on any pod without
   (init) containers (the common case), feeding defect #1. Coalesce to
   an empty array across all 22 affected component health checks.

3. recipes/checks/dynamo-platform: validate-deployment-exists asserted
   a Deployment named 'dynamo-operator', but the chart prefixes it with
   the release name (dynamo-platform-dynamo-operator-controller-manager),
   so the assert never matched and retried to the deadline. Match by the
   stable app.kubernetes.io/name=dynamo-operator label instead.

Verified on an EKS H100 cluster: deployment phase goes from an 8m
timeout (status=other) to PASSED in ~21s; full deployment 4/4 and
conformance 11/11 green.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 9, 2026
Deployment-phase health-check validation regressed in the recent
in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252):

1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY
   non-nil error, so a permanent JMESPath evaluation error was retried
   for the entire assert window (6m) instead of failing fast like the
   old chainsaw binary did. Classify assertion-engine errors as
   ErrCodeInvalidRequest (terminal) and short-circuit the retry loops
   on them. Adds a regression test asserting terminal eval errors
   fail fast (< AssertRetryInterval) rather than retrying to deadline.

2. recipes/checks/*: the (init)containerStatuses[?...] | length(@)
   projection throws 'invalid type for: <nil>' on any pod without
   (init) containers (the common case), feeding defect #1. Coalesce to
   an empty array across all 22 affected component health checks.

3. recipes/checks/dynamo-platform: validate-deployment-exists asserted
   a Deployment named 'dynamo-operator', but the chart prefixes it with
   the release name (dynamo-platform-dynamo-operator-controller-manager),
   so the assert never matched and retried to the deadline. Match by the
   stable app.kubernetes.io/name=dynamo-operator label instead.

Verified on an EKS H100 cluster: deployment phase goes from an 8m
timeout (status=other) to PASSED in ~21s; full deployment 4/4 and
conformance 11/11 green.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 9, 2026
Deployment-phase health-check validation regressed in the recent
in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252):

1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY
   non-nil error, so a permanent JMESPath evaluation error was retried
   for the entire assert window (6m) instead of failing fast like the
   old chainsaw binary did. Classify assertion-engine errors as
   ErrCodeInvalidRequest (terminal) and short-circuit the retry loops
   on them. Adds a regression test asserting terminal eval errors
   fail fast (< AssertRetryInterval) rather than retrying to deadline.

2. recipes/checks/*: the (init)containerStatuses[?...] | length(@)
   projection throws 'invalid type for: <nil>' on any pod without
   (init) containers (the common case), feeding defect #1. Coalesce to
   an empty array across all 22 affected component health checks.

3. recipes/checks/dynamo-platform: validate-deployment-exists asserted
   a Deployment named 'dynamo-operator', but the chart prefixes it with
   the release name (dynamo-platform-dynamo-operator-controller-manager),
   so the assert never matched and retried to the deadline. Match by the
   stable app.kubernetes.io/name=dynamo-operator label instead.

Verified on an EKS H100 cluster: deployment phase goes from an 8m
timeout (status=other) to PASSED in ~21s; full deployment 4/4 and
conformance 11/11 green.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 9, 2026
Deployment-phase health-check validation regressed in the recent
in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252):

1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY
   non-nil error, so a permanent JMESPath evaluation error was retried
   for the entire assert window (6m) instead of failing fast like the
   old chainsaw binary did. Classify assertion-engine errors as
   ErrCodeInvalidRequest (terminal) and short-circuit the retry loops
   on them. Adds a regression test asserting terminal eval errors
   fail fast (< AssertRetryInterval) rather than retrying to deadline.

2. recipes/checks/*: the (init)containerStatuses[?...] | length(@)
   projection throws 'invalid type for: <nil>' on any pod without
   (init) containers (the common case), feeding defect #1. Coalesce to
   an empty array across all 22 affected component health checks.

3. recipes/checks/dynamo-platform: validate-deployment-exists asserted
   a Deployment named 'dynamo-operator', but the chart prefixes it with
   the release name (dynamo-platform-dynamo-operator-controller-manager),
   so the assert never matched and retried to the deadline. Match by the
   stable app.kubernetes.io/name=dynamo-operator label instead.

Verified on an EKS H100 cluster: deployment phase goes from an 8m
timeout (status=other) to PASSED in ~21s; full deployment 4/4 and
conformance 11/11 green.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 9, 2026
Deployment-phase health-check validation regressed in the recent
in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252):

1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY
   non-nil error, so a permanent JMESPath evaluation error was retried
   for the entire assert window (6m) instead of failing fast like the
   old chainsaw binary did. Classify assertion-engine errors as
   ErrCodeInvalidRequest (terminal) and short-circuit the retry loops
   on them. Adds a regression test asserting terminal eval errors
   fail fast (< AssertRetryInterval) rather than retrying to deadline.

2. recipes/checks/*: the (init)containerStatuses[?...] | length(@)
   projection throws 'invalid type for: <nil>' on any pod without
   (init) containers (the common case), feeding defect #1. Coalesce to
   an empty array across all 22 affected component health checks.

3. recipes/checks/dynamo-platform: validate-deployment-exists asserted
   a Deployment named 'dynamo-operator', but the chart prefixes it with
   the release name (dynamo-platform-dynamo-operator-controller-manager),
   so the assert never matched and retried to the deadline. Match by the
   stable app.kubernetes.io/name=dynamo-operator label instead.

Verified on an EKS H100 cluster: deployment phase goes from an 8m
timeout (status=other) to PASSED in ~21s; full deployment 4/4 and
conformance 11/11 green.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

3 participants