feat(recipes): deepen 21 chainsaw health checks; close epic #660 by mchmarny · Pull Request #1245 · NVIDIA/aicr

mchmarny · 2026-06-09T10:56:51Z

Summary

Closes #1222. With this merged, epic #660 is fully done.

Audit + deepen 21 chainsaw health checks that landed before the epic
hit "every check is deep" depth. The PR-time runtime contract from
#1219 / #1220 / #1223 is in place; this PR closes the
depth-of-coverage gap on the actual assertions inside each
recipes/checks/*/health-check.yaml.

Fixes: #1222
Closes: #1248 (NFD numberUnavailable omitempty bug — the core fix was
landed pre-PR via #1235's d7e68a24 patch; this PR's broader NFD
touch reinforces with the projection-form container-state fix
below).
Closes epic: #660

Post-merge review fixes (@ArangoGutierrez)

After the initial 21-file enhancement landed, a follow-up review
surfaced a load-bearing correctness gap and several smaller items.
All three are now addressed in this PR:

Multi-container projection fix (was: single-element silently passes sidecars)

The pre-fix container-state error blocks used single-element list
match:

status:
  containerStatuses:
    - state: { waiting: { reason: CrashLoopBackOff } }

kyverno-json's sliceNode.Assert
(vendor/github.com/kyverno/kyverno-json/pkg/core/assertion/assertion.go:43)
gates on exact slice length BEFORE comparing elements, so this only
matches pods with exactly one container. Multi-container pods
(sidecar / proxy-injected / init+main) silently passed.

Fixed via JMESPath projection across all 22 files (21 in this PR

NFD's pre-existing container-state blocks + grove's):

status:
  (containerStatuses[?state.waiting.reason == 'CrashLoopBackOff' || ... ] | length(@) > `0`): true

Now fires on any container at any index. Each file also gains a
parallel initContainerStatuses block — init-container
failures (e.g., nvidia-operator-validator's 4 init containers)
are invisible to containerStatuses and were silently missed by
the prior pattern. Inline rationale comment added to every block.

NFD cross-reference clarification

The aws-efa and nvidia-dra-driver-gpu DaemonSet-fix headers cite
NFD as the exemplar for the numberReady == desiredNumberScheduled
pattern. ArangoGutierrez's review note flagged this as misleading
because NFD's pre-#1235 state used numberUnavailable == 0 (the
omitempty trap that #1248 calls out). NFD on this branch (post-#1235's
d7e68a24 patch from @njhensley) uses the correct
numberReady == desiredNumberScheduled pattern — verified in
recipes/checks/nfd/health-check.yaml. The cross-reference is
accurate as of the current head. #1248's core ask is therefore
already resolved in main; this PR additionally fixes NFD's
container-state blocks via the projection form above.

Empty-prefix DataProvider consistency (follow-up)

validators/chainsaw/allowlist_test.go now uses "" (matching the
canonical defaultEmbeddedProvider); pkg/recipe/provider_test.go
still uses "." at multiple sites — same behavior (filepath.Join
cleans both) but mixed style. Out of scope for this PR; tracked
separately if it becomes a maintenance issue.

Audit

Component	Pre-PR depth	Action
`agentgateway-crds`	Full 10 CRDs Established=True	skip (#1243)
`agentgateway`	Deploy + pod-phase	+container-state
`aws-ebs-csi-driver`	Deploy + label pod-phase	+container-state (label-scoped)
`aws-efa`	DS `numberReady > 0` PARTIAL + label pod-phase	fix to full rollout + container-state
`cert-manager`	1 Deploy + pod-phase	+ 2 missing Deploys + 3 CRDs + container-state
`dynamo-platform`	Deploy + pod-phase	+ DCD CRD + container-state
`gatekeeper`	2 Deploys + pod-phase	+container-state
`gke-nccl-tcpxo`	Full DS rollout	skip (#1243)
`gpu-operator`	Deploy + pod-phase	+ ClusterPolicy CRD + cluster-policy CR `state=ready` + container-state
`grove`	Deploy Available + 1 container-state	+3 missing container-state
`k8s-ephemeral-storage-metrics`	Deploy + label pod-phase	+container-state (label-scoped)
`k8s-nim-operator`	Deploy + pod-phase	+container-state
`kai-scheduler`	Deploy + pod-phase	+container-state
`kube-prometheus-stack`	Deploy + label pod-phase	+container-state (label-scoped)
`kubeflow-trainer`	Deploy + pod-phase	+container-state
`kueue`	Deploy + pod-phase	+container-state
`network-operator`	Deploy + pod-phase	+container-state
`nfd`	Full	skip (#1234)
`nodewright-customizations`	Skyhook CR `status=complete`	skip (deep CR state is the right signal)
`nodewright-operator`	Deploy + pod-phase	+ Skyhook CRD + container-state
`nvidia-dra-driver-gpu`	DS `numberReady > 0` PARTIAL + pod-phase	fix to full rollout + container-state
`nvsentinel`	Deploy + pod-phase	+container-state
`prometheus-adapter`	Deploy + label pod-phase	+container-state (label-scoped)
`prometheus-operator-crds`	Full 8 CRDs Established=True	skip (#1243)
`slinky-slurm`	4 CRs + 3 Deploys + StatefulSet + NodeSet + pod-phase	+container-state
`slinky-slurm-operator`	2 Deploys + pod-phase	+container-state
`slinky-slurm-operator-crds`	Full 6 CRDs Established=True	skip (#1243)

Files touched: 21 (6 untouched: 3 CRD-only that are already deep,

NFD/gke-nccl-tcpxo/nodewright-customizations that already have the
right signal).

Headline Enhancements

1. DaemonSet partial → full rollout (2 files)

aws-efa and nvidia-dra-driver-gpu were gating on numberReady > 0,
which passed on partial failures. Switched to numberReady == desiredNumberScheduled with desiredNumberScheduled > 0 as the
vacuous-pass guard — same pattern as the NFD fix in #1234 and the new
gke-nccl-tcpxo in #1243. (We don't use numberUnavailable == 0
because that field is omitempty and disappears when zero — null == 0 evaluates false on a fully-healthy DaemonSet.)

2. Owned-CRD `Established=True` assertions (4 components)

Component	CRDs asserted
`gpu-operator`	`clusterpolicies.nvidia.com`
`dynamo-platform`	`dynamocomponentdeployments.nvidia.com`
`nodewright-operator`	`skyhooks.skyhook.nvidia.com`
`cert-manager`	`certificates.cert-manager.io`, `issuers.cert-manager.io`, `clusterissuers.cert-manager.io`

CRD names verified against
tests/chainsaw/ai-conformance/cluster/assert-crds.yaml and the Go
verifyClusterPolicyReady check.

3. gpu-operator: ClusterPolicy `state=ready` migration

gpu-operator/health-check.yaml now asserts the singleton
cluster-policy CR has status.state == "ready". This migrates the
verifyClusterPolicyReady Go check that PR #1235 task 7 deferred for
assertion-equivalence proof. The static Chainsaw assertion is now
the canonical readiness signal for the full GPU stack (driver +
toolkit + cuda + plugin DaemonSets + validators all healthy →
operator reconciles ClusterPolicy.status.state to "ready").

4. cert-manager scope fix (real bug)

The chart deploys three Deployments: cert-manager (controller),
cert-manager-cainjector, cert-manager-webhook. The previous
check asserted only the controller. The webhook in particular is
load-bearing for admission: a healthy controller with a dead webhook
silently passes the check while the cluster rejects every
Certificate / Issuer CR. Now asserts all three.

5. Universal container-state error coverage (21 files)

Every check that previously only gated on pod phase (Pending /
Failed / Unknown) now also rejects pods stuck in Running phase with
a known-bad container-waiting reason:

CrashLoopBackOff
ImagePullBackOff
ErrImagePull
CreateContainerConfigError

The phase-only filter missed CrashLoopBackOff entirely because such
pods typically remain in Running phase with unhealthy containers.
Same pattern NFD adopted via #1234. Label-scoped in shared
namespaces (kube-system, monitoring) to avoid sibling false-matches;
namespace-only otherwise.

Also Addresses Post-Merge Review on PR #1244

validators/chainsaw/allowlist_test.go:

Provider prefix changed from "." to "" to match the canonical
defaultEmbeddedProvider used at runtime. No behavior change
(filepath.Join cleans both forms), but matches the runtime path
exactly without the head-scratch.
Added a CROSS-TEST DEPENDENCY comment block noting that
empty-assertFile and unreadable-path failure modes are owned by
pkg/recipe.TestComponentRegistry_RequiresHealthCheck. If that
test is ever removed or weakened, this test must be strengthened
in lockstep so a broken registry can't silently green-light
here.

Out of Scope

Deferred to a possible follow-up:

CRD Established=True for 6 more operators that own CRDs:
gatekeeper, kueue, kubeflow-trainer, network-operator,
k8s-nim-operator, slinky-slurm-operator. Each needs
verification against its pinned chart version; bundling them
blindly would risk introducing typo'd CRD names that fail on live
clusters.
StatefulSet readiness for kube-prometheus-stack (Prometheus /
Alertmanager StatefulSets shipped by the stack): chart's exact
StatefulSet names and label conventions need verification.

Testing

go test -race -count=1 ./pkg/recipe/ ./validators/chainsaw/ ./validators/deployment/ ./pkg/defaults/ ./pkg/validator/catalog/
golangci-lint run -c .golangci.yaml ./pkg/recipe/... ./validators/...

TestComponentRegistry_RequiresHealthCheck (pkg/recipe): 27/27
components pass — every entry declares assertFile + path resolves.
TestValidateTestReadOnly_RegistryContent (validators/chainsaw):
walks every registry-declared assertFile and validates against the
read-only allowlist. 27/27 pass.
All in-tree tests still pass.
Lint clean.

Live-cluster validation deferred to @mchmarny per the same
agreement as #1235 / #1243. Priority verification targets for
this PR specifically:

gpu-operator's ClusterPolicy.status.state == "ready" — the
load-bearing assertion that migrates the Go check. If it
diverges from the Go check's behavior on real clusters, the
migration needs adjusting.
cert-manager's 3 Deployments — confirm cert-manager-cainjector
and cert-manager-webhook are the actual names produced by the
chart's fullname template (vs. release-prefixed variants).
The 4 owned-CRD assertions — confirm each CRD name exists
exactly as asserted on a live cluster.

Risk Assessment

Medium — 21 file changes; semantically the depth bump is a
real correctness improvement; isolated per-file revert is trivial.

Rollout notes: After merge, the deployment-phase chainsaw runner
exercises the deeper assertions on every aicr validate --phase deployment run. A component that previously passed the existence
check but had a partially-healthy DaemonSet (or a CRD that isn't
Established yet, or a webhook Deployment down) now correctly
reports the failure. The fail-loud direction is the goal — see
"Live-cluster validation" above for the verification targets.

Checklist

Tests pass locally (make test with -race)
Linter passes (changed paths)
I did not skip/disable tests to make CI green
No new tests added — the existing two lint guards
(TestComponentRegistry_RequiresHealthCheck +
TestValidateTestReadOnly_RegistryContent) already cover the
validation; this PR is the registry content they validate
No user-facing CLI/API behavior change
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

github-actions · 2026-06-09T11:07:11Z

Coverage Report ✅

Metric	Value
Coverage	76.3%
Threshold	75%
Status	Pass

Coverage Badge

![Coverage](https://img.shields.io/badge/coverage-76.3%25-green)

No Go source files changed in this PR.

Closes #1222. With this merged, epic #660 is fully done. The PR-time runtime contract from #1219/#1220/#1223 is in place; this PR closes the depth-of-coverage gap on the 21 health checks that landed before the epic and were shallow by today's standards. Per the audit in the PR body, three classes of enhancement: 1. DaemonSet partial-readiness fix (2 files): aws-efa, nvidia-dra-driver-gpu were gating on `numberReady > 0`, which passed on partial failures. Switched to `numberReady == desiredNumberScheduled` with `desiredNumberScheduled > 0` as the vacuous-pass guard — same pattern that fixed NFD in #1234 and the new gke-nccl-tcpxo in #1243. 2. Owned-CRD Established=True assertions (4 components): - gpu-operator: clusterpolicies.nvidia.com + the singleton `cluster-policy` CR with status.state == "ready". This MIGRATES the verifyClusterPolicyReady Go check that PR #1235 task 7 deferred for assertion-equivalence proof; the static Chainsaw assertion is now the canonical readiness signal. - dynamo-platform: dynamocomponentdeployments.nvidia.com. - nodewright-operator: skyhooks.skyhook.nvidia.com. - cert-manager: certificates / issuers / clusterissuers .cert-manager.io. CRD names verified against tests/chainsaw/ai-conformance/cluster/assert-crds.yaml and the Go verifyClusterPolicyReady check. 3. cert-manager scope fix: Previously asserted only the controller Deployment, missing cainjector + webhook. The webhook in particular is load-bearing for admission of any Certificate/Issuer CR; a healthy controller with a dead webhook would silently pass while the cluster rejects every cert-manager CR. Now asserts all three Deployments. 4. Universal container-state error coverage (21 files): Every check that previously only gated on pod phase (Pending / Failed / Unknown) now also rejects pods stuck in Running phase with a known-bad container-waiting reason (CrashLoopBackOff, ImagePullBackOff, ErrImagePull, CreateContainerConfigError). The phase-only filter missed CrashLoopBackOff entirely because such pods typically remain in Running phase with unhealthy containers. Same pattern NFD adopted via #1234. Label-scoped in shared namespaces (kube-system, monitoring) to avoid sibling false-matches; namespace-only otherwise. Also addresses two post-merge review comments on PR #1244: - validators/chainsaw/allowlist_test.go: provider prefix changed from "." to "" to match the canonical defaultEmbeddedProvider used at runtime (head-scratch elimination, no behavior change — filepath.Join cleans both forms equivalently). - Same file: added an explicit CROSS-TEST DEPENDENCY comment block noting that empty-assertFile and unreadable-path failure modes are owned by pkg/recipe.TestComponentRegistry_RequiresHealthCheck. If that test is ever removed or weakened, this test must be strengthened in lockstep so a broken registry can't silently green-light here. Out of scope (deferred to follow-up work): - CRD Established=True for the remaining 6 operators that own CRDs but where I lack high-confidence enumeration without pulling the charts: gatekeeper, kueue, kubeflow-trainer, network-operator, k8s-nim-operator, slinky-slurm-operator. Each needs verification against its pinned chart version; bundling them blindly would risk introducing typo'd names that fail on live clusters. Tracking as a possible follow-up after this PR proves the audit-and-enhance pattern works end-to-end. - StatefulSet readiness for kube-prometheus-stack's Prometheus / Alertmanager: similar reasoning — the chart's exact StatefulSet names and label conventions need verification. Allowlist sweep (TestValidateTestReadOnly_RegistryContent) + registry lint guard (TestComponentRegistry_RequiresHealthCheck) both pass: 27/27 components valid. Live-cluster validation deferred to mchmarny per the same agreement as #1235 / #1243. Refs #660 (epic), #1219, #1220, #1221, #1223, #1234.

Closes #1249. PR #1244's `aicr recipe sign-catalog` post-hook and PR #1245's `cli-bundle-attestation-ci` chainsaw test both failed with the same trace: [TIMEOUT] sigstore signing timed out: Post "https://rekor.sigstore.dev/api/v1/log/entries": giving up after 1 attempt(s): context deadline exceeded The "1 attempt(s)" came from `pkg/bundler/attestation/signing.go`'s single-pass `sign.Bundle()` call — the wrapper had no retry, so a slow Rekor response on the only attempt turned ordinary upstream latency into a CI failure. Changes: - pkg/defaults/timeouts.go: SigstoreAttemptTimeout (35s) bounds a single sign.Bundle call. SigstoreRetryBudget (3) caps total attempts. SigstoreRetryInitialBackoff (1s) + SigstoreRetryBackoffFactor (5) produce backoffs of 1s, 5s. Worst-case wall-clock (3 × 35s + 1s + 5s = 111s) fits inside the existing SigstoreSignTimeout (2m) ceiling. - pkg/defaults/timeouts_test.go: TestSigstoreRetryBudgetInvariant guards the math against future tuning that would overflow SigstoreSignTimeout. - pkg/bundler/attestation/signing.go: Extracted the sign.Bundle invocation into signWithRetry, a bounded exponential-backoff retry helper. Retry semantics: - outer ctx DeadlineExceeded → ErrCodeTimeout, no retry. - outer ctx Canceled → ErrCodeUnavailable, no retry. - per-attempt failure with outer ctx alive → retry until budget exhausted, then ErrCodeUnavailable wrapping the last error. Backoff sleep is interruptible by the outer ctx — a slow Rekor recovering 10s later doesn't waste the remaining budget. - pkg/bundler/attestation/signing_retry_test.go: Five tests: success-on-first, success-after-transient (verifies one backoff is honored), budget-exhaustion (counts attempts + asserts ErrCodeUnavailable + wrapped sentinel), outer-deadline (asserts ErrCodeTimeout + retry short-circuits), outer-cancel (asserts ErrCodeUnavailable). Uses real timing; full-exhaustion test runs ~6s. All run in parallel. - .goreleaser.yaml: cosign attest-blob now passes --retry 5 (matches cosign's documented default backoff). Costs nothing on a healthy Rekor; absorbs an entire release run when Rekor is slow. - pkg/bundler/attestation/doc.go: New "Retry Contract" section documents the per-attempt / outer-ceiling split, the three retry-class branches, and the invariant test pointer. Refs #1244 (first observed instance), #1245 (second instance, in review at time of merge).

Deployment-phase health-check validation regressed in the recent in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252): 1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY non-nil error, so a permanent JMESPath evaluation error was retried for the entire assert window (6m) instead of failing fast like the old chainsaw binary did. Classify assertion-engine errors as ErrCodeInvalidRequest (terminal) and short-circuit the retry loops on them. Adds a regression test asserting terminal eval errors fail fast (< AssertRetryInterval) rather than retrying to deadline. 2. recipes/checks/*: the (init)containerStatuses[?...] | length(@) projection throws 'invalid type for: <nil>' on any pod without (init) containers (the common case), feeding defect #1. Coalesce to an empty array across all 22 affected component health checks. 3. recipes/checks/dynamo-platform: validate-deployment-exists asserted a Deployment named 'dynamo-operator', but the chart prefixes it with the release name (dynamo-platform-dynamo-operator-controller-manager), so the assert never matched and retried to the deadline. Match by the stable app.kubernetes.io/name=dynamo-operator label instead. Verified on an EKS H100 cluster: deployment phase goes from an 8m timeout (status=other) to PASSED in ~21s; full deployment 4/4 and conformance 11/11 green.

mchmarny requested review from a team as code owners June 9, 2026 10:56

mchmarny added enhancement area/recipes area/validator labels Jun 9, 2026

mchmarny self-assigned this Jun 9, 2026

github-actions Bot added size/XL and removed area/validator labels Jun 9, 2026

This was referenced Jun 9, 2026

Backfill CRD Established=True for 6 operator-owning health checks #1246

Open

Add Prometheus + Alertmanager StatefulSet readiness to kube-prometheus-stack check #1247

Closed

This comment was marked as resolved.

Sign in to view

mchmarny force-pushed the feat/1222-enhance-existing-checks branch from f8e9ca3 to 7e13be4 Compare June 9, 2026 11:12

This comment was marked as resolved.

Sign in to view

ArangoGutierrez mentioned this pull request Jun 9, 2026

fix(recipes): NFD worker-DaemonSet check asserts omitempty numberUnavailable (fails on healthy cluster) #1248

Closed

mchmarny force-pushed the feat/1222-enhance-existing-checks branch from 7e13be4 to fcb5bdc Compare June 9, 2026 11:37

mchmarny mentioned this pull request Jun 9, 2026

Bump cosign / sign.Bundle retry budget to absorb transient Rekor flakes #1249

Closed

This comment was marked as resolved.

Sign in to view

mchmarny mentioned this pull request Jun 9, 2026

feat(bundler): retry sign.Bundle on transient Sigstore failures #1251

Merged

12 tasks

Merge branch 'main' into feat/1222-enhance-existing-checks

996e846

lalitadithya approved these changes Jun 9, 2026

View reviewed changes

mchmarny merged commit 81daab3 into main Jun 9, 2026
117 checks passed

mchmarny deleted the feat/1222-enhance-existing-checks branch June 9, 2026 14:08

yuanchen8911 mentioned this pull request Jun 9, 2026

fix(validator): fail fast on eval errors; nil-safe health checks #1261

Merged

11 tasks

mchmarny mentioned this pull request Jun 11, 2026

chore: trim agent rules under context limit, add review patterns #1322

Merged

12 tasks

mohityadav8 mentioned this pull request Jun 26, 2026

feat(recipes): add Prometheus/Alertmanager StatefulSet readiness to kube-prometheus-stack check #1497

Merged

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(recipes): deepen 21 chainsaw health checks; close epic #660#1245

feat(recipes): deepen 21 chainsaw health checks; close epic #660#1245
mchmarny merged 2 commits into
mainfrom
feat/1222-enhance-existing-checks

mchmarny commented Jun 9, 2026 •

edited

Loading

Uh oh!

This comment was marked as resolved.

This comment was marked as resolved.

Uh oh!

github-actions Bot commented Jun 9, 2026 •

edited

Loading

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

mchmarny commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Post-merge review fixes (@ArangoGutierrez)

Multi-container projection fix (was: single-element silently passes sidecars)

NFD cross-reference clarification

Empty-prefix DataProvider consistency (follow-up)

Audit

Headline Enhancements

1. DaemonSet partial → full rollout (2 files)

2. Owned-CRD Established=True assertions (4 components)

3. gpu-operator: ClusterPolicy state=ready migration

4. cert-manager scope fix (real bug)

5. Universal container-state error coverage (21 files)

Also Addresses Post-Merge Review on PR #1244

Out of Scope

Testing

Risk Assessment

Checklist

Uh oh!

This comment was marked as resolved.

This comment was marked as resolved.

Uh oh!

github-actions Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Coverage Report ✅

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mchmarny commented Jun 9, 2026 •

edited

Loading

2. Owned-CRD `Established=True` assertions (4 components)

3. gpu-operator: ClusterPolicy `state=ready` migration

github-actions Bot commented Jun 9, 2026 •

edited

Loading