feat(recipes): deepen 21 chainsaw health checks; close epic #660 #1245
Merged
Coverage Report ✅
No Go source files changed in this PR.
Closes #1222. With this merged, epic #660 is fully done. The PR-time runtime contract from #1219/#1220/#1223 is in place; this PR closes the depth-of-coverage gap on the 21 health checks that landed before the epic and were shallow by today's standards.

Per the audit in the PR body, four classes of enhancement:

1. DaemonSet partial-readiness fix (2 files): aws-efa, nvidia-dra-driver-gpu were gating on `numberReady > 0`, which passed on partial failures. Switched to `numberReady == desiredNumberScheduled` with `desiredNumberScheduled > 0` as the vacuous-pass guard, the same pattern that fixed NFD in #1234 and the new gke-nccl-tcpxo in #1243.
2. Owned-CRD Established=True assertions (4 components):
   - gpu-operator: clusterpolicies.nvidia.com + the singleton `cluster-policy` CR with status.state == "ready". This MIGRATES the verifyClusterPolicyReady Go check that PR #1235 task 7 deferred for assertion-equivalence proof; the static Chainsaw assertion is now the canonical readiness signal.
   - dynamo-platform: dynamocomponentdeployments.nvidia.com.
   - nodewright-operator: skyhooks.skyhook.nvidia.com.
   - cert-manager: certificates / issuers / clusterissuers .cert-manager.io.
   CRD names verified against tests/chainsaw/ai-conformance/cluster/assert-crds.yaml and the Go verifyClusterPolicyReady check.
3. cert-manager scope fix: Previously asserted only the controller Deployment, missing cainjector + webhook. The webhook in particular is load-bearing for admission of any Certificate/Issuer CR; a healthy controller with a dead webhook would silently pass while the cluster rejects every cert-manager CR. Now asserts all three Deployments.
4. Universal container-state error coverage (21 files): Every check that previously only gated on pod phase (Pending / Failed / Unknown) now also rejects pods stuck in Running phase with a known-bad container-waiting reason (CrashLoopBackOff, ImagePullBackOff, ErrImagePull, CreateContainerConfigError). The phase-only filter missed CrashLoopBackOff entirely because such pods typically remain in Running phase with unhealthy containers. Same pattern NFD adopted via #1234. Label-scoped in shared namespaces (kube-system, monitoring) to avoid sibling false-matches; namespace-only otherwise.

Also addresses two post-merge review comments on PR #1244:
- validators/chainsaw/allowlist_test.go: provider prefix changed from "." to "" to match the canonical defaultEmbeddedProvider used at runtime (head-scratch elimination, no behavior change: filepath.Join cleans both forms equivalently).
- Same file: added an explicit CROSS-TEST DEPENDENCY comment block noting that empty-assertFile and unreadable-path failure modes are owned by pkg/recipe.TestComponentRegistry_RequiresHealthCheck. If that test is ever removed or weakened, this test must be strengthened in lockstep so a broken registry can't silently green-light here.

Out of scope (deferred to follow-up work):
- CRD Established=True for the remaining 6 operators that own CRDs but where I lack high-confidence enumeration without pulling the charts: gatekeeper, kueue, kubeflow-trainer, network-operator, k8s-nim-operator, slinky-slurm-operator. Each needs verification against its pinned chart version; bundling them blindly would risk introducing typo'd names that fail on live clusters. Tracking as a possible follow-up after this PR proves the audit-and-enhance pattern works end-to-end.
- StatefulSet readiness for kube-prometheus-stack's Prometheus / Alertmanager: similar reasoning, since the chart's exact StatefulSet names and label conventions need verification.

Allowlist sweep (TestValidateTestReadOnly_RegistryContent) + registry lint guard (TestComponentRegistry_RequiresHealthCheck) both pass: 27/27 components valid. Live-cluster validation deferred to mchmarny per the same agreement as #1235 / #1243.

Refs #660 (epic), #1219, #1220, #1221, #1223, #1234.
mchmarny added a commit that referenced this pull request on Jun 9, 2026
Closes #1249.

PR #1244's `aicr recipe sign-catalog` post-hook and PR #1245's `cli-bundle-attestation-ci` chainsaw test both failed with the same trace:

[TIMEOUT] sigstore signing timed out: Post "https://rekor.sigstore.dev/api/v1/log/entries": giving up after 1 attempt(s): context deadline exceeded

The "1 attempt(s)" came from `pkg/bundler/attestation/signing.go`'s single-pass `sign.Bundle()` call. The wrapper had no retry, so a slow Rekor response on the only attempt turned ordinary upstream latency into a CI failure.

Changes:
- pkg/defaults/timeouts.go: SigstoreAttemptTimeout (35s) bounds a single sign.Bundle call. SigstoreRetryBudget (3) caps total attempts. SigstoreRetryInitialBackoff (1s) + SigstoreRetryBackoffFactor (5) produce backoffs of 1s, 5s. Worst-case wall-clock (3 × 35s + 1s + 5s = 111s) fits inside the existing SigstoreSignTimeout (2m) ceiling.
- pkg/defaults/timeouts_test.go: TestSigstoreRetryBudgetInvariant guards the math against future tuning that would overflow SigstoreSignTimeout.
- pkg/bundler/attestation/signing.go: Extracted the sign.Bundle invocation into signWithRetry, a bounded exponential-backoff retry helper. Retry semantics:
  - outer ctx DeadlineExceeded → ErrCodeTimeout, no retry.
  - outer ctx Canceled → ErrCodeUnavailable, no retry.
  - per-attempt failure with outer ctx alive → retry until budget exhausted, then ErrCodeUnavailable wrapping the last error.
  Backoff sleep is interruptible by the outer ctx, so a slow Rekor recovering 10s later doesn't waste the remaining budget.
- pkg/bundler/attestation/signing_retry_test.go: Five tests: success-on-first, success-after-transient (verifies one backoff is honored), budget-exhaustion (counts attempts + asserts ErrCodeUnavailable + wrapped sentinel), outer-deadline (asserts ErrCodeTimeout + retry short-circuits), outer-cancel (asserts ErrCodeUnavailable). Uses real timing; full-exhaustion test runs ~6s. All run in parallel.
- .goreleaser.yaml: cosign attest-blob now passes --retry 5 (matches cosign's documented default backoff). Costs nothing on a healthy Rekor; absorbs an entire release run when Rekor is slow.
- pkg/bundler/attestation/doc.go: New "Retry Contract" section documents the per-attempt / outer-ceiling split, the three retry-class branches, and the invariant test pointer.

Refs #1244 (first observed instance), #1245 (second instance, in review at time of merge).
lalitadithya approved these changes on Jun 9, 2026
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request on Jun 9, 2026
Deployment-phase health-check validation regressed in the recent in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252):

1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY non-nil error, so a permanent JMESPath evaluation error was retried for the entire assert window (6m) instead of failing fast like the old chainsaw binary did. Classify assertion-engine errors as ErrCodeInvalidRequest (terminal) and short-circuit the retry loops on them. Adds a regression test asserting terminal eval errors fail fast (< AssertRetryInterval) rather than retrying to deadline.
2. recipes/checks/*: the (init)containerStatuses[?...] | length(@) projection throws 'invalid type for: <nil>' on any pod without (init) containers (the common case), feeding defect #1. Coalesce to an empty array across all 22 affected component health checks.
3. recipes/checks/dynamo-platform: validate-deployment-exists asserted a Deployment named 'dynamo-operator', but the chart prefixes it with the release name (dynamo-platform-dynamo-operator-controller-manager), so the assert never matched and retried to the deadline. Match by the stable app.kubernetes.io/name=dynamo-operator label instead.

Verified on an EKS H100 cluster: deployment phase goes from an 8m timeout (status=other) to PASSED in ~21s; full deployment 4/4 and conformance 11/11 green.
Summary
Closes #1222. With this merged, epic #660 is fully done.
Audit + deepen 21 chainsaw health checks that landed before the epic
hit "every check is deep" depth. The PR-time runtime contract from
#1219 / #1220 / #1223 is in place; this PR closes the
depth-of-coverage gap on the actual assertions inside each
`recipes/checks/*/health-check.yaml`.
Fixes: #1222
Closes: #1248 (NFD `numberUnavailable` omitempty bug; the core fix was
landed pre-PR via #1235's `d7e68a24` patch; this PR's broader NFD
touch reinforces with the projection-form container-state fix
below).
Closes epic: #660
Post-merge review fixes (@ArangoGutierrez)
After the initial 21-file enhancement landed, a follow-up review
surfaced a load-bearing correctness gap and several smaller items.
All three are now addressed in this PR:
Multi-container projection fix (was: single-element silently passes sidecars)
The pre-fix container-state error blocks used a single-element list
match. kyverno-json's `sliceNode.Assert`
(`vendor/github.com/kyverno/kyverno-json/pkg/core/assertion/assertion.go:43`)
gates on exact slice length BEFORE comparing elements, so that form only
matches pods with exactly one container. Multi-container pods
(sidecar / proxy-injected / init+main) silently passed.
Fixed via JMESPath projection across all 22 files (21 in this PR
Now fires on any container at any index. Each file also gains a
parallel `initContainerStatuses` block: init-container
failures (e.g., `nvidia-operator-validator`'s 4 init containers)
are invisible to `containerStatuses` and were silently missed by
the prior pattern. Inline rationale comment added to every block.
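To see why a length-gated list match lets sidecar pods through while a filter-and-count form does not, here is a minimal Go analogy. It does not reproduce kyverno-json or JMESPath; `containerStatus`, `matchExact`, and `countWaiting` are hypothetical names that only imitate the two matching semantics described above.

```go
package main

import "fmt"

// containerStatus is a simplified, hypothetical stand-in for one entry of a
// pod's containerStatuses (or initContainerStatuses) list.
type containerStatus struct {
	Name          string
	WaitingReason string // empty when the container is not waiting
}

// matchExact imitates a length-gated list match: the expected and actual
// lists must have the same length before any element is compared.
func matchExact(expected []string, actual []containerStatus) bool {
	if len(expected) != len(actual) {
		return false
	}
	for i, reason := range expected {
		if actual[i].WaitingReason != reason {
			return false
		}
	}
	return true
}

// countWaiting imitates a projection: it filters every element, at any index,
// and returns how many match. A nil list simply yields zero.
func countWaiting(reason string, actual []containerStatus) int {
	n := 0
	for _, c := range actual {
		if c.WaitingReason == reason {
			n++
		}
	}
	return n
}

func main() {
	sidecarPod := []containerStatus{
		{Name: "proxy"},
		{Name: "app", WaitingReason: "CrashLoopBackOff"},
	}
	want := []string{"CrashLoopBackOff"}

	fmt.Println("length-gated match fires:", matchExact(want, sidecarPod))
	fmt.Println("projection count:", countWaiting("CrashLoopBackOff", sidecarPod))
	fmt.Println("pod without statuses:", countWaiting("CrashLoopBackOff", nil))
}
```

The same counting function can be applied to the init-container list, which is the reason each check gains a parallel block for it.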
NFD cross-reference clarification
The aws-efa and nvidia-dra-driver-gpu DaemonSet-fix headers cite
NFD as the exemplar for the `numberReady == desiredNumberScheduled`
pattern. ArangoGutierrez's review note flagged this as misleading
because NFD's pre-#1235 state used `numberUnavailable == 0` (the
omitempty trap that #1248 calls out). NFD on this branch (post-#1235's
`d7e68a24` patch from @njhensley) uses the correct
`numberReady == desiredNumberScheduled` pattern, verified in
`recipes/checks/nfd/health-check.yaml`. The cross-reference is
accurate as of the current head. #1248's core ask is therefore
already resolved in main; this PR additionally fixes NFD's
container-state blocks via the projection form above.
Empty-prefix DataProvider consistency (follow-up)
`validators/chainsaw/allowlist_test.go` now uses `""` (matching the
canonical `defaultEmbeddedProvider`); `pkg/recipe/provider_test.go`
still uses `"."` at multiple sites. The behavior is the same
(`filepath.Join` cleans both) but the style is mixed. Out of scope for
this PR; tracked separately if it becomes a maintenance issue.
Audit
- agentgateway-crds
- agentgateway
- aws-ebs-csi-driver
- aws-efa: `numberReady > 0` PARTIAL + label pod-phase
- cert-manager
- dynamo-platform
- gatekeeper
- gke-nccl-tcpxo
- gpu-operator: `state=ready` + container-state
- grove
- k8s-ephemeral-storage-metrics
- k8s-nim-operator
- kai-scheduler
- kube-prometheus-stack
- kubeflow-trainer
- kueue
- network-operator
- nfd
- nodewright-customizations: `status=complete`
- nodewright-operator
- nvidia-dra-driver-gpu: `numberReady > 0` PARTIAL + pod-phase
- nvsentinel
- prometheus-adapter
- prometheus-operator-crds
- slinky-slurm
- slinky-slurm-operator
- slinky-slurm-operator-crds
Files touched: 21 (6 untouched: 3 CRD-only that are already deep,
right signal).
Headline Enhancements
1. DaemonSet partial → full rollout (2 files)
`aws-efa` and `nvidia-dra-driver-gpu` were gating on `numberReady > 0`,
which passed on partial failures. Switched to
`numberReady == desiredNumberScheduled` with `desiredNumberScheduled > 0` as the
vacuous-pass guard, the same pattern as the NFD fix in #1234 and the new
`gke-nccl-tcpxo` in #1243. (We don't use `numberUnavailable == 0`
because that field is `omitempty` and disappears when zero, so
`null == 0` evaluates false on a fully-healthy DaemonSet.)
2. Owned-CRD `Established=True` assertions (4 components)
- `gpu-operator`: `clusterpolicies.nvidia.com`
- `dynamo-platform`: `dynamocomponentdeployments.nvidia.com`
- `nodewright-operator`: `skyhooks.skyhook.nvidia.com`
- `cert-manager`: `certificates.cert-manager.io`, `issuers.cert-manager.io`, `clusterissuers.cert-manager.io`
CRD names verified against
`tests/chainsaw/ai-conformance/cluster/assert-crds.yaml` and the Go
`verifyClusterPolicyReady` check.
3. gpu-operator: ClusterPolicy `state=ready` migration
`gpu-operator/health-check.yaml` now asserts the singleton
`cluster-policy` CR has `status.state == "ready"`. This migrates the
`verifyClusterPolicyReady` Go check that PR #1235 task 7 deferred for
assertion-equivalence proof. The static Chainsaw assertion is now
the canonical readiness signal for the full GPU stack (driver +
toolkit + cuda + plugin DaemonSets + validators all healthy →
operator reconciles ClusterPolicy.status.state to "ready").
4. cert-manager scope fix (real bug)
The chart deploys three Deployments: `cert-manager` (controller),
`cert-manager-cainjector`, `cert-manager-webhook`. The previous
check asserted only the controller. The webhook in particular is
load-bearing for admission: a healthy controller with a dead webhook
silently passes the check while the cluster rejects every
Certificate / Issuer CR. Now asserts all three.
5. Universal container-state error coverage (21 files)
Every check that previously only gated on pod phase (Pending /
Failed / Unknown) now also rejects pods stuck in Running phase with
a known-bad container-waiting reason:
- `CrashLoopBackOff`
- `ImagePullBackOff`
- `ErrImagePull`
- `CreateContainerConfigError`
The phase-only filter missed `CrashLoopBackOff` entirely because such
pods typically remain in Running phase with unhealthy containers.
Same pattern NFD adopted via #1234. Label-scoped in shared
namespaces (kube-system, monitoring) to avoid sibling false-matches;
namespace-only otherwise.
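A rough sketch of the difference between the phase-only gate and the deepened gate follows. The `pod` type and both functions are hypothetical simplifications for illustration; the actual checks are Chainsaw assertions, not Go code.

```go
package main

import "fmt"

// pod is a hypothetical, flattened view of the fields the checks look at:
// the pod phase plus the waiting reason of each (init) container.
type pod struct {
	Phase                string
	ContainerWaiting     []string // "" for a container that is not waiting
	InitContainerWaiting []string
}

// badReasons holds the four waiting reasons listed above.
var badReasons = map[string]bool{
	"CrashLoopBackOff":           true,
	"ImagePullBackOff":           true,
	"ErrImagePull":               true,
	"CreateContainerConfigError": true,
}

// failsPhaseOnly is the old gate: it only rejects on pod phase.
func failsPhaseOnly(p pod) bool {
	switch p.Phase {
	case "Pending", "Failed", "Unknown":
		return true
	}
	return false
}

// failsDeep is the deepened gate: it also rejects a pod whose regular or init
// containers are waiting for a known-bad reason, whatever the phase says.
func failsDeep(p pod) bool {
	if failsPhaseOnly(p) {
		return true
	}
	for _, reasons := range [][]string{p.ContainerWaiting, p.InitContainerWaiting} {
		for _, r := range reasons {
			if badReasons[r] {
				return true
			}
		}
	}
	return false
}

func main() {
	crashing := pod{Phase: "Running", ContainerWaiting: []string{"", "CrashLoopBackOff"}}
	healthy := pod{Phase: "Running", ContainerWaiting: []string{"", ""}}

	fmt.Println("crashing: phase-only rejects:", failsPhaseOnly(crashing), "deep rejects:", failsDeep(crashing))
	fmt.Println("healthy:  phase-only rejects:", failsPhaseOnly(healthy), "deep rejects:", failsDeep(healthy))
}
```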
Also Addresses Post-Merge Review on PR #1244
- `validators/chainsaw/allowlist_test.go`: provider prefix changed from
`"."` to `""` to match the canonical `defaultEmbeddedProvider` used at
runtime. No behavior change (`filepath.Join` cleans both forms), but
matches the runtime path exactly without the head-scratch.
- Same file: added an explicit CROSS-TEST DEPENDENCY comment block noting that
empty-assertFile and unreadable-path failure modes are owned by
`pkg/recipe.TestComponentRegistry_RequiresHealthCheck`. If that
test is ever removed or weakened, this test must be strengthened
in lockstep so a broken registry can't silently green-light
here.
Out of Scope
Deferred to a possible follow-up:
- CRD `Established=True` for 6 more operators that own CRDs: `gatekeeper`,
`kueue`, `kubeflow-trainer`, `network-operator`,
`k8s-nim-operator`, `slinky-slurm-operator`. Each needs
verification against its pinned chart version; bundling them
blindly would risk introducing typo'd CRD names that fail on live
clusters.
- StatefulSet readiness for kube-prometheus-stack (Prometheus /
Alertmanager StatefulSets shipped by the stack): chart's exact
StatefulSet names and label conventions need verification.
Testing
go test -race -count=1 ./pkg/recipe/ ./validators/chainsaw/ ./validators/deployment/ ./pkg/defaults/ ./pkg/validator/catalog/
golangci-lint run -c .golangci.yaml ./pkg/recipe/... ./validators/...
- `TestComponentRegistry_RequiresHealthCheck` (`pkg/recipe`): 27/27
components pass; every entry declares assertFile + path resolves.
- `TestValidateTestReadOnly_RegistryContent` (`validators/chainsaw`):
walks every registry-declared assertFile and validates against the
read-only allowlist. 27/27 pass.
Live-cluster validation deferred to @mchmarny per the same
agreement as #1235 / #1243. Priority verification targets for
this PR specifically:
- `ClusterPolicy.status.state == "ready"`: the
load-bearing assertion that migrates the Go check. If it
diverges from the Go check's behavior on real clusters, the
migration needs adjusting.
- `cert-manager-cainjector` and
`cert-manager-webhook` are the actual names produced by the
chart's fullname template (vs. release-prefixed variants).
exactly as asserted on a live cluster.
Risk Assessment
real correctness improvement; isolated per-file revert is trivial.
Rollout notes: After merge, the deployment-phase chainsaw runner
exercises the deeper assertions on every
`aicr validate --phase deployment` run. A component that previously passed the existence
check but had a partially-healthy DaemonSet (or a CRD that isn't
Established yet, or a webhook Deployment down) now correctly
reports the failure. The fail-loud direction is the goal — see
"Live-cluster validation" above for the verification targets.
Checklist
`make test` with `-race`)
(`TestComponentRegistry_RequiresHealthCheck` + `TestValidateTestReadOnly_RegistryContent`) already cover the
validation; this PR is the registry content they validate
`git commit -S`)