feat(recipes): backfill chainsaw health checks for 5 missing components#1243
Conversation
Closes #1221. Every current registry component now has a healthCheck.assertFile, which is the precondition for PR #1223 (lint guard requiring assertFile on new entries). The original #1221 text named gke-nccl-tcpxo, dynamo-crds, and kgateway-crds; the registry has drifted since the issue was written (dynamo-crds / kgateway-crds were renamed / folded into other components). The actual 5 current gaps are covered here. #1221's acceptance criterion is "every current registry component has a healthCheck.assertFile" — sticking to the dated list would leave 4 gaps and prevent #1223 from enforcing. Components and asserts: - gke-nccl-tcpxo (manifest-only): asserts the two DaemonSets shipped via recipes/components/gke-nccl-tcpxo/manifests/ — nccl-tcpxo-installer and device-injector in kube-system — using numberReady == desiredNumberScheduled (same pattern as the NFD fix in 627ad48; numberUnavailable is omitempty and disappears when zero). - grove (Helm, ai-dynamo/grove v0.1.0-alpha.6): asserts the grove-operator Deployment in dynamo-system reaches Available=True, mirroring the existing assertion in tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-dynamo.yaml so the registry check produces the same signal the ai-conformance suite exercises. Adds Pod-phase / container-state error guards. - agentgateway-crds: asserts Established=True on the 10 vendored CRDs under recipes/components/agentgateway-crds/manifests/ (Gateway API + Inference Extension). The chart's own agentgateway.dev/v1alpha1 CRDs (AgentgatewayBackend / Policy / Parameters per the values.yaml header) are documented in the check header but not asserted — their plural metadata.name values aren't vendored in-tree at v2.2.1 and are flagged for live verification. - prometheus-operator-crds (28.0.1): asserts Established=True on the 8 monitoring.coreos.com/v1 CRDs enumerated in the chart's values.yaml header — well-known upstream shape. - slinky-slurm-operator-crds (1.1.0): asserts Established=True on the 6 slinky.slurm.net CRDs enumerated in the chart's values.yaml header (Accounting / Controller / LoginSet / NodeSet / RestApi / Token). Plural metadata.name values derived from controller-gen defaults; flagged for live verification in the check header. All 5 YAML files use only the read-only assert/error allowlist (verified by TestValidateTestReadOnly_RegistryContent which sweeps every in-tree recipes/checks/*/health-check.yaml). Live-cluster validation against h100-eks-ubuntu-inference-dynamo (and h100-gke-cos-training for gke-nccl-tcpxo) is the load-bearing correctness step — deferred to mchmarny per the same agreement as #1235. Refs #660 (epic).
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (6)
📝 WalkthroughWalkthroughThis PR adds Chainsaw health-check test manifests for five components that were previously missing health-check definitions. The changes include new test files validating component readiness (CRD establishment for three CRD-only components, DaemonSet rollout for gke-nccl-tcpxo, and Deployment/Pod health for grove) and corresponding registry entries wiring each test to its component. Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Coverage Report ✅
Coverage BadgeNo Go source files changed in this PR. |
|
|
Closes #1222. With this merged, epic #660 is fully done. The PR-time runtime contract from #1219/#1220/#1223 is in place; this PR closes the depth-of-coverage gap on the 21 health checks that landed before the epic and were shallow by today's standards. Per the audit in the PR body, three classes of enhancement: 1. DaemonSet partial-readiness fix (2 files): aws-efa, nvidia-dra-driver-gpu were gating on `numberReady > 0`, which passed on partial failures. Switched to `numberReady == desiredNumberScheduled` with `desiredNumberScheduled > 0` as the vacuous-pass guard — same pattern that fixed NFD in #1234 and the new gke-nccl-tcpxo in #1243. 2. Owned-CRD Established=True assertions (4 components): - gpu-operator: clusterpolicies.nvidia.com + the singleton `cluster-policy` CR with status.state == "ready". This MIGRATES the verifyClusterPolicyReady Go check that PR #1235 task 7 deferred for assertion-equivalence proof; the static Chainsaw assertion is now the canonical readiness signal. - dynamo-platform: dynamocomponentdeployments.nvidia.com. - nodewright-operator: skyhooks.skyhook.nvidia.com. - cert-manager: certificates / issuers / clusterissuers .cert-manager.io. CRD names verified against tests/chainsaw/ai-conformance/cluster/assert-crds.yaml and the Go verifyClusterPolicyReady check. 3. cert-manager scope fix: Previously asserted only the controller Deployment, missing cainjector + webhook. The webhook in particular is load-bearing for admission of any Certificate/Issuer CR; a healthy controller with a dead webhook would silently pass while the cluster rejects every cert-manager CR. Now asserts all three Deployments. 4. Universal container-state error coverage (21 files): Every check that previously only gated on pod phase (Pending / Failed / Unknown) now also rejects pods stuck in Running phase with a known-bad container-waiting reason (CrashLoopBackOff, ImagePullBackOff, ErrImagePull, CreateContainerConfigError). The phase-only filter missed CrashLoopBackOff entirely because such pods typically remain in Running phase with unhealthy containers. Same pattern NFD adopted via #1234. Label-scoped in shared namespaces (kube-system, monitoring) to avoid sibling false-matches; namespace-only otherwise. Also addresses two post-merge review comments on PR #1244: - validators/chainsaw/allowlist_test.go: provider prefix changed from "." to "" to match the canonical defaultEmbeddedProvider used at runtime (head-scratch elimination, no behavior change — filepath.Join cleans both forms equivalently). - Same file: added an explicit CROSS-TEST DEPENDENCY comment block noting that empty-assertFile and unreadable-path failure modes are owned by pkg/recipe.TestComponentRegistry_RequiresHealthCheck. If that test is ever removed or weakened, this test must be strengthened in lockstep so a broken registry can't silently green-light here. Out of scope (deferred to follow-up work): - CRD Established=True for the remaining 6 operators that own CRDs but where I lack high-confidence enumeration without pulling the charts: gatekeeper, kueue, kubeflow-trainer, network-operator, k8s-nim-operator, slinky-slurm-operator. Each needs verification against its pinned chart version; bundling them blindly would risk introducing typo'd names that fail on live clusters. Tracking as a possible follow-up after this PR proves the audit-and-enhance pattern works end-to-end. - StatefulSet readiness for kube-prometheus-stack's Prometheus / Alertmanager: similar reasoning — the chart's exact StatefulSet names and label conventions need verification. Allowlist sweep (TestValidateTestReadOnly_RegistryContent) + registry lint guard (TestComponentRegistry_RequiresHealthCheck) both pass: 27/27 components valid. Live-cluster validation deferred to mchmarny per the same agreement as #1235 / #1243. Refs #660 (epic), #1219, #1220, #1221, #1223, #1234.
Closes #1222. With this merged, epic #660 is fully done. The PR-time runtime contract from #1219/#1220/#1223 is in place; this PR closes the depth-of-coverage gap on the 21 health checks that landed before the epic and were shallow by today's standards. Per the audit in the PR body, three classes of enhancement: 1. DaemonSet partial-readiness fix (2 files): aws-efa, nvidia-dra-driver-gpu were gating on `numberReady > 0`, which passed on partial failures. Switched to `numberReady == desiredNumberScheduled` with `desiredNumberScheduled > 0` as the vacuous-pass guard — same pattern that fixed NFD in #1234 and the new gke-nccl-tcpxo in #1243. 2. Owned-CRD Established=True assertions (4 components): - gpu-operator: clusterpolicies.nvidia.com + the singleton `cluster-policy` CR with status.state == "ready". This MIGRATES the verifyClusterPolicyReady Go check that PR #1235 task 7 deferred for assertion-equivalence proof; the static Chainsaw assertion is now the canonical readiness signal. - dynamo-platform: dynamocomponentdeployments.nvidia.com. - nodewright-operator: skyhooks.skyhook.nvidia.com. - cert-manager: certificates / issuers / clusterissuers .cert-manager.io. CRD names verified against tests/chainsaw/ai-conformance/cluster/assert-crds.yaml and the Go verifyClusterPolicyReady check. 3. cert-manager scope fix: Previously asserted only the controller Deployment, missing cainjector + webhook. The webhook in particular is load-bearing for admission of any Certificate/Issuer CR; a healthy controller with a dead webhook would silently pass while the cluster rejects every cert-manager CR. Now asserts all three Deployments. 4. Universal container-state error coverage (21 files): Every check that previously only gated on pod phase (Pending / Failed / Unknown) now also rejects pods stuck in Running phase with a known-bad container-waiting reason (CrashLoopBackOff, ImagePullBackOff, ErrImagePull, CreateContainerConfigError). The phase-only filter missed CrashLoopBackOff entirely because such pods typically remain in Running phase with unhealthy containers. Same pattern NFD adopted via #1234. Label-scoped in shared namespaces (kube-system, monitoring) to avoid sibling false-matches; namespace-only otherwise. Also addresses two post-merge review comments on PR #1244: - validators/chainsaw/allowlist_test.go: provider prefix changed from "." to "" to match the canonical defaultEmbeddedProvider used at runtime (head-scratch elimination, no behavior change — filepath.Join cleans both forms equivalently). - Same file: added an explicit CROSS-TEST DEPENDENCY comment block noting that empty-assertFile and unreadable-path failure modes are owned by pkg/recipe.TestComponentRegistry_RequiresHealthCheck. If that test is ever removed or weakened, this test must be strengthened in lockstep so a broken registry can't silently green-light here. Out of scope (deferred to follow-up work): - CRD Established=True for the remaining 6 operators that own CRDs but where I lack high-confidence enumeration without pulling the charts: gatekeeper, kueue, kubeflow-trainer, network-operator, k8s-nim-operator, slinky-slurm-operator. Each needs verification against its pinned chart version; bundling them blindly would risk introducing typo'd names that fail on live clusters. Tracking as a possible follow-up after this PR proves the audit-and-enhance pattern works end-to-end. - StatefulSet readiness for kube-prometheus-stack's Prometheus / Alertmanager: similar reasoning — the chart's exact StatefulSet names and label conventions need verification. Allowlist sweep (TestValidateTestReadOnly_RegistryContent) + registry lint guard (TestComponentRegistry_RequiresHealthCheck) both pass: 27/27 components valid. Live-cluster validation deferred to mchmarny per the same agreement as #1235 / #1243. Refs #660 (epic), #1219, #1220, #1221, #1223, #1234.
Summary
Backfill
healthCheck.assertFilefor the 5 registry components thatdon't have one. After this PR lands, every current component has a
chainsaw health check — the precondition for #1223's lint guard.
Fixes: #1221
Related: #660 (epic), #1223
Type of Change
Component(s) Affected
pkg/recipe— registry.yaml + newrecipes/checks/* health-check files)
Implementation Notes
Scope drift from #1221's original text
#1221 named
gke-nccl-tcpxo,dynamo-crds, andkgateway-crds. Theregistry has drifted since the issue was written:
dynamo-crdsandkgateway-crdsaren't in the currentrecipes/registry.yaml(theywere renamed or folded into other components). The actual 5 current
gaps are covered here. The acceptance criterion is "every current
registry component has a
healthCheck.assertFile" — sticking to thedated list would leave 4 gaps and prevent #1223 from enforcing.
Per out-of-band agreement, the issue body is not updated; the scope
drift is captured here in the PR description instead.
The 5 components
gke-nccl-tcpxokube-system(nccl-tcpxo-installer,device-injector) —numberReady == desiredNumberScheduledgrovegrove-operatorDeployment indynamo-system—Available=True+ Pod-phase/container-state error guardsagentgateway-crdsEstablished=Trueprometheus-operator-crdsmonitoring.coreos.com/v1CRDs —Established=Trueslinky-slurm-operator-crdsslinky.slurm.netCRDs —Established=TrueResource-name provenance
gke-nccl-tcpxo— DaemonSet names verified from in-treerecipes/components/gke-nccl-tcpxo/manifests/{nccl-tcpxo-installer,nri-device-injector}.yaml.grove— Deployment name + namespace verified from the existingtests/chainsaw/ai-conformance/kind-inference-dynamo/assert-dynamo.yamlassertion (so the registry check produces the same signal the
ai-conformance suite exercises).
agentgateway-crds— 10 CRD names verified from the in-treevendored manifests under
recipes/components/agentgateway-crds/manifests/. The chart's ownagentgateway.dev/v1alpha1CRDs (per the chart's values.yaml header:AgentgatewayBackend / Policy / Parameters) are documented in the
check header but not asserted — their plural
metadata.namevaluesaren't vendored in-tree at v2.2.1 and need live verification.
prometheus-operator-crds— 8 CRD names from the chart'svalues.yaml header; well-known upstream prometheus-operator shape.
slinky-slurm-operator-crds— 6 CRD kinds from the chart'svalues.yaml header (Accounting / Controller / LoginSet / NodeSet /
RestApi / Token). Plural
metadata.namevalues derived fromcontroller-gen defaults; flagged for live verification in the check
header.
DaemonSet rollout pattern
For
gke-nccl-tcpxo, both DaemonSets usenumberReady == desiredNumberScheduled(with `desiredNumberScheduledTesting
go test -race -count=1 ./validators/chainsaw/ ./pkg/recipe/ golangci-lint run -c .golangci.yaml ./validators/chainsaw/... ./pkg/recipe/...TestValidateTestReadOnly_RegistryContent(validators/chainsaw)walks every
recipes/checks/*/health-check.yamland asserts itpasses the read-only allowlist. With this PR's 5 new files, the
sweep now covers 27 components (was 22). All pass.
pkg/recipetests still pass.lintclean.Live-cluster validation required by #660 acceptance criteria —
deferred to @mchmarny per the same agreement as #1235. The two CRD
checks where chart-CRD plurals are not vendored in-tree
(
agentgateway-crdschart CRDs,slinky-slurm-operator-crds) arethe priority for live verification; the others are deducible from
in-tree content.
Risk Assessment
entries. No Go code changes. No runtime behavior change for
existing components. Easy to revert per-component.
Rollout notes: Once #1235 lit up the deployment-phase chainsaw
runner, every component with a registry
assertFileruns its checkduring
aicr validate --phase deployment. The 5 new checks becomeactive on the next bundle + deploy. If a check has a name typo
that turns out wrong on the live cluster, the validator surfaces a
clear
[chainsaw] <component>: health check failedline with thechainsaw output; the resolution is to update the YAML and re-deploy.
Checklist
make testwith-race)now covers 27 files; new tests not needed since the existing
test walks the registry content)
git commit -S)