Skip to content

feat(recipes): backfill chainsaw health checks for 5 missing components#1243

Merged
mchmarny merged 1 commit into
mainfrom
feat/1221-backfill-missing-health-checks
Jun 9, 2026
Merged

feat(recipes): backfill chainsaw health checks for 5 missing components#1243
mchmarny merged 1 commit into
mainfrom
feat/1221-backfill-missing-health-checks

Conversation

@mchmarny

@mchmarny mchmarny commented Jun 9, 2026

Copy link
Copy Markdown
Member

Summary

Backfill healthCheck.assertFile for the 5 registry components that
don't have one. After this PR lands, every current component has a
chainsaw health check — the precondition for #1223's lint guard.

Fixes: #1221
Related: #660 (epic), #1223

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • Recipe engine / data (pkg/recipe — registry.yaml + new
    recipes/checks/* health-check files)

Implementation Notes

Scope drift from #1221's original text

#1221 named gke-nccl-tcpxo, dynamo-crds, and kgateway-crds. The
registry has drifted since the issue was written: dynamo-crds and
kgateway-crds aren't in the current recipes/registry.yaml (they
were renamed or folded into other components). The actual 5 current
gaps are covered here. The acceptance criterion is "every current
registry component has a healthCheck.assertFile" — sticking to the
dated list would leave 4 gaps and prevent #1223 from enforcing.

Per out-of-band agreement, the issue body is not updated; the scope
drift is captured here in the PR description instead.

The 5 components

Component Type Assertions
gke-nccl-tcpxo Manifest-only 2 DaemonSets in kube-system (nccl-tcpxo-installer, device-injector) — numberReady == desiredNumberScheduled
grove Helm chart grove-operator Deployment in dynamo-systemAvailable=True + Pod-phase/container-state error guards
agentgateway-crds CRD-only 10 vendored Gateway API + Inference Extension CRDs — Established=True
prometheus-operator-crds CRD-only 8 monitoring.coreos.com/v1 CRDs — Established=True
slinky-slurm-operator-crds CRD-only 6 slinky.slurm.net CRDs — Established=True

Resource-name provenance

  • gke-nccl-tcpxo — DaemonSet names verified from in-tree
    recipes/components/gke-nccl-tcpxo/manifests/{nccl-tcpxo-installer,nri-device-injector}.yaml.
  • grove — Deployment name + namespace verified from the existing
    tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-dynamo.yaml
    assertion (so the registry check produces the same signal the
    ai-conformance suite exercises).
  • agentgateway-crds — 10 CRD names verified from the in-tree
    vendored manifests under
    recipes/components/agentgateway-crds/manifests/. The chart's own
    agentgateway.dev/v1alpha1 CRDs (per the chart's values.yaml header:
    AgentgatewayBackend / Policy / Parameters) are documented in the
    check header but not asserted — their plural metadata.name values
    aren't vendored in-tree at v2.2.1 and need live verification.
  • prometheus-operator-crds — 8 CRD names from the chart's
    values.yaml header; well-known upstream prometheus-operator shape.
  • slinky-slurm-operator-crds — 6 CRD kinds from the chart's
    values.yaml header (Accounting / Controller / LoginSet / NodeSet /
    RestApi / Token). Plural metadata.name values derived from
    controller-gen defaults; flagged for live verification in the check
    header.

DaemonSet rollout pattern

For gke-nccl-tcpxo, both DaemonSets use
numberReady == desiredNumberScheduled (with `desiredNumberScheduled

0as a vacuous-pass guard). Same pattern as the NFD fix in 627ad48:numberUnavailableisomitemptyand disappears from the status when zero, sonumberUnavailable == 0evaluatesnull == 0` (false)
on a healthy DaemonSet.

Testing

go test -race -count=1 ./validators/chainsaw/ ./pkg/recipe/
golangci-lint run -c .golangci.yaml ./validators/chainsaw/... ./pkg/recipe/...
  • TestValidateTestReadOnly_RegistryContent (validators/chainsaw)
    walks every recipes/checks/*/health-check.yaml and asserts it
    passes the read-only allowlist. With this PR's 5 new files, the
    sweep now covers 27 components (was 22). All pass.
  • All in-tree pkg/recipe tests still pass.
  • lint clean.

Live-cluster validation required by #660 acceptance criteria —
deferred to @mchmarny per the same agreement as #1235. The two CRD
checks where chart-CRD plurals are not vendored in-tree
(agentgateway-crds chart CRDs, slinky-slurm-operator-crds) are
the priority for live verification; the others are deducible from
in-tree content.

Risk Assessment

  • Low — Five new declarative YAML files + 5 one-line registry
    entries. No Go code changes. No runtime behavior change for
    existing components. Easy to revert per-component.

Rollout notes: Once #1235 lit up the deployment-phase chainsaw
runner, every component with a registry assertFile runs its check
during aicr validate --phase deployment. The 5 new checks become
active on the next bundle + deploy. If a check has a name typo
that turns out wrong on the live cluster, the validator surfaces a
clear [chainsaw] <component>: health check failed line with the
chainsaw output; the resolution is to update the YAML and re-deploy.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (changed paths)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (allowlist sweep
    now covers 27 files; new tests not needed since the existing
    test walks the registry content)
  • User-facing behavior change covered in this description
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

Closes #1221. Every current registry component now has a
healthCheck.assertFile, which is the precondition for PR #1223 (lint
guard requiring assertFile on new entries).

The original #1221 text named gke-nccl-tcpxo, dynamo-crds, and
kgateway-crds; the registry has drifted since the issue was written
(dynamo-crds / kgateway-crds were renamed / folded into other
components). The actual 5 current gaps are covered here. #1221's
acceptance criterion is "every current registry component has a
healthCheck.assertFile" — sticking to the dated list would leave 4
gaps and prevent #1223 from enforcing.

Components and asserts:

- gke-nccl-tcpxo (manifest-only): asserts the two DaemonSets shipped
  via recipes/components/gke-nccl-tcpxo/manifests/ —
  nccl-tcpxo-installer and device-injector in kube-system — using
  numberReady == desiredNumberScheduled (same pattern as the NFD fix
  in 627ad48; numberUnavailable is omitempty and disappears when zero).

- grove (Helm, ai-dynamo/grove v0.1.0-alpha.6): asserts the
  grove-operator Deployment in dynamo-system reaches
  Available=True, mirroring the existing assertion in
  tests/chainsaw/ai-conformance/kind-inference-dynamo/assert-dynamo.yaml
  so the registry check produces the same signal the ai-conformance
  suite exercises. Adds Pod-phase / container-state error guards.

- agentgateway-crds: asserts Established=True on the 10 vendored CRDs
  under recipes/components/agentgateway-crds/manifests/ (Gateway API
  + Inference Extension). The chart's own
  agentgateway.dev/v1alpha1 CRDs (AgentgatewayBackend / Policy /
  Parameters per the values.yaml header) are documented in the check
  header but not asserted — their plural metadata.name values aren't
  vendored in-tree at v2.2.1 and are flagged for live verification.

- prometheus-operator-crds (28.0.1): asserts Established=True on the
  8 monitoring.coreos.com/v1 CRDs enumerated in the chart's
  values.yaml header — well-known upstream shape.

- slinky-slurm-operator-crds (1.1.0): asserts Established=True on
  the 6 slinky.slurm.net CRDs enumerated in the chart's values.yaml
  header (Accounting / Controller / LoginSet / NodeSet / RestApi /
  Token). Plural metadata.name values derived from controller-gen
  defaults; flagged for live verification in the check header.

All 5 YAML files use only the read-only assert/error allowlist
(verified by TestValidateTestReadOnly_RegistryContent which sweeps
every in-tree recipes/checks/*/health-check.yaml). Live-cluster
validation against h100-eks-ubuntu-inference-dynamo (and
h100-gke-cos-training for gke-nccl-tcpxo) is the load-bearing
correctness step — deferred to mchmarny per the same agreement as
#1235.

Refs #660 (epic).
@mchmarny mchmarny requested a review from a team as a code owner June 9, 2026 09:44
@mchmarny mchmarny self-assigned this Jun 9, 2026
@github-actions github-actions Bot added the size/L label Jun 9, 2026
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 037ffa22-cf45-4dfc-96fa-cb93b4f379f6

📥 Commits

Reviewing files that changed from the base of the PR and between 25a6cdd and fbe5d86.

📒 Files selected for processing (6)
  • recipes/checks/agentgateway-crds/health-check.yaml
  • recipes/checks/gke-nccl-tcpxo/health-check.yaml
  • recipes/checks/grove/health-check.yaml
  • recipes/checks/prometheus-operator-crds/health-check.yaml
  • recipes/checks/slinky-slurm-operator-crds/health-check.yaml
  • recipes/registry.yaml

📝 Walkthrough

Walkthrough

This PR adds Chainsaw health-check test manifests for five components that were previously missing health-check definitions. The changes include new test files validating component readiness (CRD establishment for three CRD-only components, DaemonSet rollout for gke-nccl-tcpxo, and Deployment/Pod health for grove) and corresponding registry entries wiring each test to its component.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related issues

  • NVIDIA/aicr#660: Directly implements the backfill of health-check manifests and registry entries required by the epic.
  • NVIDIA/aicr#1223: Adds missing healthCheck.assertFile entries and Chainsaw assert-only tests, satisfying the lint guard and allowlist requirements for the same components.
  • NVIDIA/aicr#1222: Adds similar Chainsaw health-check Test manifests under recipes/checks/ and updates recipes/registry.yaml for component registration.

Possibly related PRs

  • NVIDIA/aicr#1231: Hydrates registry healthCheck.assertFile into ComponentRef.HealthCheckAsserts and gates execution—directly relied upon by these new registry entries.
  • NVIDIA/aicr#1235: Changes the deployment validator and hydration pipeline to execute registry-declared Chainsaw tests during aicr validate --phase deployment.
  • NVIDIA/aicr#1234: Updates the validator to detect and partition Chainsaw Test-format asserts, impacting how these newly added health checks will be executed.

Suggested labels

size/M

Suggested reviewers

  • lockwobr
  • lalitadithya
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(recipes): backfill chainsaw health checks for 5 missing components' accurately and concisely describes the main change: adding health checks for five missing components.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, explaining the five components, implementation details, resource-name provenance, testing, and risk assessment.
Linked Issues check ✅ Passed The PR fulfills the primary objective from issue #1221: adding chainsaw health checks for registry components so every current component has a check. While the original issue named three components, the PR correctly addresses the actual five current gaps in the registry, explaining the scope drift (dynamo-crds and kgateway-crds no longer exist in registry.yaml). Resource names are documented with verifiable provenance, and all checks use the read-only assert/error allowlist as required.
Out of Scope Changes check ✅ Passed All changes are in-scope: five new health-check YAML files in recipes/checks/* and five one-line registry entries in recipes/registry.yaml. No extraneous or unrelated modifications are present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/1221-backfill-missing-health-checks

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Coverage Report ✅

Metric Value
Coverage 76.3%
Threshold 75%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-76.3%25-green)

No Go source files changed in this PR.

@mchmarny

mchmarny commented Jun 9, 2026

Copy link
Copy Markdown
Member Author

make check-health COMPONENT=agentgateway-crds passes

@mchmarny mchmarny merged commit cad0142 into main Jun 9, 2026
168 of 171 checks passed
@mchmarny mchmarny deleted the feat/1221-backfill-missing-health-checks branch June 9, 2026 09:58
mchmarny added a commit that referenced this pull request Jun 9, 2026
Closes #1222. With this merged, epic #660 is fully done.

The PR-time runtime contract from #1219/#1220/#1223 is in place; this
PR closes the depth-of-coverage gap on the 21 health checks that
landed before the epic and were shallow by today's standards. Per the
audit in the PR body, three classes of enhancement:

1. DaemonSet partial-readiness fix (2 files):
   aws-efa, nvidia-dra-driver-gpu were gating on `numberReady > 0`,
   which passed on partial failures. Switched to
   `numberReady == desiredNumberScheduled` with
   `desiredNumberScheduled > 0` as the vacuous-pass guard — same
   pattern that fixed NFD in #1234 and the new gke-nccl-tcpxo in
   #1243.

2. Owned-CRD Established=True assertions (4 components):
   - gpu-operator: clusterpolicies.nvidia.com + the singleton
     `cluster-policy` CR with status.state == "ready". This
     MIGRATES the verifyClusterPolicyReady Go check that PR #1235
     task 7 deferred for assertion-equivalence proof; the static
     Chainsaw assertion is now the canonical readiness signal.
   - dynamo-platform: dynamocomponentdeployments.nvidia.com.
   - nodewright-operator: skyhooks.skyhook.nvidia.com.
   - cert-manager: certificates / issuers / clusterissuers
     .cert-manager.io.
   CRD names verified against
   tests/chainsaw/ai-conformance/cluster/assert-crds.yaml and the
   Go verifyClusterPolicyReady check.

3. cert-manager scope fix:
   Previously asserted only the controller Deployment, missing
   cainjector + webhook. The webhook in particular is load-bearing
   for admission of any Certificate/Issuer CR; a healthy controller
   with a dead webhook would silently pass while the cluster rejects
   every cert-manager CR. Now asserts all three Deployments.

4. Universal container-state error coverage (21 files):
   Every check that previously only gated on pod phase (Pending /
   Failed / Unknown) now also rejects pods stuck in Running phase
   with a known-bad container-waiting reason (CrashLoopBackOff,
   ImagePullBackOff, ErrImagePull, CreateContainerConfigError). The
   phase-only filter missed CrashLoopBackOff entirely because such
   pods typically remain in Running phase with unhealthy
   containers. Same pattern NFD adopted via #1234. Label-scoped in
   shared namespaces (kube-system, monitoring) to avoid sibling
   false-matches; namespace-only otherwise.

Also addresses two post-merge review comments on PR #1244:

- validators/chainsaw/allowlist_test.go: provider prefix changed
  from "." to "" to match the canonical defaultEmbeddedProvider
  used at runtime (head-scratch elimination, no behavior change —
  filepath.Join cleans both forms equivalently).
- Same file: added an explicit CROSS-TEST DEPENDENCY comment block
  noting that empty-assertFile and unreadable-path failure modes
  are owned by pkg/recipe.TestComponentRegistry_RequiresHealthCheck.
  If that test is ever removed or weakened, this test must be
  strengthened in lockstep so a broken registry can't silently
  green-light here.

Out of scope (deferred to follow-up work):

- CRD Established=True for the remaining 6 operators that own CRDs
  but where I lack high-confidence enumeration without pulling the
  charts: gatekeeper, kueue, kubeflow-trainer, network-operator,
  k8s-nim-operator, slinky-slurm-operator. Each needs verification
  against its pinned chart version; bundling them blindly would
  risk introducing typo'd names that fail on live clusters.
  Tracking as a possible follow-up after this PR proves the
  audit-and-enhance pattern works end-to-end.
- StatefulSet readiness for kube-prometheus-stack's Prometheus /
  Alertmanager: similar reasoning — the chart's exact StatefulSet
  names and label conventions need verification.

Allowlist sweep (TestValidateTestReadOnly_RegistryContent) +
registry lint guard (TestComponentRegistry_RequiresHealthCheck)
both pass: 27/27 components valid. Live-cluster validation
deferred to mchmarny per the same agreement as #1235 / #1243.

Refs #660 (epic), #1219, #1220, #1221, #1223, #1234.
mchmarny added a commit that referenced this pull request Jun 9, 2026
Closes #1222. With this merged, epic #660 is fully done.

The PR-time runtime contract from #1219/#1220/#1223 is in place; this
PR closes the depth-of-coverage gap on the 21 health checks that
landed before the epic and were shallow by today's standards. Per the
audit in the PR body, three classes of enhancement:

1. DaemonSet partial-readiness fix (2 files):
   aws-efa, nvidia-dra-driver-gpu were gating on `numberReady > 0`,
   which passed on partial failures. Switched to
   `numberReady == desiredNumberScheduled` with
   `desiredNumberScheduled > 0` as the vacuous-pass guard — same
   pattern that fixed NFD in #1234 and the new gke-nccl-tcpxo in
   #1243.

2. Owned-CRD Established=True assertions (4 components):
   - gpu-operator: clusterpolicies.nvidia.com + the singleton
     `cluster-policy` CR with status.state == "ready". This
     MIGRATES the verifyClusterPolicyReady Go check that PR #1235
     task 7 deferred for assertion-equivalence proof; the static
     Chainsaw assertion is now the canonical readiness signal.
   - dynamo-platform: dynamocomponentdeployments.nvidia.com.
   - nodewright-operator: skyhooks.skyhook.nvidia.com.
   - cert-manager: certificates / issuers / clusterissuers
     .cert-manager.io.
   CRD names verified against
   tests/chainsaw/ai-conformance/cluster/assert-crds.yaml and the
   Go verifyClusterPolicyReady check.

3. cert-manager scope fix:
   Previously asserted only the controller Deployment, missing
   cainjector + webhook. The webhook in particular is load-bearing
   for admission of any Certificate/Issuer CR; a healthy controller
   with a dead webhook would silently pass while the cluster rejects
   every cert-manager CR. Now asserts all three Deployments.

4. Universal container-state error coverage (21 files):
   Every check that previously only gated on pod phase (Pending /
   Failed / Unknown) now also rejects pods stuck in Running phase
   with a known-bad container-waiting reason (CrashLoopBackOff,
   ImagePullBackOff, ErrImagePull, CreateContainerConfigError). The
   phase-only filter missed CrashLoopBackOff entirely because such
   pods typically remain in Running phase with unhealthy
   containers. Same pattern NFD adopted via #1234. Label-scoped in
   shared namespaces (kube-system, monitoring) to avoid sibling
   false-matches; namespace-only otherwise.

Also addresses two post-merge review comments on PR #1244:

- validators/chainsaw/allowlist_test.go: provider prefix changed
  from "." to "" to match the canonical defaultEmbeddedProvider
  used at runtime (head-scratch elimination, no behavior change —
  filepath.Join cleans both forms equivalently).
- Same file: added an explicit CROSS-TEST DEPENDENCY comment block
  noting that empty-assertFile and unreadable-path failure modes
  are owned by pkg/recipe.TestComponentRegistry_RequiresHealthCheck.
  If that test is ever removed or weakened, this test must be
  strengthened in lockstep so a broken registry can't silently
  green-light here.

Out of scope (deferred to follow-up work):

- CRD Established=True for the remaining 6 operators that own CRDs
  but where I lack high-confidence enumeration without pulling the
  charts: gatekeeper, kueue, kubeflow-trainer, network-operator,
  k8s-nim-operator, slinky-slurm-operator. Each needs verification
  against its pinned chart version; bundling them blindly would
  risk introducing typo'd names that fail on live clusters.
  Tracking as a possible follow-up after this PR proves the
  audit-and-enhance pattern works end-to-end.
- StatefulSet readiness for kube-prometheus-stack's Prometheus /
  Alertmanager: similar reasoning — the chart's exact StatefulSet
  names and label conventions need verification.

Allowlist sweep (TestValidateTestReadOnly_RegistryContent) +
registry lint guard (TestComponentRegistry_RequiresHealthCheck)
both pass: 27/27 components valid. Live-cluster validation
deferred to mchmarny per the same agreement as #1235 / #1243.

Refs #660 (epic), #1219, #1220, #1221, #1223, #1234.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Backfill health checks: gke-nccl-tcpxo, dynamo-crds, kgateway-crds (#660 PR 3)

2 participants