[EPIC] Reuse Chainsaw health checks in aicr validate --phase deployment #660

@yuanchen8911

Summary

Epic tracking the migration of aicr validate --phase deployment from its
current shallow baseline (Active namespaces, expectedResources, three
GPU-specific deep checks from #611) to registry-declared Chainsaw health
checks (Option 2 from #622).

This is a follow-up to the closed predecessor #622. The project has pivoted
away from the pure-Go readiness contract proposed there. Instead, the
existing Chainsaw assertions under recipes/checks/* — already referenced
by healthCheck.assertFile in recipes/registry.yaml but unused at runtime
— become the source of truth for post-install readiness.

Work is broken into 5 child issues that land independently.

Problem / Use Case

After #611, deployment validation is stronger but still asymmetric:

  • baseline checks run for enabled components (Active namespaces, healthy
    declared expectedResources)
  • targeted deep checks run for three GPU-specific gaps
    (skyhook-customizations, nvidia-dra-driver-gpu, gpu-operator)
  • most other components get only shallow validation unless they declare
    expectedResources

Meanwhile, recipes/checks/* already contains Chainsaw health checks for
19 of the 22 registry components, and healthCheck.assertFile is already
declared for them in recipes/registry.yaml. These checks are the
highest-fidelity readiness logic the project has, but they are not wired
into aicr validate --phase deployment today:

  • healthCheck.assertFile is not hydrated into
    ComponentRef.HealthCheckAsserts during recipe resolution
  • pkg/recipe/metadata.go intentionally skips loading the assert content
  • the deployment validator image does not ship the chainsaw binary
  • the deployment-phase Chainsaw runner that exists in the code path is
    therefore dormant

The result: the existing Chainsaw assertions are maintained but unused by
aicr validate --phase deployment, and the deployment phase re-implements
a shallower version of the same intent.

Goal

Make aicr validate --phase deployment execute the existing Chainsaw
health checks for every registry component that has one, enhance / backfill
those checks so the deployment phase provides deep, symmetric, post-install
validation across the registry, and lock the invariant with lint
enforcement.

No new validation phase. No separate standalone validator. Reuse of the
existing recipes/checks/* content is the contract.

Child Issues

Constraints (apply to every child)

  • Validator Jobs consume only the mounted recipe.yaml; they do not read
    registry.yaml at runtime.
  • No Helm API calls, no release-metadata reads, no release-scoped label
    dependencies (app.kubernetes.io/instance).
  • Validator Jobs continue to run under a cluster-admin-bound
    ServiceAccount (see pkg/validator/job/rbac.go:41-67); registry-declared
    assert content is restricted to a read-only allowlist (assert and error
    only) to bound that posture.
  • Versions / checksums for chainsaw come from .settings.yaml only
    (chainsaw and chainsaw_checksums); no duplicate pin in the Dockerfile or
    goreleaser config. The kyverno/chainsaw Go library in go.mod is bumped
    in lockstep with the binary.
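The read-only allowlist constraint could be enforced with a check along these lines. This is a sketch under stated assumptions: it presumes the assert content has already been parsed into steps and that each operation's kind has been extracted as a string. The Step type and checkReadOnly function are hypothetical and do not mirror the Chainsaw library's own types.

```go
package main

import (
	"fmt"
	"strings"
)

// allowedOps is the read-only allowlist named in the constraint.
var allowedOps = map[string]bool{"assert": true, "error": true}

// Step is a hypothetical, pre-parsed view of one test step: a name plus
// the kinds of the operations it contains (e.g. "assert", "script").
type Step struct {
	Name string
	Ops  []string
}

// checkReadOnly returns an error naming every operation kind that falls
// outside the allowlist, or nil when all operations are read-only.
func checkReadOnly(steps []Step) error {
	var bad []string
	for _, s := range steps {
		for _, kind := range s.Ops {
			if !allowedOps[kind] {
				bad = append(bad, fmt.Sprintf("%s: %s", s.Name, kind))
			}
		}
	}
	if len(bad) > 0 {
		return fmt.Errorf("disallowed operations: %s", strings.Join(bad, ", "))
	}
	return nil
}

func main() {
	steps := []Step{
		{Name: "ready", Ops: []string{"assert", "error"}},
		{Name: "cleanup", Ops: []string{"script"}},
	}
	fmt.Println(checkReadOnly(steps)) // disallowed operations: cleanup: script
}
```

Rejecting by default (anything not explicitly allowed fails) keeps the check safe if new mutating operation kinds appear later.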

Alternatives Considered

Open Questions

  • Does the pinned kgateway-crds v2.0.0 chart at
    oci://cr.kgateway.dev/kgateway-dev/charts/kgateway-crds ship exactly
    the six CRDs named in the in-repo values.yaml comment, or any
    additional ones? PR 3 must re-verify by pulling the chart.
  • How are dynamo-crds CRD names enumerated? The pinned chart is not
    vendored in-repo; PR 3 should document the extraction procedure and
    pin the enumerated list.
  • What workloads should the gke-nccl-tcpxo health check assert?
    Predecessor #622 named the nccl-tcpxo-installer and device-injector
    DaemonSets in kube-system; confirm before backfill.
  • Does hydration carry HealthCheckAsserts content into the on-disk
    recipe.yaml for components disabled via overlay
    overrides.enabled: false? PR 1 should make this explicit.
  • Should the lint guard in PR 5 surface at make qualify time (as a
    custom Go check in the pkg/recipe test suite) or at load time in
    aicr recipe?

Related
