Summary
Epic tracking the migration of aicr validate --phase deployment from its
current shallow baseline (Active namespaces, expectedResources, three
GPU-specific deep checks from #611) to registry-declared Chainsaw health
checks (Option 2 from #622).
This is a follow-up to the closed predecessor #622. The project has pivoted
away from the pure-Go readiness contract proposed there. Instead, the
existing Chainsaw assertions under recipes/checks/* — already referenced
by healthCheck.assertFile in recipes/registry.yaml but unused at runtime
— become the source of truth for post-install readiness.
Work is broken into 5 child issues that land independently.
Problem / Use Case
After #611, deployment validation is stronger but still asymmetric:
- baseline checks run for enabled components (Active namespaces, healthy declared expectedResources)
- targeted deep checks run for three GPU-specific gaps (skyhook-customizations, nvidia-dra-driver-gpu, gpu-operator)
- most other components get only shallow validation unless they declare expectedResources
Meanwhile, recipes/checks/* already contains Chainsaw health checks for
19 of the 22 registry components, and healthCheck.assertFile is already
declared for them in recipes/registry.yaml. These checks are the
highest-fidelity readiness logic the project has, but they are not wired
into aicr validate --phase deployment today:
- healthCheck.assertFile is not hydrated into ComponentRef.HealthCheckAsserts during recipe resolution
- pkg/recipe/metadata.go intentionally skips loading the assert content
- the deployment validator image does not ship the chainsaw binary
- the deployment-phase Chainsaw runner that exists in the code path is therefore dormant
The result: the existing Chainsaw assertions are maintained but unused by aicr validate --phase deployment, and the deployment phase re-implements
a shallower version of the same intent.
Goal
Make aicr validate --phase deployment execute the existing Chainsaw
health checks for every registry component that has one, enhance / backfill
those checks so the deployment phase provides deep, symmetric, post-install
validation across the registry, and lock the invariant with lint
enforcement.
No new validation phase. No separate standalone validator. Reuse of the
existing recipes/checks/* content is the contract.
Child Issues
- healthCheck.assertFile + suppression sentinel (pkg/recipe). No runtime behavior change.
- chainsaw binary in validator image + wire runner into deployment phase. Named-constant timeouts/margin/parallelism. Relax validators/deployment/expected_resources.go:86 gate. Read-only allowlist (assert/error only). Dedup/source-tag CLI output.
- gke-nccl-tcpxo, dynamo-crds, kgateway-crds. Live-run against h100-eks-ubuntu-inference-dynamo.
- healthCheck.assertFile for new components; enforce read-only assertion allowlist on registry-declared content.
- Established=True for CRDs, tighter selectors in shared namespaces.
Constraints (apply to every child)
- Validator Jobs consume only the mounted recipe.yaml; they do not read registry.yaml at runtime.
- No Helm API calls, no release-metadata reads, no release-scoped label dependencies (app.kubernetes.io/instance).
- Validator Jobs continue to run under a cluster-admin-bound ServiceAccount (see pkg/validator/job/rbac.go:41-67); registry-declared assert content is restricted to a read-only allowlist (assert and error only) to bound that posture.
- Versions / checksums for chainsaw come from .settings.yaml only (chainsaw and chainsaw_checksums); no duplicate pin in Dockerfile or goreleaser config. The kyverno/chainsaw Go library in go.mod is bumped in lockstep with the binary.
Alternatives Considered
- Pure-Go readiness / customChecks / crds contract (Option 1 in #622, "Add registry-driven deployment readiness for all components"). Not pursued. Deep readiness would have to be re-expressed in the registry while equivalent Chainsaw content already exists, producing two parallel sources of readiness truth.
- … #622; shared namespaces (kube-system, monitoring) would produce false matches.
- … scale and keeps deep readiness asymmetric.
Open Questions
- Does the pinned kgateway-crds v2.0.0 chart at oci://cr.kgateway.dev/kgateway-dev/charts/kgateway-crds ship exactly the six CRDs named in the in-repo values.yaml comment, or any additional ones? PR 3 must re-verify by pulling the chart.
- How are dynamo-crds CRD names enumerated? The pinned chart is not vendored in-repo; PR 3 should document the extraction procedure and pin the enumerated list.
- … gke-nccl-tcpxo health check assert? Predecessor #622 named nccl-tcpxo-installer and device-injector DaemonSets in kube-system; confirm before backfill.
- Does hydration carry HealthCheckAsserts content into the on-disk recipe.yaml for components disabled via overlay overrides.enabled: false? PR 1 should make this explicit.
- Should the lint guard in PR 5 surface at make qualify time (a custom Go check in the pkg/recipe test suite) or at load time in aicr recipe?
Related