feat(validator): ship chainsaw binary; activate deployment-phase runner by mchmarny · Pull Request #1235 · NVIDIA/aicr

mchmarny · 2026-06-08T22:27:04Z

Summary

Activate the dormant Chainsaw runner so the HealthCheckAsserts
content hydrated by #1219 / #1234 actually executes during
aicr validate --phase deployment. Ships the chainsaw binary in the
deployment validator image, codifies the Job activeDeadlineSeconds
envelope, enforces a runtime read-only operation allowlist, and lifts
the previous mutex between ExpectedResources and registry-declared
HealthCheckAsserts.

Motivation / Context

Predecessor PR #1219 hydrated registry-declared
healthCheck.assertFile content onto ComponentRef.HealthCheckAsserts,
but execution was gated behind ChainsawBinary.Available() because the
deployment validator image had not yet shipped the binary. This PR
ships the binary, removes the gate, and codifies the surrounding
contracts (timeouts, read-only allowlist, dedup output) so the
deployment phase is now driven by registry-declared Chainsaw content.

Fixes: #1220
Related: #660, #1219, #1234, #1223

Type of Change

New feature (non-breaking change that adds functionality)

Component(s) Affected

Recipe engine / data (pkg/recipe)
Validator (validators/chainsaw, validators/deployment,
pkg/validator/catalog)
Build/CI/tooling (Makefile, Dockerfile, .settings.yaml consumption)
Docs (docs/contributor/validator.md)

Implementation Notes

Image

validators/deployment/Dockerfile: new chainsaw-fetch multi-stage
downloads chainsaw_linux_amd64.tar.gz from the kyverno/chainsaw
release, verifies the sha256 against .settings.yaml, and COPYs
the binary into the distroless final stage at
/usr/local/bin/chainsaw.
Makefile image-validators target reads CHAINSAW_VERSION and
CHAINSAW_SHA256_LINUX_AMD64 from .settings.yaml via yq and
passes them as --build-arg only to the deployment-phase build
(conformance + performance images don't use chainsaw).
Single-arch (linux/amd64). The linux_arm64 checksum is already in
.settings.yaml so the multi-arch upgrade path is clear.
go.mod's github.com/kyverno/chainsaw v0.2.15 is already in
lockstep with the .settings.yaml pin (no bump needed).

Envelope

New defaults.JobEnvelopeMargin = 60s. Catalog timeout for
expected-resources becomes the Job's activeDeadlineSeconds;
both must exceed ChainsawAssertTimeout so chainsaw has headroom
to terminate, clean up its temp dir, and flush logs before SIGKILL.
recipes/validators/catalog.yaml: expected-resources timeout
bumped 5m → 7m (= 6m + 60s). TestExpectedResourcesCatalogEnvelope
in pkg/validator/catalog asserts the invariant — raising
ChainsawAssertTimeout later without bumping the catalog will fail
the test.

Read-only allowlist

New validators/chainsaw.ValidateTestReadOnly parses each
registry-declared Chainsaw Test YAML and rejects any operation
outside {assert, error}. State-changing (apply, create, delete,
patch, update) and side-effecting (script, command, wait, sleep,
podLogs, events, describe, get) operations are rejected with
ErrCodeInvalidRequest.
Catch / finally / cleanup blocks equivalently restricted.
TestValidateTestReadOnly_RegistryContent sweeps every
recipes/checks/*/health-check.yaml — guarantees no registry-
declared check violates the contract.
Bounds the cluster-admin RBAC posture of the deployment validator
Job (see pkg/validator/job/rbac.go). PR Registry lint guard for healthCheck.assertFile + allowlist (#660 PR 5) #1223 will add the same
enforcement at lint time.

Validator runtime

Dropped the && len(ref.ExpectedResources) == 0 gate in
validators/deployment/expected_resources.go. Both paths now run
per component. Output is source-tagged [expectedResources] /
[chainsaw].
Hard-removed ChainsawBinary.Available() and the associated PR
feat(recipe): hydrate healthCheck.assertFile + suppression sentinel #1231 skip path (per pre-PR decision). The binary is now a hard
requirement of the deployment image; a regression that drops it
surfaces as a clear RunTest error rather than a silent skip.
pkg/recipe.hydrateHealthCheckAsserts no longer skips when
ExpectedResources is set. The transitional skip added in fix(validator): address PR #1231 review on chainsaw hydration runtime #1234 is
reverted in lockstep with the validator-side gate drop so the
artifact matches what runs.

GPU deep-check migration (deferred)

clusterPolicyReady (issue enhance(validator): add targeted post-deployment GPU readiness checks #611 deep check) migration to
registry-declared Chainsaw YAML is deferred to a focused
follow-up. The migration must prove assertion equivalence against
expected_resources_test.go's heavily-mocked ClusterPolicy state
coverage; bundling that semantic argument here would bloat review.
Disposition documented in the new doc on verifyGPUReadinessSignals.
verifyNodewrightReady (formerly skyhookReady) stays in Go:
names are derived dynamically from recipe ManifestFiles at
validate-time. Static Chainsaw YAML cannot express the
dynamic-name selector.
verifyDRAKubeletPluginReady stays in Go: chart's full DaemonSet
name is release-derived. Encoding it would violate the
deployer-neutrality constraint ([EPIC] Reuse Chainsaw health checks in aicr validate --phase deployment #660 issue body).

Testing

go test -race -count=1 ./pkg/recipe/ ./pkg/defaults/ ./pkg/validator/catalog/ ./validators/...
golangci-lint run -c .golangci.yaml ./pkg/recipe/... ./pkg/defaults/... ./pkg/validator/catalog/... ./validators/...

New tests:

pkg/defaults.TestJobEnvelopeMarginInvariant — guards the
envelope contract.
pkg/validator/catalog.TestExpectedResourcesCatalogEnvelope —
guards the catalog timeout is >= ChainsawAssertTimeout + JobEnvelopeMargin.
validators/chainsaw.TestValidateTestReadOnly (9 sub-cases) +
TestValidateTestReadOnly_RegistryContent (sweeps in-tree checks).
validators/chainsaw.TestNewChainsawBinary (PATH hit + miss).
pkg/recipe.TestHydrateHealthCheckAsserts_RunsAlongsideExpectedResources
— replaces fix(validator): address PR #1231 review on chainsaw hydration runtime #1234's transitional _ExpectedResourcesWins test.

Coverage delta on changed packages:

validators/chainsaw: 14.4% → 22.1% (+7.7%).
pkg/defaults: 100% (unchanged).
pkg/validator/catalog: 97.5%.
validators/deployment: 44.1%.

Live-cluster validation required by #660 acceptance criteria —
deferred to @mchmarny per out-of-band agreement (no live cluster in
PR author's environment).

Risk Assessment

Medium — Touches the validator image build, drops a runtime
gate, and adds a runtime allowlist. Easy to revert per-component
(Dockerfile, catalog.yaml, runtime branches all isolated).

Rollout notes: No CLI-flag changes. Output gains source-tag
prefixes ([expectedResources] / [chainsaw]). Resolved
recipe.yaml artifacts for components with both assertFile AND
ExpectedResources now carry the hydrated chainsaw content
(previously suppressed transitionally). Recipe authors who declared
expectedResources to suppress hydrated content must add
healthCheckSkip: true instead (existing field from #1219). The
chainsaw binary is now required in the deployment validator image —
operators running custom images must rebuild.

Checklist

Tests pass locally (make test with -race)
Linter passes (changed paths)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs for user-facing behavior changes
(docs/contributor/validator.md)
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

github-actions · 2026-06-08T22:36:51Z

Coverage Report ✅

Metric	Value
Coverage	76.2%
Threshold	75%
Status	Pass

Coverage Badge

![Coverage](https://img.shields.io/badge/coverage-76.2%25-green)

Merging this branch will decrease overall coverage

Impacted Packages	Coverage Δ	🤖
github.com/NVIDIA/aicr/pkg/defaults	100.00% (ø)
github.com/NVIDIA/aicr/pkg/recipe	86.16% (-0.01%)	👎
github.com/NVIDIA/aicr/validators/chainsaw	0.00% (ø)
github.com/NVIDIA/aicr/validators/deployment	0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File	Coverage Δ	Total	Covered	Missed	🤖
github.com/NVIDIA/aicr/pkg/defaults/timeouts.go	0.00% (ø)	0	0	0
github.com/NVIDIA/aicr/pkg/recipe/metadata_store.go	86.13% (-0.07%)	382 (-2)	329 (-2)	53	👎
github.com/NVIDIA/aicr/validators/chainsaw/allowlist.go	0.00% (ø)	0	0	0
github.com/NVIDIA/aicr/validators/chainsaw/binary.go	0.00% (ø)	0	0	0
github.com/NVIDIA/aicr/validators/chainsaw/runner.go	0.00% (ø)	0	0	0
github.com/NVIDIA/aicr/validators/deployment/expected_resources.go	0.00% (ø)	0	0	0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

yuanchen8911

Left comments. PTAL.

yuanchen8911

Merge-blocking CI build-arg gaps remain, and the Chainsaw read-only guard still misses top-level/future catch surfaces. README timeout drift is minor.

yuanchen8911 · 2026-06-08T23:53:49Z

The three GPU check failures look like an infrastructure/runner flake. Re-running the GPU jobs should clear it.

mchmarny · 2026-06-09T00:19:19Z

Tracking the GPU CI flake separately in #1237 — environmental, all 3 failing GPU jobs hit the same nvidia-container-toolkit 1.19.1-1 + containerd-restart fingerprint inside the kind worker. Not in #1235's blast radius (failure occurs in nvkind cluster create's internal apt-install on the worker, before any AICR validator code runs). Same trace appears on main since ~2026-06-07 19:00 UTC.

Live-cluster validation by @njhensley is the load-bearing GPU signal for this PR; the GPU-CI failures don't block that.

yuanchen8911

All the issues are fully resolved. Thanks. /lgtm

The only remaining red checks are the three GPU tests (Inference/Smoke/Training). I re-triggered the failed jobs — hopefully they pass this time.

PR #1220 of epic #660. Activates the dormant Chainsaw runner so the HealthCheckAsserts content hydrated by #1219 / #1234 actually executes during aicr validate --phase deployment. Image ----- - validators/deployment/Dockerfile: new chainsaw-fetch multi-stage downloads the pinned chainsaw release tarball, verifies sha256 against the value in .settings.yaml, and COPYs the binary into the distroless final stage at /usr/local/bin/chainsaw. Single-arch (linux/amd64); the linux_arm64 checksum is already carried in .settings.yaml so the multi-arch upgrade path is clear. - Makefile image-validators target reads CHAINSAW_VERSION and CHAINSAW_SHA256_LINUX_AMD64 from .settings.yaml via yq and passes them as --build-arg only to the deployment phase build (conformance and performance images do not use chainsaw). - go.mod's kyverno/chainsaw v0.2.15 is already in lockstep with the .settings.yaml pin. Envelope -------- - defaults.JobEnvelopeMargin (60s) added on top of defaults.ChainsawAssertTimeout (6m). The catalog timeout for the expected-resources validator becomes the Job's activeDeadlineSeconds — both must exceed the chainsaw inner timeout so the binary has headroom to terminate, clean up its temp dir, and flush logs before SIGKILL. - recipes/validators/catalog.yaml's expected-resources timeout bumped 5m → 7m. TestExpectedResourcesCatalogEnvelope in pkg/validator/catalog asserts the invariant. Read-only allowlist ------------------- - validators/chainsaw/ValidateTestReadOnly parses each registry- declared Chainsaw Test YAML and rejects any operation outside the {assert, error} allowlist. Bounds the cluster-admin RBAC posture of the deployment validator Job: registry content cannot invoke state-changing (apply/create/delete/patch/update) or side-effecting (script/command/wait/sleep/podLogs/events/describe/get) operations. - Catch / finally / cleanup blocks are equivalently restricted; only assert/error appear in any pure read-only Chainsaw Test. - Test sweeps every recipes/checks/*/health-check.yaml to verify compliance. PR #1223 will add the same enforcement at lint time. Validator runtime ----------------- - Drop the &&len(ref.ExpectedResources)==0 gate in validators/deployment/expected_resources.go. Both ExpectedResources and HealthCheckAsserts paths now run per component. CLI output is source-tagged [expectedResources] / [chainsaw] so operators can disambiguate when both paths report on the same component. - Hard-remove ChainsawBinary.Available() and the associated PR #1231 skip path. The binary is now a hard requirement of the deployment validator image; a regression that drops it surfaces as a clear RunTest error rather than a silent skip. - pkg/recipe.hydrateHealthCheckAsserts no longer skips when ExpectedResources is set — the transitional skip added in #1234 is reverted in lockstep with the validator-side gate drop, so the artifact matches what runs. GPU deep-check migration (deferred) ----------------------------------- - clusterPolicyReady migration to registry-declared Chainsaw YAML is deferred to a focused follow-up. The migration must prove assertion equivalence against expected_resources_test.go's heavily-mocked ClusterPolicy state coverage; bundling that semantic argument here would bloat review. Documented in verifyGPUReadinessSignals doc. - verifyNodewrightReady (formerly skyhookReady) and verifyDRAKubeletPluginReady stay in Go: both rely on dynamic names (recipe ManifestFiles / release-derived chart fullname) that static Chainsaw YAML cannot express without a chart-shape label upstream does not currently apply. Issue #660's deployer-neutrality constraint prohibits encoding the release-derived form. Docs ---- docs/contributor/validator.md's chainsaw section now explains the two-surface model (make check-health vs aicr validate), the runtime binary path, source-tagged CLI output, and the read-only allowlist. Live-cluster validation ----------------------- Required by #660 acceptance criteria; deferred to mchmarny per out-of-band agreement (no live cluster in PR author's environment). Closes #1220. Refs #660, #1219, #1234.

Closes #1222. With this merged, epic #660 is fully done. The PR-time runtime contract from #1219/#1220/#1223 is in place; this PR closes the depth-of-coverage gap on the 21 health checks that landed before the epic and were shallow by today's standards. Per the audit in the PR body, three classes of enhancement: 1. DaemonSet partial-readiness fix (2 files): aws-efa, nvidia-dra-driver-gpu were gating on `numberReady > 0`, which passed on partial failures. Switched to `numberReady == desiredNumberScheduled` with `desiredNumberScheduled > 0` as the vacuous-pass guard — same pattern that fixed NFD in #1234 and the new gke-nccl-tcpxo in #1243. 2. Owned-CRD Established=True assertions (4 components): - gpu-operator: clusterpolicies.nvidia.com + the singleton `cluster-policy` CR with status.state == "ready". This MIGRATES the verifyClusterPolicyReady Go check that PR #1235 task 7 deferred for assertion-equivalence proof; the static Chainsaw assertion is now the canonical readiness signal. - dynamo-platform: dynamocomponentdeployments.nvidia.com. - nodewright-operator: skyhooks.skyhook.nvidia.com. - cert-manager: certificates / issuers / clusterissuers .cert-manager.io. CRD names verified against tests/chainsaw/ai-conformance/cluster/assert-crds.yaml and the Go verifyClusterPolicyReady check. 3. cert-manager scope fix: Previously asserted only the controller Deployment, missing cainjector + webhook. The webhook in particular is load-bearing for admission of any Certificate/Issuer CR; a healthy controller with a dead webhook would silently pass while the cluster rejects every cert-manager CR. Now asserts all three Deployments. 4. Universal container-state error coverage (21 files): Every check that previously only gated on pod phase (Pending / Failed / Unknown) now also rejects pods stuck in Running phase with a known-bad container-waiting reason (CrashLoopBackOff, ImagePullBackOff, ErrImagePull, CreateContainerConfigError). The phase-only filter missed CrashLoopBackOff entirely because such pods typically remain in Running phase with unhealthy containers. Same pattern NFD adopted via #1234. Label-scoped in shared namespaces (kube-system, monitoring) to avoid sibling false-matches; namespace-only otherwise. Also addresses two post-merge review comments on PR #1244: - validators/chainsaw/allowlist_test.go: provider prefix changed from "." to "" to match the canonical defaultEmbeddedProvider used at runtime (head-scratch elimination, no behavior change — filepath.Join cleans both forms equivalently). - Same file: added an explicit CROSS-TEST DEPENDENCY comment block noting that empty-assertFile and unreadable-path failure modes are owned by pkg/recipe.TestComponentRegistry_RequiresHealthCheck. If that test is ever removed or weakened, this test must be strengthened in lockstep so a broken registry can't silently green-light here. Out of scope (deferred to follow-up work): - CRD Established=True for the remaining 6 operators that own CRDs but where I lack high-confidence enumeration without pulling the charts: gatekeeper, kueue, kubeflow-trainer, network-operator, k8s-nim-operator, slinky-slurm-operator. Each needs verification against its pinned chart version; bundling them blindly would risk introducing typo'd names that fail on live clusters. Tracking as a possible follow-up after this PR proves the audit-and-enhance pattern works end-to-end. - StatefulSet readiness for kube-prometheus-stack's Prometheus / Alertmanager: similar reasoning — the chart's exact StatefulSet names and label conventions need verification. Allowlist sweep (TestValidateTestReadOnly_RegistryContent) + registry lint guard (TestComponentRegistry_RequiresHealthCheck) both pass: 27/27 components valid. Live-cluster validation deferred to mchmarny per the same agreement as #1235 / #1243. Refs #660 (epic), #1219, #1220, #1221, #1223, #1234.

Deployment-phase health-check validation regressed in the recent in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252): 1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY non-nil error, so a permanent JMESPath evaluation error was retried for the entire assert window (6m) instead of failing fast like the old chainsaw binary did. Classify assertion-engine errors as ErrCodeInvalidRequest (terminal) and short-circuit the retry loops on them. Adds a regression test asserting terminal eval errors fail fast (< AssertRetryInterval) rather than retrying to deadline. 2. recipes/checks/*: the (init)containerStatuses[?...] | length(@) projection throws 'invalid type for: <nil>' on any pod without (init) containers (the common case), feeding defect #1. Coalesce to an empty array across all 22 affected component health checks. 3. recipes/checks/dynamo-platform: validate-deployment-exists asserted a Deployment named 'dynamo-operator', but the chart prefixes it with the release name (dynamo-platform-dynamo-operator-controller-manager), so the assert never matched and retried to the deadline. Match by the stable app.kubernetes.io/name=dynamo-operator label instead. Verified on an EKS H100 cluster: deployment phase goes from an 8m timeout (status=other) to PASSED in ~21s; full deployment 4/4 and conformance 11/11 green.

mchmarny requested review from a team as code owners June 8, 2026 22:27

mchmarny added enhancement area/recipes area/validator area/infra labels Jun 8, 2026

mchmarny self-assigned this Jun 8, 2026

github-actions Bot added area/docs size/XL and removed area/infra labels Jun 8, 2026