fix(validator): address PR #1231 review on chainsaw hydration runtime#1234
Conversation
Three findings from @yuanchen8911 on the merged hydration PR: 1. Major — the deployment validator's bin.Available() gate over-blocked. chainsaw.Run dispatches raw K8s YAML through the Go assertion library (no binary needed) and Chainsaw Test format through the binary. The batch-level gate skipped both partitions when the binary was missing, so a future raw-YAML registry check would have been silently dropped once #1220 ships the image. Partition at the validator: gate only the Test-format subset; library-path asserts always run. Exported chainsaw.IsChainsawTest so the validator can classify per-assert. 2. Medium — NewChainsawBinary returned Available()==false whenever exec.LookPath missed, even when /usr/local/bin/chainsaw existed and was callable. A container image may install at a known absolute path without extending PATH (the validator image's eventual layout). Add an isExecutableFile probe of the canonical fallback path; report Available()==true if the file exists and has at least one exec bit. canonicalChainsawPath is a var (not const) so tests can swap it, avoiding flakes on dev machines that have chainsaw at the real path. 3. Medium — hydration stamped HealthCheckAsserts onto refs that also declared ExpectedResources, but the validator mutex at expected_resources.go:86 only runs the chainsaw path when ExpectedResources is empty. k8s-nim-operator is the canonical example: registry assertFile + overlay expectedResources → hydrated content was inert noise on the resolved recipe. Skip hydration when ExpectedResources is non-empty so the artifact matches what runs. PR #1220 will drop both this skip and the validator-side mutex simultaneously so the two paths run side-by-side. Tests: - TestHydrateHealthCheckAsserts_ExpectedResourcesWins — covers finding 3. - TestIsExecutableFile — 4-case table for the fallback probe. - TestNewChainsawBinary_Available gains "available via fallback path" subtest and patches canonicalChainsawPath to avoid dev-machine flakes. Refs #1219 (review), #1220 (follow-up that removes both the partition gate and the ExpectedResources skip once the binary ships).
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (7)
📝 WalkthroughWalkthroughThis PR implements a conditional gating strategy for health check assertions based on component Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Coverage Report ✅
Coverage BadgeMerging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. |
PR #1220 of epic #660. Activates the dormant Chainsaw runner so the HealthCheckAsserts content hydrated by #1219 / #1234 actually executes during aicr validate --phase deployment. Image ----- - validators/deployment/Dockerfile: new chainsaw-fetch multi-stage downloads the pinned chainsaw release tarball, verifies sha256 against the value in .settings.yaml, and COPYs the binary into the distroless final stage at /usr/local/bin/chainsaw. Single-arch (linux/amd64); the linux_arm64 checksum is already carried in .settings.yaml so the multi-arch upgrade path is clear. - Makefile image-validators target reads CHAINSAW_VERSION and CHAINSAW_SHA256_LINUX_AMD64 from .settings.yaml via yq and passes them as --build-arg only to the deployment phase build (conformance and performance images do not use chainsaw). - go.mod's kyverno/chainsaw v0.2.15 is already in lockstep with the .settings.yaml pin. Envelope -------- - defaults.JobEnvelopeMargin (60s) added on top of defaults.ChainsawAssertTimeout (6m). The catalog timeout for the expected-resources validator becomes the Job's activeDeadlineSeconds — both must exceed the chainsaw inner timeout so the binary has headroom to terminate, clean up its temp dir, and flush logs before SIGKILL. - recipes/validators/catalog.yaml's expected-resources timeout bumped 5m → 7m. TestExpectedResourcesCatalogEnvelope in pkg/validator/catalog asserts the invariant. Read-only allowlist ------------------- - validators/chainsaw/ValidateTestReadOnly parses each registry- declared Chainsaw Test YAML and rejects any operation outside the {assert, error} allowlist. Bounds the cluster-admin RBAC posture of the deployment validator Job: registry content cannot invoke state-changing (apply/create/delete/patch/update) or side-effecting (script/command/wait/sleep/podLogs/events/describe/get) operations. - Catch / finally / cleanup blocks are equivalently restricted; only assert/error appear in any pure read-only Chainsaw Test. - Test sweeps every recipes/checks/*/health-check.yaml to verify compliance. PR #1223 will add the same enforcement at lint time. Validator runtime ----------------- - Drop the &&len(ref.ExpectedResources)==0 gate in validators/deployment/expected_resources.go. Both ExpectedResources and HealthCheckAsserts paths now run per component. CLI output is source-tagged [expectedResources] / [chainsaw] so operators can disambiguate when both paths report on the same component. - Hard-remove ChainsawBinary.Available() and the associated PR #1231 skip path. The binary is now a hard requirement of the deployment validator image; a regression that drops it surfaces as a clear RunTest error rather than a silent skip. - pkg/recipe.hydrateHealthCheckAsserts no longer skips when ExpectedResources is set — the transitional skip added in #1234 is reverted in lockstep with the validator-side gate drop, so the artifact matches what runs. GPU deep-check migration (deferred) ----------------------------------- - clusterPolicyReady migration to registry-declared Chainsaw YAML is deferred to a focused follow-up. The migration must prove assertion equivalence against expected_resources_test.go's heavily-mocked ClusterPolicy state coverage; bundling that semantic argument here would bloat review. Documented in verifyGPUReadinessSignals doc. - verifyNodewrightReady (formerly skyhookReady) and verifyDRAKubeletPluginReady stay in Go: both rely on dynamic names (recipe ManifestFiles / release-derived chart fullname) that static Chainsaw YAML cannot express without a chart-shape label upstream does not currently apply. Issue #660's deployer-neutrality constraint prohibits encoding the release-derived form. Docs ---- docs/contributor/validator.md's chainsaw section now explains the two-surface model (make check-health vs aicr validate), the runtime binary path, source-tagged CLI output, and the read-only allowlist. Live-cluster validation ----------------------- Required by #660 acceptance criteria; deferred to mchmarny per out-of-band agreement (no live cluster in PR author's environment). Closes #1220. Refs #660, #1219, #1234.
PR #1220 of epic #660. Activates the dormant Chainsaw runner so the HealthCheckAsserts content hydrated by #1219 / #1234 actually executes during aicr validate --phase deployment. Image ----- - validators/deployment/Dockerfile: new chainsaw-fetch multi-stage downloads the pinned chainsaw release tarball, verifies sha256 against the value in .settings.yaml, and COPYs the binary into the distroless final stage at /usr/local/bin/chainsaw. Single-arch (linux/amd64); the linux_arm64 checksum is already carried in .settings.yaml so the multi-arch upgrade path is clear. - Makefile image-validators target reads CHAINSAW_VERSION and CHAINSAW_SHA256_LINUX_AMD64 from .settings.yaml via yq and passes them as --build-arg only to the deployment phase build (conformance and performance images do not use chainsaw). - go.mod's kyverno/chainsaw v0.2.15 is already in lockstep with the .settings.yaml pin. Envelope -------- - defaults.JobEnvelopeMargin (60s) added on top of defaults.ChainsawAssertTimeout (6m). The catalog timeout for the expected-resources validator becomes the Job's activeDeadlineSeconds — both must exceed the chainsaw inner timeout so the binary has headroom to terminate, clean up its temp dir, and flush logs before SIGKILL. - recipes/validators/catalog.yaml's expected-resources timeout bumped 5m → 7m. TestExpectedResourcesCatalogEnvelope in pkg/validator/catalog asserts the invariant. Read-only allowlist ------------------- - validators/chainsaw/ValidateTestReadOnly parses each registry- declared Chainsaw Test YAML and rejects any operation outside the {assert, error} allowlist. Bounds the cluster-admin RBAC posture of the deployment validator Job: registry content cannot invoke state-changing (apply/create/delete/patch/update) or side-effecting (script/command/wait/sleep/podLogs/events/describe/get) operations. - Catch / finally / cleanup blocks are equivalently restricted; only assert/error appear in any pure read-only Chainsaw Test. - Test sweeps every recipes/checks/*/health-check.yaml to verify compliance. PR #1223 will add the same enforcement at lint time. Validator runtime ----------------- - Drop the &&len(ref.ExpectedResources)==0 gate in validators/deployment/expected_resources.go. Both ExpectedResources and HealthCheckAsserts paths now run per component. CLI output is source-tagged [expectedResources] / [chainsaw] so operators can disambiguate when both paths report on the same component. - Hard-remove ChainsawBinary.Available() and the associated PR #1231 skip path. The binary is now a hard requirement of the deployment validator image; a regression that drops it surfaces as a clear RunTest error rather than a silent skip. - pkg/recipe.hydrateHealthCheckAsserts no longer skips when ExpectedResources is set — the transitional skip added in #1234 is reverted in lockstep with the validator-side gate drop, so the artifact matches what runs. GPU deep-check migration (deferred) ----------------------------------- - clusterPolicyReady migration to registry-declared Chainsaw YAML is deferred to a focused follow-up. The migration must prove assertion equivalence against expected_resources_test.go's heavily-mocked ClusterPolicy state coverage; bundling that semantic argument here would bloat review. Documented in verifyGPUReadinessSignals doc. - verifyNodewrightReady (formerly skyhookReady) and verifyDRAKubeletPluginReady stay in Go: both rely on dynamic names (recipe ManifestFiles / release-derived chart fullname) that static Chainsaw YAML cannot express without a chart-shape label upstream does not currently apply. Issue #660's deployer-neutrality constraint prohibits encoding the release-derived form. Docs ---- docs/contributor/validator.md's chainsaw section now explains the two-surface model (make check-health vs aicr validate), the runtime binary path, source-tagged CLI output, and the read-only allowlist. Live-cluster validation ----------------------- Required by #660 acceptance criteria; deferred to mchmarny per out-of-band agreement (no live cluster in PR author's environment). Closes #1220. Refs #660, #1219, #1234.
PR #1220 of epic #660. Activates the dormant Chainsaw runner so the HealthCheckAsserts content hydrated by #1219 / #1234 actually executes during aicr validate --phase deployment. Image ----- - validators/deployment/Dockerfile: new chainsaw-fetch multi-stage downloads the pinned chainsaw release tarball, verifies sha256 against the value in .settings.yaml, and COPYs the binary into the distroless final stage at /usr/local/bin/chainsaw. Single-arch (linux/amd64); the linux_arm64 checksum is already carried in .settings.yaml so the multi-arch upgrade path is clear. - Makefile image-validators target reads CHAINSAW_VERSION and CHAINSAW_SHA256_LINUX_AMD64 from .settings.yaml via yq and passes them as --build-arg only to the deployment phase build (conformance and performance images do not use chainsaw). - go.mod's kyverno/chainsaw v0.2.15 is already in lockstep with the .settings.yaml pin. Envelope -------- - defaults.JobEnvelopeMargin (60s) added on top of defaults.ChainsawAssertTimeout (6m). The catalog timeout for the expected-resources validator becomes the Job's activeDeadlineSeconds — both must exceed the chainsaw inner timeout so the binary has headroom to terminate, clean up its temp dir, and flush logs before SIGKILL. - recipes/validators/catalog.yaml's expected-resources timeout bumped 5m → 7m. TestExpectedResourcesCatalogEnvelope in pkg/validator/catalog asserts the invariant. Read-only allowlist ------------------- - validators/chainsaw/ValidateTestReadOnly parses each registry- declared Chainsaw Test YAML and rejects any operation outside the {assert, error} allowlist. Bounds the cluster-admin RBAC posture of the deployment validator Job: registry content cannot invoke state-changing (apply/create/delete/patch/update) or side-effecting (script/command/wait/sleep/podLogs/events/describe/get) operations. - Catch / finally / cleanup blocks are equivalently restricted; only assert/error appear in any pure read-only Chainsaw Test. - Test sweeps every recipes/checks/*/health-check.yaml to verify compliance. PR #1223 will add the same enforcement at lint time. Validator runtime ----------------- - Drop the &&len(ref.ExpectedResources)==0 gate in validators/deployment/expected_resources.go. Both ExpectedResources and HealthCheckAsserts paths now run per component. CLI output is source-tagged [expectedResources] / [chainsaw] so operators can disambiguate when both paths report on the same component. - Hard-remove ChainsawBinary.Available() and the associated PR #1231 skip path. The binary is now a hard requirement of the deployment validator image; a regression that drops it surfaces as a clear RunTest error rather than a silent skip. - pkg/recipe.hydrateHealthCheckAsserts no longer skips when ExpectedResources is set — the transitional skip added in #1234 is reverted in lockstep with the validator-side gate drop, so the artifact matches what runs. GPU deep-check migration (deferred) ----------------------------------- - clusterPolicyReady migration to registry-declared Chainsaw YAML is deferred to a focused follow-up. The migration must prove assertion equivalence against expected_resources_test.go's heavily-mocked ClusterPolicy state coverage; bundling that semantic argument here would bloat review. Documented in verifyGPUReadinessSignals doc. - verifyNodewrightReady (formerly skyhookReady) and verifyDRAKubeletPluginReady stay in Go: both rely on dynamic names (recipe ManifestFiles / release-derived chart fullname) that static Chainsaw YAML cannot express without a chart-shape label upstream does not currently apply. Issue #660's deployer-neutrality constraint prohibits encoding the release-derived form. Docs ---- docs/contributor/validator.md's chainsaw section now explains the two-surface model (make check-health vs aicr validate), the runtime binary path, source-tagged CLI output, and the read-only allowlist. Live-cluster validation ----------------------- Required by #660 acceptance criteria; deferred to mchmarny per out-of-band agreement (no live cluster in PR author's environment). Closes #1220. Refs #660, #1219, #1234.
PR #1220 of epic #660. Activates the dormant Chainsaw runner so the HealthCheckAsserts content hydrated by #1219 / #1234 actually executes during aicr validate --phase deployment. Image ----- - validators/deployment/Dockerfile: new chainsaw-fetch multi-stage downloads the pinned chainsaw release tarball, verifies sha256 against the value in .settings.yaml, and COPYs the binary into the distroless final stage at /usr/local/bin/chainsaw. Single-arch (linux/amd64); the linux_arm64 checksum is already carried in .settings.yaml so the multi-arch upgrade path is clear. - Makefile image-validators target reads CHAINSAW_VERSION and CHAINSAW_SHA256_LINUX_AMD64 from .settings.yaml via yq and passes them as --build-arg only to the deployment phase build (conformance and performance images do not use chainsaw). - go.mod's kyverno/chainsaw v0.2.15 is already in lockstep with the .settings.yaml pin. Envelope -------- - defaults.JobEnvelopeMargin (60s) added on top of defaults.ChainsawAssertTimeout (6m). The catalog timeout for the expected-resources validator becomes the Job's activeDeadlineSeconds — both must exceed the chainsaw inner timeout so the binary has headroom to terminate, clean up its temp dir, and flush logs before SIGKILL. - recipes/validators/catalog.yaml's expected-resources timeout bumped 5m → 7m. TestExpectedResourcesCatalogEnvelope in pkg/validator/catalog asserts the invariant. Read-only allowlist ------------------- - validators/chainsaw/ValidateTestReadOnly parses each registry- declared Chainsaw Test YAML and rejects any operation outside the {assert, error} allowlist. Bounds the cluster-admin RBAC posture of the deployment validator Job: registry content cannot invoke state-changing (apply/create/delete/patch/update) or side-effecting (script/command/wait/sleep/podLogs/events/describe/get) operations. - Catch / finally / cleanup blocks are equivalently restricted; only assert/error appear in any pure read-only Chainsaw Test. - Test sweeps every recipes/checks/*/health-check.yaml to verify compliance. PR #1223 will add the same enforcement at lint time. Validator runtime ----------------- - Drop the &&len(ref.ExpectedResources)==0 gate in validators/deployment/expected_resources.go. Both ExpectedResources and HealthCheckAsserts paths now run per component. CLI output is source-tagged [expectedResources] / [chainsaw] so operators can disambiguate when both paths report on the same component. - Hard-remove ChainsawBinary.Available() and the associated PR #1231 skip path. The binary is now a hard requirement of the deployment validator image; a regression that drops it surfaces as a clear RunTest error rather than a silent skip. - pkg/recipe.hydrateHealthCheckAsserts no longer skips when ExpectedResources is set — the transitional skip added in #1234 is reverted in lockstep with the validator-side gate drop, so the artifact matches what runs. GPU deep-check migration (deferred) ----------------------------------- - clusterPolicyReady migration to registry-declared Chainsaw YAML is deferred to a focused follow-up. The migration must prove assertion equivalence against expected_resources_test.go's heavily-mocked ClusterPolicy state coverage; bundling that semantic argument here would bloat review. Documented in verifyGPUReadinessSignals doc. - verifyNodewrightReady (formerly skyhookReady) and verifyDRAKubeletPluginReady stay in Go: both rely on dynamic names (recipe ManifestFiles / release-derived chart fullname) that static Chainsaw YAML cannot express without a chart-shape label upstream does not currently apply. Issue #660's deployer-neutrality constraint prohibits encoding the release-derived form. Docs ---- docs/contributor/validator.md's chainsaw section now explains the two-surface model (make check-health vs aicr validate), the runtime binary path, source-tagged CLI output, and the read-only allowlist. Live-cluster validation ----------------------- Required by #660 acceptance criteria; deferred to mchmarny per out-of-band agreement (no live cluster in PR author's environment). Closes #1220. Refs #660, #1219, #1234.
PR #1220 of epic #660. Activates the dormant Chainsaw runner so the HealthCheckAsserts content hydrated by #1219 / #1234 actually executes during aicr validate --phase deployment. Image ----- - validators/deployment/Dockerfile: new chainsaw-fetch multi-stage downloads the pinned chainsaw release tarball, verifies sha256 against the value in .settings.yaml, and COPYs the binary into the distroless final stage at /usr/local/bin/chainsaw. Single-arch (linux/amd64); the linux_arm64 checksum is already carried in .settings.yaml so the multi-arch upgrade path is clear. - Makefile image-validators target reads CHAINSAW_VERSION and CHAINSAW_SHA256_LINUX_AMD64 from .settings.yaml via yq and passes them as --build-arg only to the deployment phase build (conformance and performance images do not use chainsaw). - go.mod's kyverno/chainsaw v0.2.15 is already in lockstep with the .settings.yaml pin. Envelope -------- - defaults.JobEnvelopeMargin (60s) added on top of defaults.ChainsawAssertTimeout (6m). The catalog timeout for the expected-resources validator becomes the Job's activeDeadlineSeconds — both must exceed the chainsaw inner timeout so the binary has headroom to terminate, clean up its temp dir, and flush logs before SIGKILL. - recipes/validators/catalog.yaml's expected-resources timeout bumped 5m → 7m. TestExpectedResourcesCatalogEnvelope in pkg/validator/catalog asserts the invariant. Read-only allowlist ------------------- - validators/chainsaw/ValidateTestReadOnly parses each registry- declared Chainsaw Test YAML and rejects any operation outside the {assert, error} allowlist. Bounds the cluster-admin RBAC posture of the deployment validator Job: registry content cannot invoke state-changing (apply/create/delete/patch/update) or side-effecting (script/command/wait/sleep/podLogs/events/describe/get) operations. - Catch / finally / cleanup blocks are equivalently restricted; only assert/error appear in any pure read-only Chainsaw Test. - Test sweeps every recipes/checks/*/health-check.yaml to verify compliance. PR #1223 will add the same enforcement at lint time. Validator runtime ----------------- - Drop the &&len(ref.ExpectedResources)==0 gate in validators/deployment/expected_resources.go. Both ExpectedResources and HealthCheckAsserts paths now run per component. CLI output is source-tagged [expectedResources] / [chainsaw] so operators can disambiguate when both paths report on the same component. - Hard-remove ChainsawBinary.Available() and the associated PR #1231 skip path. The binary is now a hard requirement of the deployment validator image; a regression that drops it surfaces as a clear RunTest error rather than a silent skip. - pkg/recipe.hydrateHealthCheckAsserts no longer skips when ExpectedResources is set — the transitional skip added in #1234 is reverted in lockstep with the validator-side gate drop, so the artifact matches what runs. GPU deep-check migration (deferred) ----------------------------------- - clusterPolicyReady migration to registry-declared Chainsaw YAML is deferred to a focused follow-up. The migration must prove assertion equivalence against expected_resources_test.go's heavily-mocked ClusterPolicy state coverage; bundling that semantic argument here would bloat review. Documented in verifyGPUReadinessSignals doc. - verifyNodewrightReady (formerly skyhookReady) and verifyDRAKubeletPluginReady stay in Go: both rely on dynamic names (recipe ManifestFiles / release-derived chart fullname) that static Chainsaw YAML cannot express without a chart-shape label upstream does not currently apply. Issue #660's deployer-neutrality constraint prohibits encoding the release-derived form. Docs ---- docs/contributor/validator.md's chainsaw section now explains the two-surface model (make check-health vs aicr validate), the runtime binary path, source-tagged CLI output, and the read-only allowlist. Live-cluster validation ----------------------- Required by #660 acceptance criteria; deferred to mchmarny per out-of-band agreement (no live cluster in PR author's environment). Closes #1220. Refs #660, #1219, #1234.
PR #1220 of epic #660. Activates the dormant Chainsaw runner so the HealthCheckAsserts content hydrated by #1219 / #1234 actually executes during aicr validate --phase deployment. Image ----- - validators/deployment/Dockerfile: new chainsaw-fetch multi-stage downloads the pinned chainsaw release tarball, verifies sha256 against the value in .settings.yaml, and COPYs the binary into the distroless final stage at /usr/local/bin/chainsaw. Single-arch (linux/amd64); the linux_arm64 checksum is already carried in .settings.yaml so the multi-arch upgrade path is clear. - Makefile image-validators target reads CHAINSAW_VERSION and CHAINSAW_SHA256_LINUX_AMD64 from .settings.yaml via yq and passes them as --build-arg only to the deployment phase build (conformance and performance images do not use chainsaw). - go.mod's kyverno/chainsaw v0.2.15 is already in lockstep with the .settings.yaml pin. Envelope -------- - defaults.JobEnvelopeMargin (60s) added on top of defaults.ChainsawAssertTimeout (6m). The catalog timeout for the expected-resources validator becomes the Job's activeDeadlineSeconds — both must exceed the chainsaw inner timeout so the binary has headroom to terminate, clean up its temp dir, and flush logs before SIGKILL. - recipes/validators/catalog.yaml's expected-resources timeout bumped 5m → 7m. TestExpectedResourcesCatalogEnvelope in pkg/validator/catalog asserts the invariant. Read-only allowlist ------------------- - validators/chainsaw/ValidateTestReadOnly parses each registry- declared Chainsaw Test YAML and rejects any operation outside the {assert, error} allowlist. Bounds the cluster-admin RBAC posture of the deployment validator Job: registry content cannot invoke state-changing (apply/create/delete/patch/update) or side-effecting (script/command/wait/sleep/podLogs/events/describe/get) operations. - Catch / finally / cleanup blocks are equivalently restricted; only assert/error appear in any pure read-only Chainsaw Test. - Test sweeps every recipes/checks/*/health-check.yaml to verify compliance. PR #1223 will add the same enforcement at lint time. Validator runtime ----------------- - Drop the &&len(ref.ExpectedResources)==0 gate in validators/deployment/expected_resources.go. Both ExpectedResources and HealthCheckAsserts paths now run per component. CLI output is source-tagged [expectedResources] / [chainsaw] so operators can disambiguate when both paths report on the same component. - Hard-remove ChainsawBinary.Available() and the associated PR #1231 skip path. The binary is now a hard requirement of the deployment validator image; a regression that drops it surfaces as a clear RunTest error rather than a silent skip. - pkg/recipe.hydrateHealthCheckAsserts no longer skips when ExpectedResources is set — the transitional skip added in #1234 is reverted in lockstep with the validator-side gate drop, so the artifact matches what runs. GPU deep-check migration (deferred) ----------------------------------- - clusterPolicyReady migration to registry-declared Chainsaw YAML is deferred to a focused follow-up. The migration must prove assertion equivalence against expected_resources_test.go's heavily-mocked ClusterPolicy state coverage; bundling that semantic argument here would bloat review. Documented in verifyGPUReadinessSignals doc. - verifyNodewrightReady (formerly skyhookReady) and verifyDRAKubeletPluginReady stay in Go: both rely on dynamic names (recipe ManifestFiles / release-derived chart fullname) that static Chainsaw YAML cannot express without a chart-shape label upstream does not currently apply. Issue #660's deployer-neutrality constraint prohibits encoding the release-derived form. Docs ---- docs/contributor/validator.md's chainsaw section now explains the two-surface model (make check-health vs aicr validate), the runtime binary path, source-tagged CLI output, and the read-only allowlist. Live-cluster validation ----------------------- Required by #660 acceptance criteria; deferred to mchmarny per out-of-band agreement (no live cluster in PR author's environment). Closes #1220. Refs #660, #1219, #1234.
PR #1220 of epic #660. Activates the dormant Chainsaw runner so the HealthCheckAsserts content hydrated by #1219 / #1234 actually executes during aicr validate --phase deployment. Image ----- - validators/deployment/Dockerfile: new chainsaw-fetch multi-stage downloads the pinned chainsaw release tarball, verifies sha256 against the value in .settings.yaml, and COPYs the binary into the distroless final stage at /usr/local/bin/chainsaw. Single-arch (linux/amd64); the linux_arm64 checksum is already carried in .settings.yaml so the multi-arch upgrade path is clear. - Makefile image-validators target reads CHAINSAW_VERSION and CHAINSAW_SHA256_LINUX_AMD64 from .settings.yaml via yq and passes them as --build-arg only to the deployment phase build (conformance and performance images do not use chainsaw). - go.mod's kyverno/chainsaw v0.2.15 is already in lockstep with the .settings.yaml pin. Envelope -------- - defaults.JobEnvelopeMargin (60s) added on top of defaults.ChainsawAssertTimeout (6m). The catalog timeout for the expected-resources validator becomes the Job's activeDeadlineSeconds — both must exceed the chainsaw inner timeout so the binary has headroom to terminate, clean up its temp dir, and flush logs before SIGKILL. - recipes/validators/catalog.yaml's expected-resources timeout bumped 5m → 7m. TestExpectedResourcesCatalogEnvelope in pkg/validator/catalog asserts the invariant. Read-only allowlist ------------------- - validators/chainsaw/ValidateTestReadOnly parses each registry- declared Chainsaw Test YAML and rejects any operation outside the {assert, error} allowlist. Bounds the cluster-admin RBAC posture of the deployment validator Job: registry content cannot invoke state-changing (apply/create/delete/patch/update) or side-effecting (script/command/wait/sleep/podLogs/events/describe/get) operations. - Catch / finally / cleanup blocks are equivalently restricted; only assert/error appear in any pure read-only Chainsaw Test. - Test sweeps every recipes/checks/*/health-check.yaml to verify compliance. PR #1223 will add the same enforcement at lint time. Validator runtime ----------------- - Drop the &&len(ref.ExpectedResources)==0 gate in validators/deployment/expected_resources.go. Both ExpectedResources and HealthCheckAsserts paths now run per component. CLI output is source-tagged [expectedResources] / [chainsaw] so operators can disambiguate when both paths report on the same component. - Hard-remove ChainsawBinary.Available() and the associated PR #1231 skip path. The binary is now a hard requirement of the deployment validator image; a regression that drops it surfaces as a clear RunTest error rather than a silent skip. - pkg/recipe.hydrateHealthCheckAsserts no longer skips when ExpectedResources is set — the transitional skip added in #1234 is reverted in lockstep with the validator-side gate drop, so the artifact matches what runs. GPU deep-check migration (deferred) ----------------------------------- - clusterPolicyReady migration to registry-declared Chainsaw YAML is deferred to a focused follow-up. The migration must prove assertion equivalence against expected_resources_test.go's heavily-mocked ClusterPolicy state coverage; bundling that semantic argument here would bloat review. Documented in verifyGPUReadinessSignals doc. - verifyNodewrightReady (formerly skyhookReady) and verifyDRAKubeletPluginReady stay in Go: both rely on dynamic names (recipe ManifestFiles / release-derived chart fullname) that static Chainsaw YAML cannot express without a chart-shape label upstream does not currently apply. Issue #660's deployer-neutrality constraint prohibits encoding the release-derived form. Docs ---- docs/contributor/validator.md's chainsaw section now explains the two-surface model (make check-health vs aicr validate), the runtime binary path, source-tagged CLI output, and the read-only allowlist. Live-cluster validation ----------------------- Required by #660 acceptance criteria; deferred to mchmarny per out-of-band agreement (no live cluster in PR author's environment). Closes #1220. Refs #660, #1219, #1234.
Closes #1222. With this merged, epic #660 is fully done. The PR-time runtime contract from #1219/#1220/#1223 is in place; this PR closes the depth-of-coverage gap on the 21 health checks that landed before the epic and were shallow by today's standards. Per the audit in the PR body, three classes of enhancement: 1. DaemonSet partial-readiness fix (2 files): aws-efa, nvidia-dra-driver-gpu were gating on `numberReady > 0`, which passed on partial failures. Switched to `numberReady == desiredNumberScheduled` with `desiredNumberScheduled > 0` as the vacuous-pass guard — same pattern that fixed NFD in #1234 and the new gke-nccl-tcpxo in #1243. 2. Owned-CRD Established=True assertions (4 components): - gpu-operator: clusterpolicies.nvidia.com + the singleton `cluster-policy` CR with status.state == "ready". This MIGRATES the verifyClusterPolicyReady Go check that PR #1235 task 7 deferred for assertion-equivalence proof; the static Chainsaw assertion is now the canonical readiness signal. - dynamo-platform: dynamocomponentdeployments.nvidia.com. - nodewright-operator: skyhooks.skyhook.nvidia.com. - cert-manager: certificates / issuers / clusterissuers .cert-manager.io. CRD names verified against tests/chainsaw/ai-conformance/cluster/assert-crds.yaml and the Go verifyClusterPolicyReady check. 3. cert-manager scope fix: Previously asserted only the controller Deployment, missing cainjector + webhook. The webhook in particular is load-bearing for admission of any Certificate/Issuer CR; a healthy controller with a dead webhook would silently pass while the cluster rejects every cert-manager CR. Now asserts all three Deployments. 4. Universal container-state error coverage (21 files): Every check that previously only gated on pod phase (Pending / Failed / Unknown) now also rejects pods stuck in Running phase with a known-bad container-waiting reason (CrashLoopBackOff, ImagePullBackOff, ErrImagePull, CreateContainerConfigError). The phase-only filter missed CrashLoopBackOff entirely because such pods typically remain in Running phase with unhealthy containers. Same pattern NFD adopted via #1234. Label-scoped in shared namespaces (kube-system, monitoring) to avoid sibling false-matches; namespace-only otherwise. Also addresses two post-merge review comments on PR #1244: - validators/chainsaw/allowlist_test.go: provider prefix changed from "." to "" to match the canonical defaultEmbeddedProvider used at runtime (head-scratch elimination, no behavior change — filepath.Join cleans both forms equivalently). - Same file: added an explicit CROSS-TEST DEPENDENCY comment block noting that empty-assertFile and unreadable-path failure modes are owned by pkg/recipe.TestComponentRegistry_RequiresHealthCheck. If that test is ever removed or weakened, this test must be strengthened in lockstep so a broken registry can't silently green-light here. Out of scope (deferred to follow-up work): - CRD Established=True for the remaining 6 operators that own CRDs but where I lack high-confidence enumeration without pulling the charts: gatekeeper, kueue, kubeflow-trainer, network-operator, k8s-nim-operator, slinky-slurm-operator. Each needs verification against its pinned chart version; bundling them blindly would risk introducing typo'd names that fail on live clusters. Tracking as a possible follow-up after this PR proves the audit-and-enhance pattern works end-to-end. - StatefulSet readiness for kube-prometheus-stack's Prometheus / Alertmanager: similar reasoning — the chart's exact StatefulSet names and label conventions need verification. Allowlist sweep (TestValidateTestReadOnly_RegistryContent) + registry lint guard (TestComponentRegistry_RequiresHealthCheck) both pass: 27/27 components valid. Live-cluster validation deferred to mchmarny per the same agreement as #1235 / #1243. Refs #660 (epic), #1219, #1220, #1221, #1223, #1234.
Closes #1222. With this merged, epic #660 is fully done. The PR-time runtime contract from #1219/#1220/#1223 is in place; this PR closes the depth-of-coverage gap on the 21 health checks that landed before the epic and were shallow by today's standards. Per the audit in the PR body, three classes of enhancement: 1. DaemonSet partial-readiness fix (2 files): aws-efa, nvidia-dra-driver-gpu were gating on `numberReady > 0`, which passed on partial failures. Switched to `numberReady == desiredNumberScheduled` with `desiredNumberScheduled > 0` as the vacuous-pass guard — same pattern that fixed NFD in #1234 and the new gke-nccl-tcpxo in #1243. 2. Owned-CRD Established=True assertions (4 components): - gpu-operator: clusterpolicies.nvidia.com + the singleton `cluster-policy` CR with status.state == "ready". This MIGRATES the verifyClusterPolicyReady Go check that PR #1235 task 7 deferred for assertion-equivalence proof; the static Chainsaw assertion is now the canonical readiness signal. - dynamo-platform: dynamocomponentdeployments.nvidia.com. - nodewright-operator: skyhooks.skyhook.nvidia.com. - cert-manager: certificates / issuers / clusterissuers .cert-manager.io. CRD names verified against tests/chainsaw/ai-conformance/cluster/assert-crds.yaml and the Go verifyClusterPolicyReady check. 3. cert-manager scope fix: Previously asserted only the controller Deployment, missing cainjector + webhook. The webhook in particular is load-bearing for admission of any Certificate/Issuer CR; a healthy controller with a dead webhook would silently pass while the cluster rejects every cert-manager CR. Now asserts all three Deployments. 4. Universal container-state error coverage (21 files): Every check that previously only gated on pod phase (Pending / Failed / Unknown) now also rejects pods stuck in Running phase with a known-bad container-waiting reason (CrashLoopBackOff, ImagePullBackOff, ErrImagePull, CreateContainerConfigError). The phase-only filter missed CrashLoopBackOff entirely because such pods typically remain in Running phase with unhealthy containers. Same pattern NFD adopted via #1234. Label-scoped in shared namespaces (kube-system, monitoring) to avoid sibling false-matches; namespace-only otherwise. Also addresses two post-merge review comments on PR #1244: - validators/chainsaw/allowlist_test.go: provider prefix changed from "." to "" to match the canonical defaultEmbeddedProvider used at runtime (head-scratch elimination, no behavior change — filepath.Join cleans both forms equivalently). - Same file: added an explicit CROSS-TEST DEPENDENCY comment block noting that empty-assertFile and unreadable-path failure modes are owned by pkg/recipe.TestComponentRegistry_RequiresHealthCheck. If that test is ever removed or weakened, this test must be strengthened in lockstep so a broken registry can't silently green-light here. Out of scope (deferred to follow-up work): - CRD Established=True for the remaining 6 operators that own CRDs but where I lack high-confidence enumeration without pulling the charts: gatekeeper, kueue, kubeflow-trainer, network-operator, k8s-nim-operator, slinky-slurm-operator. Each needs verification against its pinned chart version; bundling them blindly would risk introducing typo'd names that fail on live clusters. Tracking as a possible follow-up after this PR proves the audit-and-enhance pattern works end-to-end. - StatefulSet readiness for kube-prometheus-stack's Prometheus / Alertmanager: similar reasoning — the chart's exact StatefulSet names and label conventions need verification. Allowlist sweep (TestValidateTestReadOnly_RegistryContent) + registry lint guard (TestComponentRegistry_RequiresHealthCheck) both pass: 27/27 components valid. Live-cluster validation deferred to mchmarny per the same agreement as #1235 / #1243. Refs #660 (epic), #1219, #1220, #1221, #1223, #1234.
Summary
Follow-up to merged PR #1231 addressing three review findings from
@yuanchen8911. No new functionality — tightens correctness around thehydration runtime gate and the registry/overlay interaction so the
artifact matches what the deployment validator actually runs.
Motivation / Context
After #1231 merged, three issues were flagged against
feat/1219-hydrate-healthcheck-assertfile@26aa48eb:bin.Available()gate atvalidators/deployment/expected_resources.goover-blocked raw-K8s-YAMLasserts.
chainsaw.Rundispatches raw YAML through the Go assertionlibrary and only Test format requires the binary. A future raw-YAML
registry check would have been silently dropped once Ship chainsaw binary + wire deployment-phase runner (#660 PR 2) #1220 ships the
image.
NewChainsawBinaryreturnedAvailable()==falsewhenexec.LookPathmissed, even if/usr/local/bin/chainsawexisted andwas callable. Container images may install at a known absolute path
without extending PATH.
HealthCheckAssertsonto refs thatalso declared
ExpectedResources. The validator's mutex(
expected_resources.go:86) only runs the chainsaw path whenExpectedResourcesis empty, so the hydrated content was inert.k8s-nim-operatoris the canonical example (registry assertFile +overlay
expectedResourcesviah100-eks-ubuntu-inference-nim.yaml).Fixes the runtime correctness in #1231 — does not change validation
outcomes for any current component.
Fixes:
Related: #1231, #1219, #1220, #660
Type of Change
Component(s) Affected
pkg/recipe)validators/deployment,validators/chainsaw)Implementation Notes
Finding 1 — per-assert partition
chainsaw.IsChainsawTestis now exported. The deployment validatorpartitions
chainsawAssertsinto Test-format vs raw-YAML, gates onlythe Test-format subset on
bin.Available(), and always runs the rawsubset. Skipped Test-format components are logged by name. All current
registry checks happen to be Test-format, so the per-batch warning
becomes a per-component warning today; the partition guarantees the
gate stays correct as raw-YAML checks land.
Finding 2 — fallback executability probe
NewChainsawBinaryfirst triesexec.LookPath, then falls back tostat-ing the canonical install path (
/usr/local/bin/chainsaw) for aregular file with at least one execute bit.
canonicalChainsawPathisa
var(notconst) so tests can swap it — necessary because realdeveloper machines often have
/usr/local/bin/chainsawinstalled,which would otherwise make the
Available()==falsesubtest flake.Finding 3 — skip hydration when ExpectedResources is set
hydrateHealthCheckAssertsskips whenlen(ref.ExpectedResources) > 0.This mirrors the deployment validator's mutex at
expected_resources.go:86so the artifact matches what runs. PR #1220will simultaneously drop both the validator mutex and this hydration
skip so the two paths run side-by-side with dedup/source-tag CLI
output.
Testing
go test -race -count=1 ./pkg/recipe/ ./validators/chainsaw/ ./validators/deployment/ golangci-lint run -c .golangci.yaml ./pkg/recipe/... ./validators/chainsaw/... ./validators/deployment/...New tests:
TestHydrateHealthCheckAsserts_ExpectedResourcesWins— finding 3.TestIsExecutableFile— 4-case table covering executable / non-exec /missing / directory.
TestNewChainsawBinary_Availablegainsavailable via fallback path when PATH missessubtest and patchescanonicalChainsawPathpersubtest to avoid dev-machine flakes.
All tests + lint clean.
Risk Assessment
validation-outcome change for any current component (every registry
assertFile today is Test format, and
k8s-nim-operatoralready hadits assertFile content silently dropped by the validator mutex).
Rollout notes: No user-visible behavior change. Log messages now
warn per-component instead of per-batch when binary is missing; the
operator hint still links to #1220.
Checklist
make testwith-race)git commit -S)