Skip to content

fix(validator): address PR #1231 review on chainsaw hydration runtime#1234

Merged
mchmarny merged 1 commit into
mainfrom
feat/chainsaw-runtime-review-fixes
Jun 8, 2026
Merged

fix(validator): address PR #1231 review on chainsaw hydration runtime#1234
mchmarny merged 1 commit into
mainfrom
feat/chainsaw-runtime-review-fixes

Conversation

@mchmarny

@mchmarny mchmarny commented Jun 8, 2026

Copy link
Copy Markdown
Member

Summary

Follow-up to merged PR #1231 addressing three review findings from
@yuanchen8911. No new functionality — tightens correctness around the
hydration runtime gate and the registry/overlay interaction so the
artifact matches what the deployment validator actually runs.

Motivation / Context

After #1231 merged, three issues were flagged against
feat/1219-hydrate-healthcheck-assertfile@26aa48eb:

  1. Major — the bin.Available() gate at
    validators/deployment/expected_resources.go over-blocked raw-K8s-YAML
    asserts. chainsaw.Run dispatches raw YAML through the Go assertion
    library and only Test format requires the binary. A future raw-YAML
    registry check would have been silently dropped once Ship chainsaw binary + wire deployment-phase runner (#660 PR 2) #1220 ships the
    image.
  2. MediumNewChainsawBinary returned Available()==false when
    exec.LookPath missed, even if /usr/local/bin/chainsaw existed and
    was callable. Container images may install at a known absolute path
    without extending PATH.
  3. Medium — hydration stamped HealthCheckAsserts onto refs that
    also declared ExpectedResources. The validator's mutex
    (expected_resources.go:86) only runs the chainsaw path when
    ExpectedResources is empty, so the hydrated content was inert.
    k8s-nim-operator is the canonical example (registry assertFile +
    overlay expectedResources via h100-eks-ubuntu-inference-nim.yaml).

Fixes the runtime correctness in #1231 — does not change validation
outcomes for any current component.

Fixes:
Related: #1231, #1219, #1220, #660

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Recipe engine / data (pkg/recipe)
  • Validator (validators/deployment, validators/chainsaw)

Implementation Notes

Finding 1 — per-assert partition

chainsaw.IsChainsawTest is now exported. The deployment validator
partitions chainsawAsserts into Test-format vs raw-YAML, gates only
the Test-format subset on bin.Available(), and always runs the raw
subset. Skipped Test-format components are logged by name. All current
registry checks happen to be Test-format, so the per-batch warning
becomes a per-component warning today; the partition guarantees the
gate stays correct as raw-YAML checks land.

Finding 2 — fallback executability probe

NewChainsawBinary first tries exec.LookPath, then falls back to
stat-ing the canonical install path (/usr/local/bin/chainsaw) for a
regular file with at least one execute bit. canonicalChainsawPath is
a var (not const) so tests can swap it — necessary because real
developer machines often have /usr/local/bin/chainsaw installed,
which would otherwise make the Available()==false subtest flake.

Finding 3 — skip hydration when ExpectedResources is set

hydrateHealthCheckAsserts skips when len(ref.ExpectedResources) > 0.
This mirrors the deployment validator's mutex at
expected_resources.go:86 so the artifact matches what runs. PR #1220
will simultaneously drop both the validator mutex and this hydration
skip so the two paths run side-by-side with dedup/source-tag CLI
output.

Testing

go test -race -count=1 ./pkg/recipe/ ./validators/chainsaw/ ./validators/deployment/
golangci-lint run -c .golangci.yaml ./pkg/recipe/... ./validators/chainsaw/... ./validators/deployment/...

New tests:

  • TestHydrateHealthCheckAsserts_ExpectedResourcesWins — finding 3.
  • TestIsExecutableFile — 4-case table covering executable / non-exec /
    missing / directory.
  • TestNewChainsawBinary_Available gains available via fallback path when PATH misses subtest and patches canonicalChainsawPath per
    subtest to avoid dev-machine flakes.

All tests + lint clean.

Risk Assessment

  • Low — Isolated changes, well-tested, easy to revert. No
    validation-outcome change for any current component (every registry
    assertFile today is Test format, and k8s-nim-operator already had
    its assertFile content silently dropped by the validator mutex).

Rollout notes: No user-visible behavior change. Log messages now
warn per-component instead of per-batch when binary is missing; the
operator hint still links to #1220.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (changed paths)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • No user-facing CLI/API/recipe-field behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

Three findings from @yuanchen8911 on the merged hydration PR:

1. Major — the deployment validator's bin.Available() gate over-blocked.
   chainsaw.Run dispatches raw K8s YAML through the Go assertion library
   (no binary needed) and Chainsaw Test format through the binary. The
   batch-level gate skipped both partitions when the binary was missing,
   so a future raw-YAML registry check would have been silently dropped
   once #1220 ships the image. Partition at the validator: gate only the
   Test-format subset; library-path asserts always run. Exported
   chainsaw.IsChainsawTest so the validator can classify per-assert.

2. Medium — NewChainsawBinary returned Available()==false whenever
   exec.LookPath missed, even when /usr/local/bin/chainsaw existed and
   was callable. A container image may install at a known absolute
   path without extending PATH (the validator image's eventual layout).
   Add an isExecutableFile probe of the canonical fallback path; report
   Available()==true if the file exists and has at least one exec bit.
   canonicalChainsawPath is a var (not const) so tests can swap it,
   avoiding flakes on dev machines that have chainsaw at the real path.

3. Medium — hydration stamped HealthCheckAsserts onto refs that also
   declared ExpectedResources, but the validator mutex at
   expected_resources.go:86 only runs the chainsaw path when
   ExpectedResources is empty. k8s-nim-operator is the canonical
   example: registry assertFile + overlay expectedResources → hydrated
   content was inert noise on the resolved recipe. Skip hydration when
   ExpectedResources is non-empty so the artifact matches what runs.
   PR #1220 will drop both this skip and the validator-side mutex
   simultaneously so the two paths run side-by-side.

Tests:

- TestHydrateHealthCheckAsserts_ExpectedResourcesWins — covers finding 3.
- TestIsExecutableFile — 4-case table for the fallback probe.
- TestNewChainsawBinary_Available gains "available via fallback path"
  subtest and patches canonicalChainsawPath to avoid dev-machine flakes.

Refs #1219 (review), #1220 (follow-up that removes both the partition
gate and the ExpectedResources skip once the binary ships).
@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: e085a74f-dddc-4687-86b4-a26b97eedb82

📥 Commits

Reviewing files that changed from the base of the PR and between 57fbed0 and 68dc58b.

📒 Files selected for processing (7)
  • pkg/recipe/hydrate_test.go
  • pkg/recipe/metadata_store.go
  • validators/chainsaw/binary.go
  • validators/chainsaw/binary_test.go
  • validators/chainsaw/runner.go
  • validators/chainsaw/runner_test.go
  • validators/deployment/expected_resources.go

📝 Walkthrough

Walkthrough

This PR implements a conditional gating strategy for health check assertions based on component ExpectedResources settings and Chainsaw binary availability. When a component overlay declares ExpectedResources, recipe-side hydration of HealthCheckAsserts is suppressed. Validators then detect Chainsaw Test-format assertions and execute them only when the Chainsaw binary is available; remaining assertions execute regardless. The Chainsaw binary discovery now probes a fallback canonical path (/usr/local/bin/chainsaw) after exec.LookPath fails, and a new exported helper IsChainsawTest enables per-assertion format detection.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Possibly related PRs

  • NVIDIA/aicr#1231: Modifies pkg/recipe/metadata_store.go's hydrateHealthCheckAsserts logic and extends pkg/recipe/hydrate_test.go with coverage for hydration suppression and precedence rules

Suggested labels

size/M, area/tests

Suggested reviewers

  • lockwobr
  • njhensley
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly references a follow-up to PR #1231 addressing review findings on chainsaw hydration runtime, which directly aligns with the main changes addressing three specific issues in the code.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, explaining the motivation, findings, implementation details, testing, and risk assessment for all changes present in the PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/chainsaw-runtime-review-fixes

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Coverage Report ✅

Metric Value
Coverage 76.3%
Threshold 75%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-76.3%25-green)

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/NVIDIA/aicr/pkg/recipe 86.17% (+0.01%) 👍
github.com/NVIDIA/aicr/validators/chainsaw 0.00% (ø)
github.com/NVIDIA/aicr/validators/deployment 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/NVIDIA/aicr/pkg/recipe/metadata_store.go 86.20% (+0.07%) 384 (+2) 331 (+2) 53 👍
github.com/NVIDIA/aicr/validators/chainsaw/binary.go 0.00% (ø) 0 0 0
github.com/NVIDIA/aicr/validators/chainsaw/runner.go 0.00% (ø) 0 0 0
github.com/NVIDIA/aicr/validators/deployment/expected_resources.go 0.00% (ø) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@mchmarny mchmarny merged commit 767f17d into main Jun 8, 2026
33 of 34 checks passed
@mchmarny mchmarny deleted the feat/chainsaw-runtime-review-fixes branch June 8, 2026 22:20
mchmarny added a commit that referenced this pull request Jun 8, 2026
PR #1220 of epic #660. Activates the dormant Chainsaw runner so the
HealthCheckAsserts content hydrated by #1219 / #1234 actually executes
during aicr validate --phase deployment.

Image
-----

- validators/deployment/Dockerfile: new chainsaw-fetch multi-stage
  downloads the pinned chainsaw release tarball, verifies sha256
  against the value in .settings.yaml, and COPYs the binary into the
  distroless final stage at /usr/local/bin/chainsaw. Single-arch
  (linux/amd64); the linux_arm64 checksum is already carried in
  .settings.yaml so the multi-arch upgrade path is clear.
- Makefile image-validators target reads CHAINSAW_VERSION and
  CHAINSAW_SHA256_LINUX_AMD64 from .settings.yaml via yq and passes
  them as --build-arg only to the deployment phase build (conformance
  and performance images do not use chainsaw).
- go.mod's kyverno/chainsaw v0.2.15 is already in lockstep with the
  .settings.yaml pin.

Envelope
--------

- defaults.JobEnvelopeMargin (60s) added on top of
  defaults.ChainsawAssertTimeout (6m). The catalog timeout for the
  expected-resources validator becomes the Job's activeDeadlineSeconds
  — both must exceed the chainsaw inner timeout so the binary has
  headroom to terminate, clean up its temp dir, and flush logs before
  SIGKILL.
- recipes/validators/catalog.yaml's expected-resources timeout bumped
  5m → 7m. TestExpectedResourcesCatalogEnvelope in
  pkg/validator/catalog asserts the invariant.

Read-only allowlist
-------------------

- validators/chainsaw/ValidateTestReadOnly parses each registry-
  declared Chainsaw Test YAML and rejects any operation outside the
  {assert, error} allowlist. Bounds the cluster-admin RBAC posture of
  the deployment validator Job: registry content cannot invoke
  state-changing (apply/create/delete/patch/update) or side-effecting
  (script/command/wait/sleep/podLogs/events/describe/get) operations.
- Catch / finally / cleanup blocks are equivalently restricted; only
  assert/error appear in any pure read-only Chainsaw Test.
- Test sweeps every recipes/checks/*/health-check.yaml to verify
  compliance. PR #1223 will add the same enforcement at lint time.

Validator runtime
-----------------

- Drop the &&len(ref.ExpectedResources)==0 gate in
  validators/deployment/expected_resources.go. Both
  ExpectedResources and HealthCheckAsserts paths now run per
  component. CLI output is source-tagged [expectedResources] /
  [chainsaw] so operators can disambiguate when both paths report on
  the same component.
- Hard-remove ChainsawBinary.Available() and the associated PR #1231
  skip path. The binary is now a hard requirement of the deployment
  validator image; a regression that drops it surfaces as a clear
  RunTest error rather than a silent skip.
- pkg/recipe.hydrateHealthCheckAsserts no longer skips when
  ExpectedResources is set — the transitional skip added in #1234 is
  reverted in lockstep with the validator-side gate drop, so the
  artifact matches what runs.

GPU deep-check migration (deferred)
-----------------------------------

- clusterPolicyReady migration to registry-declared Chainsaw YAML is
  deferred to a focused follow-up. The migration must prove assertion
  equivalence against expected_resources_test.go's heavily-mocked
  ClusterPolicy state coverage; bundling that semantic argument here
  would bloat review. Documented in verifyGPUReadinessSignals doc.
- verifyNodewrightReady (formerly skyhookReady) and
  verifyDRAKubeletPluginReady stay in Go: both rely on dynamic names
  (recipe ManifestFiles / release-derived chart fullname) that static
  Chainsaw YAML cannot express without a chart-shape label upstream
  does not currently apply. Issue #660's deployer-neutrality
  constraint prohibits encoding the release-derived form.

Docs
----

docs/contributor/validator.md's chainsaw section now explains the
two-surface model (make check-health vs aicr validate), the runtime
binary path, source-tagged CLI output, and the read-only allowlist.

Live-cluster validation
-----------------------

Required by #660 acceptance criteria; deferred to mchmarny per
out-of-band agreement (no live cluster in PR author's environment).

Closes #1220. Refs #660, #1219, #1234.
mchmarny added a commit that referenced this pull request Jun 8, 2026
PR #1220 of epic #660. Activates the dormant Chainsaw runner so the
HealthCheckAsserts content hydrated by #1219 / #1234 actually executes
during aicr validate --phase deployment.

Image
-----

- validators/deployment/Dockerfile: new chainsaw-fetch multi-stage
  downloads the pinned chainsaw release tarball, verifies sha256
  against the value in .settings.yaml, and COPYs the binary into the
  distroless final stage at /usr/local/bin/chainsaw. Single-arch
  (linux/amd64); the linux_arm64 checksum is already carried in
  .settings.yaml so the multi-arch upgrade path is clear.
- Makefile image-validators target reads CHAINSAW_VERSION and
  CHAINSAW_SHA256_LINUX_AMD64 from .settings.yaml via yq and passes
  them as --build-arg only to the deployment phase build (conformance
  and performance images do not use chainsaw).
- go.mod's kyverno/chainsaw v0.2.15 is already in lockstep with the
  .settings.yaml pin.

Envelope
--------

- defaults.JobEnvelopeMargin (60s) added on top of
  defaults.ChainsawAssertTimeout (6m). The catalog timeout for the
  expected-resources validator becomes the Job's activeDeadlineSeconds
  — both must exceed the chainsaw inner timeout so the binary has
  headroom to terminate, clean up its temp dir, and flush logs before
  SIGKILL.
- recipes/validators/catalog.yaml's expected-resources timeout bumped
  5m → 7m. TestExpectedResourcesCatalogEnvelope in
  pkg/validator/catalog asserts the invariant.

Read-only allowlist
-------------------

- validators/chainsaw/ValidateTestReadOnly parses each registry-
  declared Chainsaw Test YAML and rejects any operation outside the
  {assert, error} allowlist. Bounds the cluster-admin RBAC posture of
  the deployment validator Job: registry content cannot invoke
  state-changing (apply/create/delete/patch/update) or side-effecting
  (script/command/wait/sleep/podLogs/events/describe/get) operations.
- Catch / finally / cleanup blocks are equivalently restricted; only
  assert/error appear in any pure read-only Chainsaw Test.
- Test sweeps every recipes/checks/*/health-check.yaml to verify
  compliance. PR #1223 will add the same enforcement at lint time.

Validator runtime
-----------------

- Drop the &&len(ref.ExpectedResources)==0 gate in
  validators/deployment/expected_resources.go. Both
  ExpectedResources and HealthCheckAsserts paths now run per
  component. CLI output is source-tagged [expectedResources] /
  [chainsaw] so operators can disambiguate when both paths report on
  the same component.
- Hard-remove ChainsawBinary.Available() and the associated PR #1231
  skip path. The binary is now a hard requirement of the deployment
  validator image; a regression that drops it surfaces as a clear
  RunTest error rather than a silent skip.
- pkg/recipe.hydrateHealthCheckAsserts no longer skips when
  ExpectedResources is set — the transitional skip added in #1234 is
  reverted in lockstep with the validator-side gate drop, so the
  artifact matches what runs.

GPU deep-check migration (deferred)
-----------------------------------

- clusterPolicyReady migration to registry-declared Chainsaw YAML is
  deferred to a focused follow-up. The migration must prove assertion
  equivalence against expected_resources_test.go's heavily-mocked
  ClusterPolicy state coverage; bundling that semantic argument here
  would bloat review. Documented in verifyGPUReadinessSignals doc.
- verifyNodewrightReady (formerly skyhookReady) and
  verifyDRAKubeletPluginReady stay in Go: both rely on dynamic names
  (recipe ManifestFiles / release-derived chart fullname) that static
  Chainsaw YAML cannot express without a chart-shape label upstream
  does not currently apply. Issue #660's deployer-neutrality
  constraint prohibits encoding the release-derived form.

Docs
----

docs/contributor/validator.md's chainsaw section now explains the
two-surface model (make check-health vs aicr validate), the runtime
binary path, source-tagged CLI output, and the read-only allowlist.

Live-cluster validation
-----------------------

Required by #660 acceptance criteria; deferred to mchmarny per
out-of-band agreement (no live cluster in PR author's environment).

Closes #1220. Refs #660, #1219, #1234.
mchmarny added a commit that referenced this pull request Jun 8, 2026
PR #1220 of epic #660. Activates the dormant Chainsaw runner so the
HealthCheckAsserts content hydrated by #1219 / #1234 actually executes
during aicr validate --phase deployment.

Image
-----

- validators/deployment/Dockerfile: new chainsaw-fetch multi-stage
  downloads the pinned chainsaw release tarball, verifies sha256
  against the value in .settings.yaml, and COPYs the binary into the
  distroless final stage at /usr/local/bin/chainsaw. Single-arch
  (linux/amd64); the linux_arm64 checksum is already carried in
  .settings.yaml so the multi-arch upgrade path is clear.
- Makefile image-validators target reads CHAINSAW_VERSION and
  CHAINSAW_SHA256_LINUX_AMD64 from .settings.yaml via yq and passes
  them as --build-arg only to the deployment phase build (conformance
  and performance images do not use chainsaw).
- go.mod's kyverno/chainsaw v0.2.15 is already in lockstep with the
  .settings.yaml pin.

Envelope
--------

- defaults.JobEnvelopeMargin (60s) added on top of
  defaults.ChainsawAssertTimeout (6m). The catalog timeout for the
  expected-resources validator becomes the Job's activeDeadlineSeconds
  — both must exceed the chainsaw inner timeout so the binary has
  headroom to terminate, clean up its temp dir, and flush logs before
  SIGKILL.
- recipes/validators/catalog.yaml's expected-resources timeout bumped
  5m → 7m. TestExpectedResourcesCatalogEnvelope in
  pkg/validator/catalog asserts the invariant.

Read-only allowlist
-------------------

- validators/chainsaw/ValidateTestReadOnly parses each registry-
  declared Chainsaw Test YAML and rejects any operation outside the
  {assert, error} allowlist. Bounds the cluster-admin RBAC posture of
  the deployment validator Job: registry content cannot invoke
  state-changing (apply/create/delete/patch/update) or side-effecting
  (script/command/wait/sleep/podLogs/events/describe/get) operations.
- Catch / finally / cleanup blocks are equivalently restricted; only
  assert/error appear in any pure read-only Chainsaw Test.
- Test sweeps every recipes/checks/*/health-check.yaml to verify
  compliance. PR #1223 will add the same enforcement at lint time.

Validator runtime
-----------------

- Drop the &&len(ref.ExpectedResources)==0 gate in
  validators/deployment/expected_resources.go. Both
  ExpectedResources and HealthCheckAsserts paths now run per
  component. CLI output is source-tagged [expectedResources] /
  [chainsaw] so operators can disambiguate when both paths report on
  the same component.
- Hard-remove ChainsawBinary.Available() and the associated PR #1231
  skip path. The binary is now a hard requirement of the deployment
  validator image; a regression that drops it surfaces as a clear
  RunTest error rather than a silent skip.
- pkg/recipe.hydrateHealthCheckAsserts no longer skips when
  ExpectedResources is set — the transitional skip added in #1234 is
  reverted in lockstep with the validator-side gate drop, so the
  artifact matches what runs.

GPU deep-check migration (deferred)
-----------------------------------

- clusterPolicyReady migration to registry-declared Chainsaw YAML is
  deferred to a focused follow-up. The migration must prove assertion
  equivalence against expected_resources_test.go's heavily-mocked
  ClusterPolicy state coverage; bundling that semantic argument here
  would bloat review. Documented in verifyGPUReadinessSignals doc.
- verifyNodewrightReady (formerly skyhookReady) and
  verifyDRAKubeletPluginReady stay in Go: both rely on dynamic names
  (recipe ManifestFiles / release-derived chart fullname) that static
  Chainsaw YAML cannot express without a chart-shape label upstream
  does not currently apply. Issue #660's deployer-neutrality
  constraint prohibits encoding the release-derived form.

Docs
----

docs/contributor/validator.md's chainsaw section now explains the
two-surface model (make check-health vs aicr validate), the runtime
binary path, source-tagged CLI output, and the read-only allowlist.

Live-cluster validation
-----------------------

Required by #660 acceptance criteria; deferred to mchmarny per
out-of-band agreement (no live cluster in PR author's environment).

Closes #1220. Refs #660, #1219, #1234.
mchmarny added a commit that referenced this pull request Jun 8, 2026
PR #1220 of epic #660. Activates the dormant Chainsaw runner so the
HealthCheckAsserts content hydrated by #1219 / #1234 actually executes
during aicr validate --phase deployment.

Image
-----

- validators/deployment/Dockerfile: new chainsaw-fetch multi-stage
  downloads the pinned chainsaw release tarball, verifies sha256
  against the value in .settings.yaml, and COPYs the binary into the
  distroless final stage at /usr/local/bin/chainsaw. Single-arch
  (linux/amd64); the linux_arm64 checksum is already carried in
  .settings.yaml so the multi-arch upgrade path is clear.
- Makefile image-validators target reads CHAINSAW_VERSION and
  CHAINSAW_SHA256_LINUX_AMD64 from .settings.yaml via yq and passes
  them as --build-arg only to the deployment phase build (conformance
  and performance images do not use chainsaw).
- go.mod's kyverno/chainsaw v0.2.15 is already in lockstep with the
  .settings.yaml pin.

Envelope
--------

- defaults.JobEnvelopeMargin (60s) added on top of
  defaults.ChainsawAssertTimeout (6m). The catalog timeout for the
  expected-resources validator becomes the Job's activeDeadlineSeconds
  — both must exceed the chainsaw inner timeout so the binary has
  headroom to terminate, clean up its temp dir, and flush logs before
  SIGKILL.
- recipes/validators/catalog.yaml's expected-resources timeout bumped
  5m → 7m. TestExpectedResourcesCatalogEnvelope in
  pkg/validator/catalog asserts the invariant.

Read-only allowlist
-------------------

- validators/chainsaw/ValidateTestReadOnly parses each registry-
  declared Chainsaw Test YAML and rejects any operation outside the
  {assert, error} allowlist. Bounds the cluster-admin RBAC posture of
  the deployment validator Job: registry content cannot invoke
  state-changing (apply/create/delete/patch/update) or side-effecting
  (script/command/wait/sleep/podLogs/events/describe/get) operations.
- Catch / finally / cleanup blocks are equivalently restricted; only
  assert/error appear in any pure read-only Chainsaw Test.
- Test sweeps every recipes/checks/*/health-check.yaml to verify
  compliance. PR #1223 will add the same enforcement at lint time.

Validator runtime
-----------------

- Drop the &&len(ref.ExpectedResources)==0 gate in
  validators/deployment/expected_resources.go. Both
  ExpectedResources and HealthCheckAsserts paths now run per
  component. CLI output is source-tagged [expectedResources] /
  [chainsaw] so operators can disambiguate when both paths report on
  the same component.
- Hard-remove ChainsawBinary.Available() and the associated PR #1231
  skip path. The binary is now a hard requirement of the deployment
  validator image; a regression that drops it surfaces as a clear
  RunTest error rather than a silent skip.
- pkg/recipe.hydrateHealthCheckAsserts no longer skips when
  ExpectedResources is set — the transitional skip added in #1234 is
  reverted in lockstep with the validator-side gate drop, so the
  artifact matches what runs.

GPU deep-check migration (deferred)
-----------------------------------

- clusterPolicyReady migration to registry-declared Chainsaw YAML is
  deferred to a focused follow-up. The migration must prove assertion
  equivalence against expected_resources_test.go's heavily-mocked
  ClusterPolicy state coverage; bundling that semantic argument here
  would bloat review. Documented in verifyGPUReadinessSignals doc.
- verifyNodewrightReady (formerly skyhookReady) and
  verifyDRAKubeletPluginReady stay in Go: both rely on dynamic names
  (recipe ManifestFiles / release-derived chart fullname) that static
  Chainsaw YAML cannot express without a chart-shape label upstream
  does not currently apply. Issue #660's deployer-neutrality
  constraint prohibits encoding the release-derived form.

Docs
----

docs/contributor/validator.md's chainsaw section now explains the
two-surface model (make check-health vs aicr validate), the runtime
binary path, source-tagged CLI output, and the read-only allowlist.

Live-cluster validation
-----------------------

Required by #660 acceptance criteria; deferred to mchmarny per
out-of-band agreement (no live cluster in PR author's environment).

Closes #1220. Refs #660, #1219, #1234.
mchmarny added a commit that referenced this pull request Jun 8, 2026
PR #1220 of epic #660. Activates the dormant Chainsaw runner so the
HealthCheckAsserts content hydrated by #1219 / #1234 actually executes
during aicr validate --phase deployment.

Image
-----

- validators/deployment/Dockerfile: new chainsaw-fetch multi-stage
  downloads the pinned chainsaw release tarball, verifies sha256
  against the value in .settings.yaml, and COPYs the binary into the
  distroless final stage at /usr/local/bin/chainsaw. Single-arch
  (linux/amd64); the linux_arm64 checksum is already carried in
  .settings.yaml so the multi-arch upgrade path is clear.
- Makefile image-validators target reads CHAINSAW_VERSION and
  CHAINSAW_SHA256_LINUX_AMD64 from .settings.yaml via yq and passes
  them as --build-arg only to the deployment phase build (conformance
  and performance images do not use chainsaw).
- go.mod's kyverno/chainsaw v0.2.15 is already in lockstep with the
  .settings.yaml pin.

Envelope
--------

- defaults.JobEnvelopeMargin (60s) added on top of
  defaults.ChainsawAssertTimeout (6m). The catalog timeout for the
  expected-resources validator becomes the Job's activeDeadlineSeconds
  — both must exceed the chainsaw inner timeout so the binary has
  headroom to terminate, clean up its temp dir, and flush logs before
  SIGKILL.
- recipes/validators/catalog.yaml's expected-resources timeout bumped
  5m → 7m. TestExpectedResourcesCatalogEnvelope in
  pkg/validator/catalog asserts the invariant.

Read-only allowlist
-------------------

- validators/chainsaw/ValidateTestReadOnly parses each registry-
  declared Chainsaw Test YAML and rejects any operation outside the
  {assert, error} allowlist. Bounds the cluster-admin RBAC posture of
  the deployment validator Job: registry content cannot invoke
  state-changing (apply/create/delete/patch/update) or side-effecting
  (script/command/wait/sleep/podLogs/events/describe/get) operations.
- Catch / finally / cleanup blocks are equivalently restricted; only
  assert/error appear in any pure read-only Chainsaw Test.
- Test sweeps every recipes/checks/*/health-check.yaml to verify
  compliance. PR #1223 will add the same enforcement at lint time.

Validator runtime
-----------------

- Drop the &&len(ref.ExpectedResources)==0 gate in
  validators/deployment/expected_resources.go. Both
  ExpectedResources and HealthCheckAsserts paths now run per
  component. CLI output is source-tagged [expectedResources] /
  [chainsaw] so operators can disambiguate when both paths report on
  the same component.
- Hard-remove ChainsawBinary.Available() and the associated PR #1231
  skip path. The binary is now a hard requirement of the deployment
  validator image; a regression that drops it surfaces as a clear
  RunTest error rather than a silent skip.
- pkg/recipe.hydrateHealthCheckAsserts no longer skips when
  ExpectedResources is set — the transitional skip added in #1234 is
  reverted in lockstep with the validator-side gate drop, so the
  artifact matches what runs.

GPU deep-check migration (deferred)
-----------------------------------

- clusterPolicyReady migration to registry-declared Chainsaw YAML is
  deferred to a focused follow-up. The migration must prove assertion
  equivalence against expected_resources_test.go's heavily-mocked
  ClusterPolicy state coverage; bundling that semantic argument here
  would bloat review. Documented in verifyGPUReadinessSignals doc.
- verifyNodewrightReady (formerly skyhookReady) and
  verifyDRAKubeletPluginReady stay in Go: both rely on dynamic names
  (recipe ManifestFiles / release-derived chart fullname) that static
  Chainsaw YAML cannot express without a chart-shape label upstream
  does not currently apply. Issue #660's deployer-neutrality
  constraint prohibits encoding the release-derived form.

Docs
----

docs/contributor/validator.md's chainsaw section now explains the
two-surface model (make check-health vs aicr validate), the runtime
binary path, source-tagged CLI output, and the read-only allowlist.

Live-cluster validation
-----------------------

Required by #660 acceptance criteria; deferred to mchmarny per
out-of-band agreement (no live cluster in PR author's environment).

Closes #1220. Refs #660, #1219, #1234.
mchmarny added a commit that referenced this pull request Jun 9, 2026
PR #1220 of epic #660. Activates the dormant Chainsaw runner so the
HealthCheckAsserts content hydrated by #1219 / #1234 actually executes
during aicr validate --phase deployment.

Image
-----

- validators/deployment/Dockerfile: new chainsaw-fetch multi-stage
  downloads the pinned chainsaw release tarball, verifies sha256
  against the value in .settings.yaml, and COPYs the binary into the
  distroless final stage at /usr/local/bin/chainsaw. Single-arch
  (linux/amd64); the linux_arm64 checksum is already carried in
  .settings.yaml so the multi-arch upgrade path is clear.
- Makefile image-validators target reads CHAINSAW_VERSION and
  CHAINSAW_SHA256_LINUX_AMD64 from .settings.yaml via yq and passes
  them as --build-arg only to the deployment phase build (conformance
  and performance images do not use chainsaw).
- go.mod's kyverno/chainsaw v0.2.15 is already in lockstep with the
  .settings.yaml pin.

Envelope
--------

- defaults.JobEnvelopeMargin (60s) added on top of
  defaults.ChainsawAssertTimeout (6m). The catalog timeout for the
  expected-resources validator becomes the Job's activeDeadlineSeconds
  — both must exceed the chainsaw inner timeout so the binary has
  headroom to terminate, clean up its temp dir, and flush logs before
  SIGKILL.
- recipes/validators/catalog.yaml's expected-resources timeout bumped
  5m → 7m. TestExpectedResourcesCatalogEnvelope in
  pkg/validator/catalog asserts the invariant.

Read-only allowlist
-------------------

- validators/chainsaw/ValidateTestReadOnly parses each registry-
  declared Chainsaw Test YAML and rejects any operation outside the
  {assert, error} allowlist. Bounds the cluster-admin RBAC posture of
  the deployment validator Job: registry content cannot invoke
  state-changing (apply/create/delete/patch/update) or side-effecting
  (script/command/wait/sleep/podLogs/events/describe/get) operations.
- Catch / finally / cleanup blocks are equivalently restricted; only
  assert/error appear in any pure read-only Chainsaw Test.
- Test sweeps every recipes/checks/*/health-check.yaml to verify
  compliance. PR #1223 will add the same enforcement at lint time.

Validator runtime
-----------------

- Drop the &&len(ref.ExpectedResources)==0 gate in
  validators/deployment/expected_resources.go. Both
  ExpectedResources and HealthCheckAsserts paths now run per
  component. CLI output is source-tagged [expectedResources] /
  [chainsaw] so operators can disambiguate when both paths report on
  the same component.
- Hard-remove ChainsawBinary.Available() and the associated PR #1231
  skip path. The binary is now a hard requirement of the deployment
  validator image; a regression that drops it surfaces as a clear
  RunTest error rather than a silent skip.
- pkg/recipe.hydrateHealthCheckAsserts no longer skips when
  ExpectedResources is set — the transitional skip added in #1234 is
  reverted in lockstep with the validator-side gate drop, so the
  artifact matches what runs.

GPU deep-check migration (deferred)
-----------------------------------

- clusterPolicyReady migration to registry-declared Chainsaw YAML is
  deferred to a focused follow-up. The migration must prove assertion
  equivalence against expected_resources_test.go's heavily-mocked
  ClusterPolicy state coverage; bundling that semantic argument here
  would bloat review. Documented in verifyGPUReadinessSignals doc.
- verifyNodewrightReady (formerly skyhookReady) and
  verifyDRAKubeletPluginReady stay in Go: both rely on dynamic names
  (recipe ManifestFiles / release-derived chart fullname) that static
  Chainsaw YAML cannot express without a chart-shape label upstream
  does not currently apply. Issue #660's deployer-neutrality
  constraint prohibits encoding the release-derived form.

Docs
----

docs/contributor/validator.md's chainsaw section now explains the
two-surface model (make check-health vs aicr validate), the runtime
binary path, source-tagged CLI output, and the read-only allowlist.

Live-cluster validation
-----------------------

Required by #660 acceptance criteria; deferred to mchmarny per
out-of-band agreement (no live cluster in PR author's environment).

Closes #1220. Refs #660, #1219, #1234.
mchmarny added a commit that referenced this pull request Jun 9, 2026
PR #1220 of epic #660. Activates the dormant Chainsaw runner so the
HealthCheckAsserts content hydrated by #1219 / #1234 actually executes
during aicr validate --phase deployment.

Image
-----

- validators/deployment/Dockerfile: new chainsaw-fetch multi-stage
  downloads the pinned chainsaw release tarball, verifies sha256
  against the value in .settings.yaml, and COPYs the binary into the
  distroless final stage at /usr/local/bin/chainsaw. Single-arch
  (linux/amd64); the linux_arm64 checksum is already carried in
  .settings.yaml so the multi-arch upgrade path is clear.
- Makefile image-validators target reads CHAINSAW_VERSION and
  CHAINSAW_SHA256_LINUX_AMD64 from .settings.yaml via yq and passes
  them as --build-arg only to the deployment phase build (conformance
  and performance images do not use chainsaw).
- go.mod's kyverno/chainsaw v0.2.15 is already in lockstep with the
  .settings.yaml pin.

Envelope
--------

- defaults.JobEnvelopeMargin (60s) added on top of
  defaults.ChainsawAssertTimeout (6m). The catalog timeout for the
  expected-resources validator becomes the Job's activeDeadlineSeconds
  — both must exceed the chainsaw inner timeout so the binary has
  headroom to terminate, clean up its temp dir, and flush logs before
  SIGKILL.
- recipes/validators/catalog.yaml's expected-resources timeout bumped
  5m → 7m. TestExpectedResourcesCatalogEnvelope in
  pkg/validator/catalog asserts the invariant.

Read-only allowlist
-------------------

- validators/chainsaw/ValidateTestReadOnly parses each registry-
  declared Chainsaw Test YAML and rejects any operation outside the
  {assert, error} allowlist. Bounds the cluster-admin RBAC posture of
  the deployment validator Job: registry content cannot invoke
  state-changing (apply/create/delete/patch/update) or side-effecting
  (script/command/wait/sleep/podLogs/events/describe/get) operations.
- Catch / finally / cleanup blocks are equivalently restricted; only
  assert/error appear in any pure read-only Chainsaw Test.
- Test sweeps every recipes/checks/*/health-check.yaml to verify
  compliance. PR #1223 will add the same enforcement at lint time.

Validator runtime
-----------------

- Drop the &&len(ref.ExpectedResources)==0 gate in
  validators/deployment/expected_resources.go. Both
  ExpectedResources and HealthCheckAsserts paths now run per
  component. CLI output is source-tagged [expectedResources] /
  [chainsaw] so operators can disambiguate when both paths report on
  the same component.
- Hard-remove ChainsawBinary.Available() and the associated PR #1231
  skip path. The binary is now a hard requirement of the deployment
  validator image; a regression that drops it surfaces as a clear
  RunTest error rather than a silent skip.
- pkg/recipe.hydrateHealthCheckAsserts no longer skips when
  ExpectedResources is set — the transitional skip added in #1234 is
  reverted in lockstep with the validator-side gate drop, so the
  artifact matches what runs.

GPU deep-check migration (deferred)
-----------------------------------

- clusterPolicyReady migration to registry-declared Chainsaw YAML is
  deferred to a focused follow-up. The migration must prove assertion
  equivalence against expected_resources_test.go's heavily-mocked
  ClusterPolicy state coverage; bundling that semantic argument here
  would bloat review. Documented in verifyGPUReadinessSignals doc.
- verifyNodewrightReady (formerly skyhookReady) and
  verifyDRAKubeletPluginReady stay in Go: both rely on dynamic names
  (recipe ManifestFiles / release-derived chart fullname) that static
  Chainsaw YAML cannot express without a chart-shape label upstream
  does not currently apply. Issue #660's deployer-neutrality
  constraint prohibits encoding the release-derived form.

Docs
----

docs/contributor/validator.md's chainsaw section now explains the
two-surface model (make check-health vs aicr validate), the runtime
binary path, source-tagged CLI output, and the read-only allowlist.

Live-cluster validation
-----------------------

Required by #660 acceptance criteria; deferred to mchmarny per
out-of-band agreement (no live cluster in PR author's environment).

Closes #1220. Refs #660, #1219, #1234.
mchmarny added a commit that referenced this pull request Jun 9, 2026
Closes #1222. With this merged, epic #660 is fully done.

The PR-time runtime contract from #1219/#1220/#1223 is in place; this
PR closes the depth-of-coverage gap on the 21 health checks that
landed before the epic and were shallow by today's standards. Per the
audit in the PR body, three classes of enhancement:

1. DaemonSet partial-readiness fix (2 files):
   aws-efa, nvidia-dra-driver-gpu were gating on `numberReady > 0`,
   which passed on partial failures. Switched to
   `numberReady == desiredNumberScheduled` with
   `desiredNumberScheduled > 0` as the vacuous-pass guard — same
   pattern that fixed NFD in #1234 and the new gke-nccl-tcpxo in
   #1243.

2. Owned-CRD Established=True assertions (4 components):
   - gpu-operator: clusterpolicies.nvidia.com + the singleton
     `cluster-policy` CR with status.state == "ready". This
     MIGRATES the verifyClusterPolicyReady Go check that PR #1235
     task 7 deferred for assertion-equivalence proof; the static
     Chainsaw assertion is now the canonical readiness signal.
   - dynamo-platform: dynamocomponentdeployments.nvidia.com.
   - nodewright-operator: skyhooks.skyhook.nvidia.com.
   - cert-manager: certificates / issuers / clusterissuers
     .cert-manager.io.
   CRD names verified against
   tests/chainsaw/ai-conformance/cluster/assert-crds.yaml and the
   Go verifyClusterPolicyReady check.

3. cert-manager scope fix:
   Previously asserted only the controller Deployment, missing
   cainjector + webhook. The webhook in particular is load-bearing
   for admission of any Certificate/Issuer CR; a healthy controller
   with a dead webhook would silently pass while the cluster rejects
   every cert-manager CR. Now asserts all three Deployments.

4. Universal container-state error coverage (21 files):
   Every check that previously only gated on pod phase (Pending /
   Failed / Unknown) now also rejects pods stuck in Running phase
   with a known-bad container-waiting reason (CrashLoopBackOff,
   ImagePullBackOff, ErrImagePull, CreateContainerConfigError). The
   phase-only filter missed CrashLoopBackOff entirely because such
   pods typically remain in Running phase with unhealthy
   containers. Same pattern NFD adopted via #1234. Label-scoped in
   shared namespaces (kube-system, monitoring) to avoid sibling
   false-matches; namespace-only otherwise.

Also addresses two post-merge review comments on PR #1244:

- validators/chainsaw/allowlist_test.go: provider prefix changed
  from "." to "" to match the canonical defaultEmbeddedProvider
  used at runtime (head-scratch elimination, no behavior change —
  filepath.Join cleans both forms equivalently).
- Same file: added an explicit CROSS-TEST DEPENDENCY comment block
  noting that empty-assertFile and unreadable-path failure modes
  are owned by pkg/recipe.TestComponentRegistry_RequiresHealthCheck.
  If that test is ever removed or weakened, this test must be
  strengthened in lockstep so a broken registry can't silently
  green-light here.

Out of scope (deferred to follow-up work):

- CRD Established=True for the remaining 6 operators that own CRDs
  but where I lack high-confidence enumeration without pulling the
  charts: gatekeeper, kueue, kubeflow-trainer, network-operator,
  k8s-nim-operator, slinky-slurm-operator. Each needs verification
  against its pinned chart version; bundling them blindly would
  risk introducing typo'd names that fail on live clusters.
  Tracking as a possible follow-up after this PR proves the
  audit-and-enhance pattern works end-to-end.
- StatefulSet readiness for kube-prometheus-stack's Prometheus /
  Alertmanager: similar reasoning — the chart's exact StatefulSet
  names and label conventions need verification.

Allowlist sweep (TestValidateTestReadOnly_RegistryContent) +
registry lint guard (TestComponentRegistry_RequiresHealthCheck)
both pass: 27/27 components valid. Live-cluster validation
deferred to mchmarny per the same agreement as #1235 / #1243.

Refs #660 (epic), #1219, #1220, #1221, #1223, #1234.
mchmarny added a commit that referenced this pull request Jun 9, 2026
Closes #1222. With this merged, epic #660 is fully done.

The PR-time runtime contract from #1219/#1220/#1223 is in place; this
PR closes the depth-of-coverage gap on the 21 health checks that
landed before the epic and were shallow by today's standards. Per the
audit in the PR body, three classes of enhancement:

1. DaemonSet partial-readiness fix (2 files):
   aws-efa, nvidia-dra-driver-gpu were gating on `numberReady > 0`,
   which passed on partial failures. Switched to
   `numberReady == desiredNumberScheduled` with
   `desiredNumberScheduled > 0` as the vacuous-pass guard — same
   pattern that fixed NFD in #1234 and the new gke-nccl-tcpxo in
   #1243.

2. Owned-CRD Established=True assertions (4 components):
   - gpu-operator: clusterpolicies.nvidia.com + the singleton
     `cluster-policy` CR with status.state == "ready". This
     MIGRATES the verifyClusterPolicyReady Go check that PR #1235
     task 7 deferred for assertion-equivalence proof; the static
     Chainsaw assertion is now the canonical readiness signal.
   - dynamo-platform: dynamocomponentdeployments.nvidia.com.
   - nodewright-operator: skyhooks.skyhook.nvidia.com.
   - cert-manager: certificates / issuers / clusterissuers
     .cert-manager.io.
   CRD names verified against
   tests/chainsaw/ai-conformance/cluster/assert-crds.yaml and the
   Go verifyClusterPolicyReady check.

3. cert-manager scope fix:
   Previously asserted only the controller Deployment, missing
   cainjector + webhook. The webhook in particular is load-bearing
   for admission of any Certificate/Issuer CR; a healthy controller
   with a dead webhook would silently pass while the cluster rejects
   every cert-manager CR. Now asserts all three Deployments.

4. Universal container-state error coverage (21 files):
   Every check that previously only gated on pod phase (Pending /
   Failed / Unknown) now also rejects pods stuck in Running phase
   with a known-bad container-waiting reason (CrashLoopBackOff,
   ImagePullBackOff, ErrImagePull, CreateContainerConfigError). The
   phase-only filter missed CrashLoopBackOff entirely because such
   pods typically remain in Running phase with unhealthy
   containers. Same pattern NFD adopted via #1234. Label-scoped in
   shared namespaces (kube-system, monitoring) to avoid sibling
   false-matches; namespace-only otherwise.

Also addresses two post-merge review comments on PR #1244:

- validators/chainsaw/allowlist_test.go: provider prefix changed
  from "." to "" to match the canonical defaultEmbeddedProvider
  used at runtime (head-scratch elimination, no behavior change —
  filepath.Join cleans both forms equivalently).
- Same file: added an explicit CROSS-TEST DEPENDENCY comment block
  noting that empty-assertFile and unreadable-path failure modes
  are owned by pkg/recipe.TestComponentRegistry_RequiresHealthCheck.
  If that test is ever removed or weakened, this test must be
  strengthened in lockstep so a broken registry can't silently
  green-light here.

Out of scope (deferred to follow-up work):

- CRD Established=True for the remaining 6 operators that own CRDs
  but where I lack high-confidence enumeration without pulling the
  charts: gatekeeper, kueue, kubeflow-trainer, network-operator,
  k8s-nim-operator, slinky-slurm-operator. Each needs verification
  against its pinned chart version; bundling them blindly would
  risk introducing typo'd names that fail on live clusters.
  Tracking as a possible follow-up after this PR proves the
  audit-and-enhance pattern works end-to-end.
- StatefulSet readiness for kube-prometheus-stack's Prometheus /
  Alertmanager: similar reasoning — the chart's exact StatefulSet
  names and label conventions need verification.

Allowlist sweep (TestValidateTestReadOnly_RegistryContent) +
registry lint guard (TestComponentRegistry_RequiresHealthCheck)
both pass: 27/27 components valid. Live-cluster validation
deferred to mchmarny per the same agreement as #1235 / #1243.

Refs #660 (epic), #1219, #1220, #1221, #1223, #1234.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants