feat(recipe): hydrate healthCheck.assertFile + suppression sentinel#1231

Merged
mchmarny merged 2 commits into main from feat/1219-hydrate-healthcheck-assertfile
Jun 8, 2026

Conversation

@mchmarny

@mchmarny mchmarny commented Jun 8, 2026

Member

Summary

Hydrate registry-declared healthCheck.assertFile content onto
ComponentRef.HealthCheckAsserts during recipe resolution. Add an
overlay-driven suppression sentinel as the rollback path. PR 1 of 5
under epic #660.

Motivation / Context

Today recipes/checks/* carries Chainsaw health checks for 22 registry
components, referenced by healthCheck.assertFile in
recipes/registry.yaml, but the content is never loaded:
pkg/recipe/metadata.go intentionally skipped hydration because the
deployment validator image does not ship the chainsaw binary, so
populating ref.HealthCheckAsserts would activate
validators/deployment/expected_resources.go:86 and fail every
component invocation. The dormant content is the foundation for
deepening aicr validate --phase deployment; see epic #660
(predecessor #622, ADR #631).

This PR introduces the hydration mechanism only. Runtime behavior is
preserved by gating activation on ChainsawBinary.Available(); the
binary ships in #1220.

Fixes: #1219
Related: #660, #1220, #622, #631

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • Recipe engine / data (pkg/recipe)
  • Validator (validators/deployment, validators/chainsaw)

Implementation Notes

  • Hydration (pkg/recipe/metadata_store.go): applyRegistryDefaults
    now calls hydrateHealthCheckAsserts, which loads each
    registry-declared assertFile through the bound DataProvider and
    stamps the content onto the matching ComponentRef. Routing goes
    through the per-result provider so per-tenant isolation and external
    --data overlays continue to work.
  • Suppression sentinel: new HealthCheckSkip bool field on
    ComponentRef. Leaf overlay or external --data overlay sets it to
    clear inherited hydration (rollback for a regressing upstream check).
    Merge semantics mirror Cleanup (set-if-true).
    mixinComponentRefSafeForMerge rejects mixins that set it so a mixin
    cannot silently suppress an inherited check.
  • Inline-wins: when an overlay declares HealthCheckAsserts inline,
    hydration leaves it alone — never silently overwrite caller intent.
  • Disabled components: hydrated unconditionally so the on-disk
    recipe.yaml artifact carries the same content regardless of
    enablement; runtime filtering by enabledComponentRefs strips them
    from execution.
  • Runtime gate (validators/chainsaw/binary.go,
    validators/deployment/expected_resources.go): ChainsawBinary gains
    an Available() method (returns false when exec.LookPath("chainsaw")
    fails). The deployment validator checks it before dispatching
    hydrated assert content; absent binary logs once and skips, preserving
    today's behavior (these components had no validation path at all
    pre-hydration). PR #1220 (Ship chainsaw binary + wire deployment-phase
    runner, #660 PR 2) ships the binary, at which point the gate lights up.
  • Stale NOTE rewrite: the // NOTE: healthCheck.assertFile content is intentionally NOT loaded here. block in
    pkg/recipe/metadata.go:199-205 now describes the new contract
    instead of the abandoned reason.
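
The hydration rules above (skip sentinel, inline-wins, no-op for unknown components or components without an assertFile, wrapped read errors) can be sketched roughly as follows. This is a minimal illustration, not the code in pkg/recipe/metadata_store.go: the trimmed `ComponentRef`, the `readFileFunc` type, the `hydrate` function, the `assertFiles` map, and `errProviderDown` are all hypothetical stand-ins, and the real helper returns structured aicrerrors rather than `fmt.Errorf` values.

```go
package main

import (
	"errors"
	"fmt"
)

// ComponentRef is a trimmed, hypothetical stand-in for the recipe type;
// only the fields discussed in the notes above are modeled.
type ComponentRef struct {
	Name               string
	HealthCheckAsserts string
	HealthCheckSkip    bool
}

// readFileFunc stands in for the bound DataProvider's read method.
type readFileFunc func(path string) ([]byte, error)

// errProviderDown is an illustrative read failure.
var errProviderDown = errors.New("provider unavailable")

// hydrate stamps registry assert content onto ref unless the ref is
// skipped, already carries inline content, or has no assertFile.
// assertFiles maps a component name to its registry-declared path.
func hydrate(ref *ComponentRef, assertFiles map[string]string, read readFileFunc) error {
	if ref.HealthCheckSkip {
		return nil // suppression sentinel: do not stamp anything
	}
	if ref.HealthCheckAsserts != "" {
		return nil // inline content wins over registry content
	}
	path, ok := assertFiles[ref.Name]
	if !ok || path == "" {
		return nil // unknown component or no assertFile declared
	}
	data, err := read(path)
	if err != nil {
		// Keep component and path in the message so the failure is traceable.
		return fmt.Errorf("component %q: read assertFile %q: %w", ref.Name, path, err)
	}
	ref.HealthCheckAsserts = string(data)
	return nil
}

func main() {
	files := map[string]string{"example-component": "checks/example-component/health-check.yaml"}
	read := func(string) ([]byte, error) { return []byte("kind: Test"), nil }

	ref := ComponentRef{Name: "example-component"}
	if err := hydrate(&ref, files, read); err != nil {
		fmt.Println("hydrate failed:", err)
		return
	}
	fmt.Printf("hydrated %d bytes\n", len(ref.HealthCheckAsserts))
}
```

Passing the reader as a function keeps the sketch independent of any particular provider, which loosely mirrors the note that routing goes through the per-result provider.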

Testing

go test -race -count=1 ./pkg/recipe/... ./validators/...
golangci-lint run -c .golangci.yaml ./pkg/recipe/... ./validators/deployment/... ./validators/chainsaw/...
  • New pkg/recipe/hydrate_test.go covers:
    • hydration from registry assertFile
    • skip sentinel suppresses
    • inline HealthCheckAsserts is preserved (overlay wins)
    • unknown-component no-op
    • registered-but-no-assertFile no-op
    • provider read failure → wrapped ErrCodeInternal with component + assertFile context
    • nil-provider falls back to embedded
    • merge propagates HealthCheckSkip (set-if-true)
    • mixin merge rejects HealthCheckSkip
  • New validators/chainsaw/binary_test.go covers both branches of
    ChainsawBinary.Available() (PATH miss → false, PATH hit → true).
  • Coverage: pkg/recipe 86.0% → 86.2% (+0.2%); validators/chainsaw
    11.2% → 14.4% (+3.2%).
  • All existing pkg/recipe, validators/deployment,
    validators/chainsaw tests pass under -race.
  • pkg/trust TestUpdate_Success fails locally due to Sigstore CDN
    network restrictions in my sandbox — pre-existing, not from this PR.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: Validation outcomes are unchanged — the chainsaw
dispatch path is skipped because ChainsawBinary.Available() returns
false until #1220 ships the binary, and components with a registry
assertFile had no validation path at all pre-hydration. One
user-visible delta: a single slog.Warn("chainsaw binary not available; skipping registry-declared health check assertions", ...)
now fires per deployment-phase run when hydrated content is present
and the binary is missing. The resolved recipe + bundled recipe.yaml
now carry healthCheckAsserts content (previously empty); downstream
consumers tolerate the extra field since it was already in the
ComponentRef schema. To revert: drop the hydration call from
applyRegistryDefaults and remove HealthCheckSkip / the Available()
gate.
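
A rough sketch of the gate described in the rollout notes, assuming a constructor that probes PATH once and a dispatcher that warns a single time per run when the binary is missing. The names `chainsawBinary`, `newChainsawBinary`, `dispatch`, and `errNotFound` are illustrative only; the lookup function is injected so the miss branch can be exercised without touching the real PATH.

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
	"os/exec"
)

// errNotFound is an illustrative lookup failure for the injected probe.
var errNotFound = errors.New("executable not found")

// chainsawBinary is a hypothetical stand-in that records availability.
type chainsawBinary struct{ available bool }

// newChainsawBinary probes for the binary once; lookPath is injected
// (exec.LookPath in production use) to keep the sketch testable.
func newChainsawBinary(lookPath func(string) (string, error)) chainsawBinary {
	_, err := lookPath("chainsaw")
	return chainsawBinary{available: err == nil}
}

// Available reports whether the probe found the binary.
func (b chainsawBinary) Available() bool { return b.available }

// dispatch runs every assert when the binary is available. When it is
// not, it logs one warning for the whole batch and skips. It returns
// the number of asserts that completed.
func dispatch(b chainsawBinary, asserts []string, run func(string) error) (int, error) {
	if len(asserts) == 0 {
		return 0, nil
	}
	if !b.Available() {
		slog.Warn("chainsaw binary not available; skipping registry-declared health check assertions",
			"skipped", len(asserts))
		return 0, nil
	}
	for i, a := range asserts {
		if err := run(a); err != nil {
			return i, err
		}
	}
	return len(asserts), nil
}

func main() {
	bin := newChainsawBinary(exec.LookPath)
	n, err := dispatch(bin, []string{"assert-a", "assert-b"}, func(string) error { return nil })
	fmt.Println("dispatched:", n, "err:", err)
}
```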

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (changed paths)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • No user-facing CLI/API/recipe-field behavior changed (docs
    update follows in #1220, "Ship chainsaw binary + wire deployment-phase runner" (#660 PR 2), when the runtime path activates)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@github-actions

github-actions Bot commented Jun 8, 2026

Contributor

Coverage Report ✅

| Metric | Value |
|---|---|
| Coverage | 76.2% |
| Threshold | 75% |
| Status | Pass |
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-76.2%25-green)

Merging this branch will increase overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/NVIDIA/aicr/pkg/recipe | 86.16% (+0.15%) | 👍 |
| github.com/NVIDIA/aicr/validators/chainsaw | 0.00% (ø) | |
| github.com/NVIDIA/aicr/validators/deployment | 0.00% (ø) | |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/NVIDIA/aicr/pkg/recipe/metadata.go | 91.51% (+0.05%) | 318 (+2) | 291 (+2) | 27 | 👍 |
| github.com/NVIDIA/aicr/pkg/recipe/metadata_store.go | 86.13% (+0.73%) | 382 (+19) | 329 (+19) | 53 | 👍 |
| github.com/NVIDIA/aicr/validators/chainsaw/binary.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/NVIDIA/aicr/validators/deployment/expected_resources.go | 0.00% (ø) | 0 | 0 | 0 | |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@coderabbitai

coderabbitai Bot commented Jun 8, 2026


Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 2c700712-e978-408c-a218-bd2834e3ebcf

📥 Commits

Reviewing files that changed from the base of the PR and between 6d03083 and 330e599.

📒 Files selected for processing (6)
  • pkg/recipe/hydrate_test.go
  • pkg/recipe/metadata.go
  • pkg/recipe/metadata_store.go
  • validators/chainsaw/binary.go
  • validators/chainsaw/binary_test.go
  • validators/deployment/expected_resources.go

📝 Walkthrough

This PR adds ComponentRef.HealthCheckSkip and defers registry healthCheck.assertFile hydration to a new hydrateHealthCheckAsserts helper that reads assert files via the bound DataProvider (with a defaults.FileReadTimeout), stamps content onto ComponentRef.HealthCheckAsserts unless skipped/inline/absent, and returns structured errors on read failure. mergeComponentRef and mixinComponentRefSafeForMerge are updated to enforce set-if-true merge semantics and mixin-safety for HealthCheckSkip. ChainsawBinary gains Available() and NewChainsawBinary probes exec.LookPath to set availability; the deployment validator skips running chainsaw and logs once when unavailable. Tests cover hydration paths, error handling, merge/mixin rules, and chainsaw availability.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

  • #660: Implements the recipe-resolution hydration of healthCheck.assertFile into ComponentRef.HealthCheckAsserts and the suppression sentinel; this PR implements the hydration and gating pieces referenced by the epic.
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately and concisely describes the main feature: hydrating registry-declared healthCheck.assertFile content and adding a suppression sentinel mechanism. |
| Description check | ✅ Passed | The PR description is comprehensive and directly related to the changeset, detailing the hydration mechanism, suppression sentinel, runtime gating, and test coverage. |
| Linked Issues check | ✅ Passed | The PR fully implements the stated objectives from issue #1219: hydration of healthCheck.assertFile into ComponentRef.HealthCheckAsserts [#1219], suppression sentinel via HealthCheckSkip [#1219], overlay merge semantics [#1219], inline-wins preservation [#1219], disabled-component handling [#1219], NOTE rewrite [#1219], ChainsawBinary.Available() gate [#1219], and comprehensive test coverage [#1219]. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to the stated mechanism: recipe hydration, suppression sentinel, and availability gating. No changes to runtime behavior or validator activation; no backfilling or enhancement of checks. The PR explicitly defers shipping the chainsaw binary and activation to later PRs. |


@mchmarny mchmarny force-pushed the feat/1219-hydrate-healthcheck-assertfile branch from 7986c76 to 6d03083 Compare June 8, 2026 21:31

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/recipe/metadata_store.go`:
- Around line 1208-1215: The current use of aicrerrors.WrapWithContext when
handling provider.ReadFile failures (the healthCheck.assertFile branch) discards
provider-level error codes; replace the WrapWithContext call with a propagation
that preserves existing structured codes by using errors.PropagateOrWrap(err,
aicrerrors.ErrCodeInternal, "failed to read healthCheck.assertFile") while
keeping the same context map; update the error return site (the location that
currently calls aicrerrors.WrapWithContext) to call errors.PropagateOrWrap so
timeouts/cancellations from ReadFile survive propagation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 30ed8f1d-3486-4570-a2f2-8fad702f2747

📥 Commits

Reviewing files that changed from the base of the PR and between 7986c76 and 6d03083.

📒 Files selected for processing (6)
  • pkg/recipe/hydrate_test.go
  • pkg/recipe/metadata.go
  • pkg/recipe/metadata_store.go
  • validators/chainsaw/binary.go
  • validators/chainsaw/binary_test.go
  • validators/deployment/expected_resources.go

Comment thread pkg/recipe/metadata_store.go Outdated
lockwobr
lockwobr previously approved these changes Jun 8, 2026
Loads each registry-declared healthCheck.assertFile through the bound
DataProvider during recipe resolution and stamps the content onto the
matching ComponentRef.HealthCheckAsserts. This is the foundation for
running the existing Chainsaw assertions under recipes/checks/* as the
deployment-phase readiness check (epic #660).

PR 1 of 5 — mechanism only. Hydrated content is dormant at runtime
because the deployment validator image does not yet ship the chainsaw
binary; activation is gated in validators/deployment on
ChainsawBinary.Available() so the new chainsaw dispatch path is a
no-op until issue #1220 lands. Without this gate every component with
a registry-declared assertFile would trigger a failed binary invocation
once hydration populates the field.

Adds HealthCheckSkip overlay sentinel as the rollback path for a
regressing upstream check (and the way an external --data overlay
clears registry-declared content). Merge semantics mirror Cleanup:
set-if-true. mixinComponentRefSafeForMerge rejects mixins that set it
so a mixin cannot silently suppress an inherited check.
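
The set-if-true merge and the mixin guard described above might look roughly like this. It is a sketch under assumptions: `mergeRef`, `mixinSafe`, `errMixinSkip`, and the trimmed `ComponentRef` are hypothetical names, and the real mergeComponentRef and mixinComponentRefSafeForMerge handle many more fields.

```go
package main

import (
	"errors"
	"fmt"
)

// ComponentRef is a trimmed, hypothetical stand-in with only the two
// boolean fields that share set-if-true merge semantics.
type ComponentRef struct {
	Name            string
	Cleanup         bool
	HealthCheckSkip bool
}

// errMixinSkip is an illustrative rejection error.
var errMixinSkip = errors.New("mixin must not set HealthCheckSkip")

// mergeRef applies overlay onto base. A true value in the overlay sets
// the field; a false value never clears an inherited true.
func mergeRef(base, overlay ComponentRef) ComponentRef {
	out := base
	if overlay.Cleanup {
		out.Cleanup = true
	}
	if overlay.HealthCheckSkip {
		out.HealthCheckSkip = true
	}
	return out
}

// mixinSafe rejects a mixin that tries to suppress an inherited check.
func mixinSafe(mixin ComponentRef) error {
	if mixin.HealthCheckSkip {
		return fmt.Errorf("component %q: %w", mixin.Name, errMixinSkip)
	}
	return nil
}

func main() {
	base := ComponentRef{Name: "example-component"}
	leaf := ComponentRef{Name: "example-component", HealthCheckSkip: true}
	merged := mergeRef(base, leaf)
	fmt.Println("skip after leaf overlay:", merged.HealthCheckSkip)
	fmt.Println("mixin check:", mixinSafe(leaf))
}
```

One consequence of set-if-true is that a later overlay cannot re-enable a check that an earlier overlay suppressed; in this sketch the flag only ever moves from false to true.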

Rewrites the stale "intentionally NOT loaded here" NOTE in
ApplyRegistryDefaults to describe the new hydration contract instead
of the abandoned reason.

Refs #1219, #660.
@mchmarny mchmarny force-pushed the feat/1219-hydrate-healthcheck-assertfile branch from 6d03083 to 330e599 Compare June 8, 2026 21:40
@mchmarny mchmarny enabled auto-merge (squash) June 8, 2026 21:41

@yuanchen8911 yuanchen8911 left a comment

Contributor

Todo

@mchmarny mchmarny merged commit 57fbed0 into main Jun 8, 2026
33 of 34 checks passed
@mchmarny mchmarny deleted the feat/1219-hydrate-healthcheck-assertfile branch June 8, 2026 21:59

@yuanchen8911 yuanchen8911 left a comment

Contributor

Three inline findings below. All are latent today because the chainsaw binary isn't shipped yet (#1220 gates the branch off), but each becomes an active correctness gap the moment the binary lands.

// skip; this preserves the pre-hydration runtime behavior (these
// components had no validation path at all before).
bin := chainsaw.NewChainsawBinary()
if !bin.Available() {

@yuanchen8911 yuanchen8911 Jun 8, 2026

Contributor

Raw asserts silently skipped when the chainsaw binary is absent → false PASS.

This if !bin.Available() gate skips the entire chainsawAsserts batch. But chainsaw.Run dispatches raw K8s resource YAML to the Go assertion path (validators/chainsaw/runner.go:123-130, assertRawResources at :184-234) — only kind: Test content actually needs the CLI binary. So an inline overlay HealthCheckAsserts or an external --data raw assert is silently ignored, and deployment validation can pass without ever evaluating it.

Registry-declared checks are all kind: Test today, so this bites inline/external-data raw asserts specifically.

Suggested fix: partition the asserts — gate only the Test-format subset on bin.Available(), and always run the raw-YAML subset through the Go library.
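
The suggested partitioning could be sketched along these lines. The classifier here is a deliberately crude line-based check, written only for illustration; `looksLikeChainsawTest`, `partition`, and `runnable` are hypothetical names and do not represent the project's own classifier or runner.

```go
package main

import (
	"fmt"
	"strings"
)

// looksLikeChainsawTest is a simplified, hypothetical classifier: it
// treats a document as Test-format when it has an unindented
// "kind: Test" line and a chainsaw.kyverno.io apiVersion line.
func looksLikeChainsawTest(doc string) bool {
	hasKind, hasGroup := false, false
	for _, line := range strings.Split(doc, "\n") {
		line = strings.TrimRight(line, " \t\r")
		switch {
		case line == "kind: Test":
			hasKind = true
		case strings.HasPrefix(line, "apiVersion: chainsaw.kyverno.io/"):
			hasGroup = true
		}
	}
	return hasKind && hasGroup
}

// partition splits asserts into Test-format documents, which need the
// CLI binary, and raw resource YAML, which does not.
func partition(asserts []string) (tests, raw []string) {
	for _, a := range asserts {
		if looksLikeChainsawTest(a) {
			tests = append(tests, a)
		} else {
			raw = append(raw, a)
		}
	}
	return tests, raw
}

// runnable returns the asserts that should execute: raw YAML always,
// Test-format only when the binary is available.
func runnable(asserts []string, binaryAvailable bool) []string {
	tests, raw := partition(asserts)
	out := append([]string{}, raw...)
	if binaryAvailable {
		out = append(out, tests...)
	}
	return out
}

func main() {
	testDoc := "apiVersion: chainsaw.kyverno.io/v1alpha1\nkind: Test\nmetadata:\n  name: t\n"
	rawDoc := "apiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: d\n"
	fmt.Println("without binary:", len(runnable([]string{testDoc, rawDoc}, false)))
	fmt.Println("with binary:", len(runnable([]string{testDoc, rawDoc}, true)))
}
```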

// Fall through with the canonical install path; RunTest will surface
// the missing-binary error if invoked, but Available() reports false
// so the deployment validator can short-circuit upstream.
return &chainsawBinary{binPath: "/usr/local/bin/chainsaw", available: false}

@yuanchen8911 yuanchen8911 Jun 8, 2026

Contributor

Available() reports false even when the fallback binary is callable.

NewChainsawBinary returns available: false whenever exec.LookPath("chainsaw") fails, without checking the documented fallback /usr/local/bin/chainsaw. If chainsaw is installed there but that dir isn't on PATH, RunTest would invoke it successfully — yet Available() reports false and the deployment validator skips all Test-format checks.

Suggested fix: os.Stat + executable-bit check the fallback path before returning unavailable, so the new gate doesn't disable a binary that RunTest can actually run.
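
A minimal sketch of that fallback probe, assuming the lookup and stat functions are injected so both branches can be exercised. `resolve` and `isExecutableFile` are hypothetical names; only the fallback path itself comes from the snippet above.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"os/exec"
)

// fallbackPath is the canonical install path quoted in the review.
const fallbackPath = "/usr/local/bin/chainsaw"

// isExecutableFile reports whether mode describes a regular file with
// at least one executable permission bit set.
func isExecutableFile(mode fs.FileMode) bool {
	return mode.IsRegular() && mode.Perm()&0o111 != 0
}

// resolve prefers a PATH hit, then falls back to the canonical path if
// it is an executable regular file. It returns the path that would be
// invoked and whether that path is usable.
func resolve(
	lookPath func(string) (string, error),
	stat func(string) (fs.FileInfo, error),
) (string, bool) {
	if p, err := lookPath("chainsaw"); err == nil {
		return p, true
	}
	if info, err := stat(fallbackPath); err == nil && isExecutableFile(info.Mode()) {
		return fallbackPath, true
	}
	return fallbackPath, false
}

func main() {
	path, ok := resolve(exec.LookPath, os.Stat)
	fmt.Printf("path=%s available=%t\n", path, ok)
}
```

Checking the mode bits is only an approximation of "callable": it does not account for the effective user or for mount options such as noexec.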

@@ -104,25 +104,40 @@ func checkExpectedResources(ctx *validators.Context) error {
failures = append(failures, verifyGPUReadinessSignals(ctx, enabledRefs)...)

if len(chainsawAsserts) > 0 {

@yuanchen8911 yuanchen8911 Jun 8, 2026

Contributor

Hydrated HealthCheckAsserts and ExpectedResources are mutually exclusive at runtime (refers to the enqueue gate at line 86: if ref.HealthCheckAsserts != "" && len(ref.ExpectedResources) == 0).

Asserts are only enqueued when len(ref.ExpectedResources) == 0, but hydration now stamps registry assert content onto refs that also declare expected resources — so the chainsaw check is silently dropped for them.

In-tree example: k8s-nim-operator has a registry assertFile (recipes/registry.yaml:515-521) and overlay expectedResources (recipes/overlays/h100-eks-ubuntu-inference-nim.yaml:43-55).

Suggested fix: either run both checks, or make assertFile fallback-only explicit and skip hydration when ExpectedResources is present (and document it on the field).
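
To make the gap concrete, here is a small sketch contrasting the quoted enqueue gate with the "run both" alternative. The trimmed `ComponentRef` is hypothetical, and `ExpectedResources` is modeled as a plain string slice purely for illustration.

```go
package main

import "fmt"

// ComponentRef is a trimmed, hypothetical stand-in; the real
// ExpectedResources element type is not modeled here.
type ComponentRef struct {
	Name               string
	HealthCheckAsserts string
	ExpectedResources  []string
}

// enqueuedToday mirrors the gate quoted in the review comment: asserts
// run only when no expected resources are declared.
func enqueuedToday(ref ComponentRef) bool {
	return ref.HealthCheckAsserts != "" && len(ref.ExpectedResources) == 0
}

// enqueuedBoth is the "run both checks" alternative: asserts run
// whenever they are present.
func enqueuedBoth(ref ComponentRef) bool {
	return ref.HealthCheckAsserts != ""
}

func main() {
	ref := ComponentRef{
		Name:               "example-component",
		HealthCheckAsserts: "kind: Test",
		ExpectedResources:  []string{"Deployment/example"},
	}
	fmt.Println("current gate enqueues asserts:", enqueuedToday(ref))
	fmt.Println("run-both gate enqueues asserts:", enqueuedBoth(ref))
}
```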

@mchmarny

mchmarny commented Jun 8, 2026

Member Author

@yuanchen8911 — thanks for the careful cross-review. All three findings addressed in follow-up #1234:

  1. Major (binary gate over-blocks raw YAML): validator now partitions Test-format vs raw-YAML asserts and gates only the Test-format subset on bin.Available(). chainsaw.IsChainsawTest exported for the classifier.
  2. Medium (fallback path executability): NewChainsawBinary now stats /usr/local/bin/chainsaw for an executable regular file after exec.LookPath misses, so Available() is consistent with what RunTest would actually invoke.
  3. Medium (hydration + ExpectedResources mutex): hydrateHealthCheckAsserts now skips when ExpectedResources is non-empty, so the resolved recipe matches what the validator actually runs. PR #1220 (Ship chainsaw binary + wire deployment-phase runner, #660 PR 2) will drop the validator-side mutex and this hydration skip together so both paths run side by side.

Tests added for each. Dismissals confirmed (external read bound + mock surface) — no further action there.

mchmarny added a commit that referenced this pull request Jun 8, 2026
PR #1220 of epic #660. Activates the dormant Chainsaw runner so the
HealthCheckAsserts content hydrated by #1219 / #1234 actually executes
during aicr validate --phase deployment.

Image
-----

- validators/deployment/Dockerfile: new chainsaw-fetch build stage
  downloads the pinned chainsaw release tarball, verifies sha256
  against the value in .settings.yaml, and COPYs the binary into the
  distroless final stage at /usr/local/bin/chainsaw. Single-arch
  (linux/amd64); the linux_arm64 checksum is already carried in
  .settings.yaml so the multi-arch upgrade path is clear.
- Makefile image-validators target reads CHAINSAW_VERSION and
  CHAINSAW_SHA256_LINUX_AMD64 from .settings.yaml via yq and passes
  them as --build-arg only to the deployment phase build (conformance
  and performance images do not use chainsaw).
- go.mod's kyverno/chainsaw v0.2.15 is already in lockstep with the
  .settings.yaml pin.

Envelope
--------

- defaults.JobEnvelopeMargin (60s) added on top of
  defaults.ChainsawAssertTimeout (6m). The catalog timeout for the
  expected-resources validator becomes the Job's activeDeadlineSeconds
  — both must exceed the chainsaw inner timeout so the binary has
  headroom to terminate, clean up its temp dir, and flush logs before
  SIGKILL.
- recipes/validators/catalog.yaml's expected-resources timeout bumped
  5m → 7m. TestExpectedResourcesCatalogEnvelope in
  pkg/validator/catalog asserts the invariant.
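
The envelope arithmetic above (6m inner timeout plus a 60s margin gives a 7m floor) can be expressed as a small check. This is a sketch, not the catalog test itself: the constants below are local copies of the values quoted in the commit message, and `checkEnvelope` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// Local copies of the values quoted in the commit message.
const (
	chainsawAssertTimeout = 6 * time.Minute
	jobEnvelopeMargin     = 60 * time.Second

	oldCatalogTimeout = 5 * time.Minute
	newCatalogTimeout = 7 * time.Minute
)

// checkEnvelope verifies that the Job deadline leaves at least the
// margin on top of the chainsaw inner timeout.
func checkEnvelope(catalogTimeout time.Duration) error {
	floor := chainsawAssertTimeout + jobEnvelopeMargin
	if catalogTimeout < floor {
		return fmt.Errorf("catalog timeout %s is below the required envelope %s", catalogTimeout, floor)
	}
	return nil
}

func main() {
	fmt.Println("5m:", checkEnvelope(oldCatalogTimeout))
	fmt.Println("7m:", checkEnvelope(newCatalogTimeout))
}
```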

Read-only allowlist
-------------------

- validators/chainsaw/ValidateTestReadOnly parses each registry-
  declared Chainsaw Test YAML and rejects any operation outside the
  {assert, error} allowlist. Bounds the cluster-admin RBAC posture of
  the deployment validator Job: registry content cannot invoke
  state-changing (apply/create/delete/patch/update) or side-effecting
  (script/command/wait/sleep/podLogs/events/describe/get) operations.
- Catch / finally / cleanup blocks are equivalently restricted; only
  assert/error appear in any pure read-only Chainsaw Test.
- Test sweeps every recipes/checks/*/health-check.yaml to verify
  compliance. PR #1223 will add the same enforcement at lint time.
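
A simplified sketch of the allowlist idea, assuming the Test YAML has already been decoded into steps whose blocks list operation names. Real Chainsaw operations carry more structure than a name, so the `Step` type, `readOnlyOps`, and `validateReadOnly` below are illustrative stand-ins rather than the ValidateTestReadOnly implementation.

```go
package main

import "fmt"

// Step is a hypothetical, already-decoded view of a Chainsaw test
// step: each block lists only the names of its operations.
type Step struct {
	Try     []string
	Catch   []string
	Finally []string
	Cleanup []string
}

// readOnlyOps is the {assert, error} allowlist from the notes above.
var readOnlyOps = map[string]bool{"assert": true, "error": true}

// validateReadOnly rejects any operation outside the allowlist in any
// block of any step, reporting the first violation it finds.
func validateReadOnly(steps []Step) error {
	for i, s := range steps {
		blocks := []struct {
			name string
			ops  []string
		}{
			{"try", s.Try},
			{"catch", s.Catch},
			{"finally", s.Finally},
			{"cleanup", s.Cleanup},
		}
		for _, b := range blocks {
			for _, op := range b.ops {
				if !readOnlyOps[op] {
					return fmt.Errorf("step %d: %s block uses disallowed operation %q", i, b.name, op)
				}
			}
		}
	}
	return nil
}

func main() {
	ok := []Step{{Try: []string{"assert", "error"}}}
	bad := []Step{{Try: []string{"assert"}, Finally: []string{"script"}}}
	fmt.Println("read-only test:", validateReadOnly(ok))
	fmt.Println("mutating test:", validateReadOnly(bad))
}
```

An allowlist is used instead of a denylist so that an operation type not anticipated here is rejected by default.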

Validator runtime
-----------------

- Drop the &&len(ref.ExpectedResources)==0 gate in
  validators/deployment/expected_resources.go. Both
  ExpectedResources and HealthCheckAsserts paths now run per
  component. CLI output is source-tagged [expectedResources] /
  [chainsaw] so operators can disambiguate when both paths report on
  the same component.
- Hard-remove ChainsawBinary.Available() and the associated PR #1231
  skip path. The binary is now a hard requirement of the deployment
  validator image; a regression that drops it surfaces as a clear
  RunTest error rather than a silent skip.
- pkg/recipe.hydrateHealthCheckAsserts no longer skips when
  ExpectedResources is set — the transitional skip added in #1234 is
  reverted in lockstep with the validator-side gate drop, so the
  artifact matches what runs.

GPU deep-check migration (deferred)
-----------------------------------

- clusterPolicyReady migration to registry-declared Chainsaw YAML is
  deferred to a focused follow-up. The migration must prove assertion
  equivalence against expected_resources_test.go's heavily-mocked
  ClusterPolicy state coverage; bundling that semantic argument here
  would bloat review. Documented in verifyGPUReadinessSignals doc.
- verifyNodewrightReady (formerly skyhookReady) and
  verifyDRAKubeletPluginReady stay in Go: both rely on dynamic names
  (recipe ManifestFiles / release-derived chart fullname) that static
  Chainsaw YAML cannot express without a chart-shape label upstream
  does not currently apply. Issue #660's deployer-neutrality
  constraint prohibits encoding the release-derived form.

Docs
----

docs/contributor/validator.md's chainsaw section now explains the
two-surface model (make check-health vs aicr validate), the runtime
binary path, source-tagged CLI output, and the read-only allowlist.

Live-cluster validation
-----------------------

Required by #660 acceptance criteria; deferred to mchmarny per
out-of-band agreement (no live cluster in PR author's environment).

Closes #1220. Refs #660, #1219, #1234.
mchmarny added a commit that referenced this pull request Jun 8, 2026
PR #1220 of epic #660. Activates the dormant Chainsaw runner so the
HealthCheckAsserts content hydrated by #1219 / #1234 actually executes
during aicr validate --phase deployment.

Image
-----

- validators/deployment/Dockerfile: new chainsaw-fetch multi-stage
  downloads the pinned chainsaw release tarball, verifies sha256
  against the value in .settings.yaml, and COPYs the binary into the
  distroless final stage at /usr/local/bin/chainsaw. Single-arch
  (linux/amd64); the linux_arm64 checksum is already carried in
  .settings.yaml so the multi-arch upgrade path is clear.
- Makefile image-validators target reads CHAINSAW_VERSION and
  CHAINSAW_SHA256_LINUX_AMD64 from .settings.yaml via yq and passes
  them as --build-arg only to the deployment phase build (conformance
  and performance images do not use chainsaw).
- go.mod's kyverno/chainsaw v0.2.15 is already in lockstep with the
  .settings.yaml pin.

Envelope
--------

- defaults.JobEnvelopeMargin (60s) added on top of
  defaults.ChainsawAssertTimeout (6m). The catalog timeout for the
  expected-resources validator becomes the Job's activeDeadlineSeconds
  — both must exceed the chainsaw inner timeout so the binary has
  headroom to terminate, clean up its temp dir, and flush logs before
  SIGKILL.
- recipes/validators/catalog.yaml's expected-resources timeout bumped
  5m → 7m. TestExpectedResourcesCatalogEnvelope in
  pkg/validator/catalog asserts the invariant.

Read-only allowlist
-------------------

- validators/chainsaw/ValidateTestReadOnly parses each registry-
  declared Chainsaw Test YAML and rejects any operation outside the
  {assert, error} allowlist. Bounds the cluster-admin RBAC posture of
  the deployment validator Job: registry content cannot invoke
  state-changing (apply/create/delete/patch/update) or side-effecting
  (script/command/wait/sleep/podLogs/events/describe/get) operations.
- Catch / finally / cleanup blocks are equivalently restricted; only
  assert/error appear in any pure read-only Chainsaw Test.
- Test sweeps every recipes/checks/*/health-check.yaml to verify
  compliance. PR #1223 will add the same enforcement at lint time.

Validator runtime
-----------------

- Drop the &&len(ref.ExpectedResources)==0 gate in
  validators/deployment/expected_resources.go. Both
  ExpectedResources and HealthCheckAsserts paths now run per
  component. CLI output is source-tagged [expectedResources] /
  [chainsaw] so operators can disambiguate when both paths report on
  the same component.
- Hard-remove ChainsawBinary.Available() and the associated PR #1231
  skip path. The binary is now a hard requirement of the deployment
  validator image; a regression that drops it surfaces as a clear
  RunTest error rather than a silent skip.
- pkg/recipe.hydrateHealthCheckAsserts no longer skips when
  ExpectedResources is set — the transitional skip added in #1234 is
  reverted in lockstep with the validator-side gate drop, so the
  artifact matches what runs.

GPU deep-check migration (deferred)
-----------------------------------

- clusterPolicyReady migration to registry-declared Chainsaw YAML is
  deferred to a focused follow-up. The migration must prove assertion
  equivalence against expected_resources_test.go's heavily-mocked
  ClusterPolicy state coverage; bundling that semantic argument here
  would bloat review. Documented in verifyGPUReadinessSignals doc.
- verifyNodewrightReady (formerly skyhookReady) and
  verifyDRAKubeletPluginReady stay in Go: both rely on dynamic names
  (recipe ManifestFiles / release-derived chart fullname) that static
  Chainsaw YAML cannot express without a chart-shape label upstream
  does not currently apply. Issue #660's deployer-neutrality
  constraint prohibits encoding the release-derived form.

Docs
----

docs/contributor/validator.md's chainsaw section now explains the
two-surface model (make check-health vs aicr validate), the runtime
binary path, source-tagged CLI output, and the read-only allowlist.

Live-cluster validation
-----------------------

Required by #660 acceptance criteria; deferred to mchmarny per
out-of-band agreement (no live cluster in PR author's environment).

Closes #1220. Refs #660, #1219, #1234.
mchmarny added a commit that referenced this pull request Jun 8, 2026
PR #1220 of epic #660. Activates the dormant Chainsaw runner so the
HealthCheckAsserts content hydrated by #1219 / #1234 actually executes
during aicr validate --phase deployment.

Image
-----

- validators/deployment/Dockerfile: new chainsaw-fetch multi-stage
  downloads the pinned chainsaw release tarball, verifies sha256
  against the value in .settings.yaml, and COPYs the binary into the
  distroless final stage at /usr/local/bin/chainsaw. Single-arch
  (linux/amd64); the linux_arm64 checksum is already carried in
  .settings.yaml so the multi-arch upgrade path is clear.
- Makefile image-validators target reads CHAINSAW_VERSION and
  CHAINSAW_SHA256_LINUX_AMD64 from .settings.yaml via yq and passes
  them as --build-arg only to the deployment phase build (conformance
  and performance images do not use chainsaw).
- go.mod's kyverno/chainsaw v0.2.15 is already in lockstep with the
  .settings.yaml pin.

Envelope
--------

- defaults.JobEnvelopeMargin (60s) added on top of
  defaults.ChainsawAssertTimeout (6m). The catalog timeout for the
  expected-resources validator becomes the Job's activeDeadlineSeconds
  — both must exceed the chainsaw inner timeout so the binary has
  headroom to terminate, clean up its temp dir, and flush logs before
  SIGKILL.
- recipes/validators/catalog.yaml's expected-resources timeout bumped
  5m → 7m. TestExpectedResourcesCatalogEnvelope in
  pkg/validator/catalog asserts the invariant.

Read-only allowlist
-------------------

- validators/chainsaw/ValidateTestReadOnly parses each registry-
  declared Chainsaw Test YAML and rejects any operation outside the
  {assert, error} allowlist. Bounds the cluster-admin RBAC posture of
  the deployment validator Job: registry content cannot invoke
  state-changing (apply/create/delete/patch/update) or side-effecting
  (script/command/wait/sleep/podLogs/events/describe/get) operations.
- Catch / finally / cleanup blocks are equivalently restricted; only
  assert/error appear in any pure read-only Chainsaw Test.
- Test sweeps every recipes/checks/*/health-check.yaml to verify
  compliance. PR #1223 will add the same enforcement at lint time.

Validator runtime
-----------------

- Drop the &&len(ref.ExpectedResources)==0 gate in
  validators/deployment/expected_resources.go. Both
  ExpectedResources and HealthCheckAsserts paths now run per
  component. CLI output is source-tagged [expectedResources] /
  [chainsaw] so operators can disambiguate when both paths report on
  the same component.
- Hard-remove ChainsawBinary.Available() and the associated PR #1231
  skip path. The binary is now a hard requirement of the deployment
  validator image; a regression that drops it surfaces as a clear
  RunTest error rather than a silent skip.
- pkg/recipe.hydrateHealthCheckAsserts no longer skips when
  ExpectedResources is set — the transitional skip added in #1234 is
  reverted in lockstep with the validator-side gate drop, so the
  artifact matches what runs.

GPU deep-check migration (deferred)
-----------------------------------

- clusterPolicyReady migration to registry-declared Chainsaw YAML is
  deferred to a focused follow-up. The migration must prove assertion
  equivalence against expected_resources_test.go's heavily-mocked
  ClusterPolicy state coverage; bundling that semantic argument here
  would bloat review. Documented in verifyGPUReadinessSignals doc.
- verifyNodewrightReady (formerly skyhookReady) and
  verifyDRAKubeletPluginReady stay in Go: both rely on dynamic names
  (recipe ManifestFiles / release-derived chart fullname) that static
  Chainsaw YAML cannot express without a chart-shape label upstream
  does not currently apply. Issue #660's deployer-neutrality
  constraint prohibits encoding the release-derived form.

Docs
----

docs/contributor/validator.md's chainsaw section now explains the
two-surface model (make check-health vs aicr validate), the runtime
binary path, source-tagged CLI output, and the read-only allowlist.

Live-cluster validation
-----------------------

Required by #660 acceptance criteria; deferred to mchmarny per
out-of-band agreement (no live cluster in PR author's environment).

Closes #1220. Refs #660, #1219, #1234.
mchmarny added a commit that referenced this pull request Jun 8, 2026
PR #1220 of epic #660. Activates the dormant Chainsaw runner so the
HealthCheckAsserts content hydrated by #1219 / #1234 actually executes
during aicr validate --phase deployment.

Image
-----

- validators/deployment/Dockerfile: a new chainsaw-fetch stage in the
  multi-stage build downloads the pinned chainsaw release tarball,
  verifies its sha256 against the value in .settings.yaml, and COPYs
  the binary into the distroless final stage at
  /usr/local/bin/chainsaw. The image is single-arch (linux/amd64); the
  linux_arm64 checksum is already carried in .settings.yaml, so the
  multi-arch upgrade path is clear.
- Makefile image-validators target reads CHAINSAW_VERSION and
  CHAINSAW_SHA256_LINUX_AMD64 from .settings.yaml via yq and passes
  them as --build-arg only to the deployment phase build (conformance
  and performance images do not use chainsaw).
- go.mod's kyverno/chainsaw v0.2.15 is already in lockstep with the
  .settings.yaml pin.
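
The Dockerfile stage itself is not reproduced here. As a minimal Go sketch
of the kind of check it describes (compare a downloaded artifact's sha256
against a pinned value and fail on mismatch), assuming a hypothetical
helper named VerifySHA256 that is not part of this repository:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// VerifySHA256 is a hypothetical helper: it hashes data and compares the
// hex digest with the pinned value, returning an error on mismatch.
func VerifySHA256(data []byte, wantHex string) error {
	sum := sha256.Sum256(data)
	got := hex.EncodeToString(sum[:])
	if !strings.EqualFold(got, strings.TrimSpace(wantHex)) {
		return fmt.Errorf("sha256 mismatch: got %s, want %s", got, wantHex)
	}
	return nil
}

func main() {
	// Stand-in bytes; a real build would hash the downloaded tarball.
	payload := []byte("pretend tarball bytes")
	sum := sha256.Sum256(payload)
	pinned := hex.EncodeToString(sum[:])

	fmt.Println("matching payload:", VerifySHA256(payload, pinned))
	fmt.Println("tampered payload:", VerifySHA256([]byte("tampered"), pinned))
}
```

Failing the build on mismatch, rather than warning, is what makes the pin
in .settings.yaml meaningful.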

Envelope
--------

- defaults.JobEnvelopeMargin (60s) added on top of
  defaults.ChainsawAssertTimeout (6m). The catalog timeout for the
  expected-resources validator becomes the Job's activeDeadlineSeconds
  — both must exceed the chainsaw inner timeout so the binary has
  headroom to terminate, clean up its temp dir, and flush logs before
  SIGKILL.
- recipes/validators/catalog.yaml's expected-resources timeout bumped
  5m → 7m. TestExpectedResourcesCatalogEnvelope in
  pkg/validator/catalog asserts the invariant.
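
The envelope arithmetic above can be sketched as follows. The constant
names mirror the defaults.* identifiers quoted in this message, but this
standalone file, the EnvelopeOK helper, and the exact form of the
inequality (catalog timeout at least inner timeout plus margin) are
assumptions for illustration, not the repository's test:

```go
package main

import (
	"fmt"
	"time"
)

// Values quoted in the commit message.
const (
	ChainsawAssertTimeout = 6 * time.Minute
	JobEnvelopeMargin     = 60 * time.Second
)

// EnvelopeOK (hypothetical) reports whether a catalog timeout, which
// becomes the Job's activeDeadlineSeconds, leaves at least
// JobEnvelopeMargin of headroom beyond the chainsaw inner timeout.
func EnvelopeOK(catalogTimeout time.Duration) bool {
	return catalogTimeout >= ChainsawAssertTimeout+JobEnvelopeMargin
}

func main() {
	for _, d := range []time.Duration{5 * time.Minute, 7 * time.Minute} {
		fmt.Printf("catalog timeout %v: envelope ok = %t\n", d, EnvelopeOK(d))
	}
}
```

Under these assumptions the old 5m value fails the check and the new 7m
value (6m + 60s) is the smallest one that passes.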

Read-only allowlist
-------------------

- validators/chainsaw/ValidateTestReadOnly parses each registry-
  declared Chainsaw Test YAML and rejects any operation outside the
  {assert, error} allowlist. Bounds the cluster-admin RBAC posture of
  the deployment validator Job: registry content cannot invoke
  state-changing (apply/create/delete/patch/update) or side-effecting
  (script/command/wait/sleep/podLogs/events/describe/get) operations.
- Catch / finally / cleanup blocks are equivalently restricted; only
  assert/error appear in any pure read-only Chainsaw Test.
- A test sweeps every recipes/checks/*/health-check.yaml to verify
  compliance. PR #1223 will add the same enforcement at lint time.
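
A minimal sketch of the allowlist idea, assuming a simplified in-memory
model of a Chainsaw Test. The Operation and Step types and the
ValidateReadOnly function below are hypothetical stand-ins; the real
ValidateTestReadOnly parses YAML and its signature is not shown in this
message:

```go
package main

import "fmt"

// allowed holds the only operation kinds a read-only test may use.
var allowed = map[string]bool{"assert": true, "error": true}

// Operation is a simplified stand-in: only the operation kind is modeled.
type Operation struct{ Kind string }

// Step is a simplified stand-in for a test step and its handler blocks.
type Step struct{ Try, Catch, Finally, Cleanup []Operation }

// ValidateReadOnly rejects any operation outside the allowlist, applying
// the same rule to try, catch, finally, and cleanup blocks.
func ValidateReadOnly(steps []Step) error {
	for i, s := range steps {
		blocks := []struct {
			name string
			ops  []Operation
		}{
			{"try", s.Try}, {"catch", s.Catch},
			{"finally", s.Finally}, {"cleanup", s.Cleanup},
		}
		for _, b := range blocks {
			for j, op := range b.ops {
				if !allowed[op.Kind] {
					return fmt.Errorf("step %d %s[%d]: operation %q is outside the read-only allowlist",
						i, b.name, j, op.Kind)
				}
			}
		}
	}
	return nil
}

func main() {
	readOnly := []Step{{Try: []Operation{{Kind: "assert"}, {Kind: "error"}}}}
	mutating := []Step{{Try: []Operation{{Kind: "assert"}}, Catch: []Operation{{Kind: "podLogs"}}}}

	fmt.Println("read-only test:", ValidateReadOnly(readOnly))
	fmt.Println("test with podLogs in catch:", ValidateReadOnly(mutating))
}
```

An allowlist, rather than a denylist of known mutating operations, means
an operation kind introduced by a later Chainsaw release is rejected by
default.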

Validator runtime
-----------------

- Drop the && len(ref.ExpectedResources) == 0 gate in
  validators/deployment/expected_resources.go. Both
  ExpectedResources and HealthCheckAsserts paths now run per
  component. CLI output is source-tagged [expectedResources] /
  [chainsaw] so operators can disambiguate when both paths report on
  the same component.
- Hard-remove ChainsawBinary.Available() and the associated PR #1231
  skip path. The binary is now a hard requirement of the deployment
  validator image; a regression that drops it surfaces as a clear
  RunTest error rather than a silent skip.
- pkg/recipe.hydrateHealthCheckAsserts no longer skips when
  ExpectedResources is set — the transitional skip added in #1234 is
  reverted in lockstep with the validator-side gate drop, so the
  artifact matches what runs.
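
The runtime change above can be sketched as follows: with the gate
dropped, each path runs whenever its data is present on the component,
and each result carries a source tag. The types, the CheckFunc
signature, and the exact output line format are assumptions for
illustration; only the two tag names come from this message:

```go
package main

import (
	"errors"
	"fmt"
)

// ComponentRef is a simplified stand-in carrying only the two fields
// discussed above.
type ComponentRef struct {
	Name               string
	ExpectedResources  []string
	HealthCheckAsserts string
}

// Result records which path produced an outcome for a component.
type Result struct {
	Source    string
	Component string
	Err       error
}

// String renders a source-tagged line (the format is an assumption).
func (r Result) String() string {
	status := "ok"
	if r.Err != nil {
		status = "FAIL: " + r.Err.Error()
	}
	return fmt.Sprintf("[%s] %s: %s", r.Source, r.Component, status)
}

// CheckFunc is a hypothetical hook for one validation path.
type CheckFunc func(ComponentRef) error

// ValidateComponent runs both paths independently; neither suppresses
// the other.
func ValidateComponent(ref ComponentRef, resources, chainsaw CheckFunc) []Result {
	var out []Result
	if len(ref.ExpectedResources) > 0 {
		out = append(out, Result{Source: "expectedResources", Component: ref.Name, Err: resources(ref)})
	}
	if ref.HealthCheckAsserts != "" {
		out = append(out, Result{Source: "chainsaw", Component: ref.Name, Err: chainsaw(ref)})
	}
	return out
}

func main() {
	ref := ComponentRef{
		Name:               "example-component",
		ExpectedResources:  []string{"Deployment/example"},
		HealthCheckAsserts: "placeholder test content",
	}
	pass := func(ComponentRef) error { return nil }
	fail := func(ComponentRef) error { return errors.New("assertion failed") }
	for _, r := range ValidateComponent(ref, pass, fail) {
		fmt.Println(r)
	}
}
```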

GPU deep-check migration (deferred)
-----------------------------------

- clusterPolicyReady migration to registry-declared Chainsaw YAML is
  deferred to a focused follow-up. The migration must prove assertion
  equivalence against expected_resources_test.go's heavily-mocked
  ClusterPolicy state coverage; bundling that semantic argument here
  would bloat review. Documented in verifyGPUReadinessSignals doc.
- verifyNodewrightReady (formerly skyhookReady) and
  verifyDRAKubeletPluginReady stay in Go: both rely on dynamic names
  (recipe ManifestFiles / release-derived chart fullname) that static
  Chainsaw YAML cannot express without a chart-shape label upstream
  does not currently apply. Issue #660's deployer-neutrality
  constraint prohibits encoding the release-derived form.

Docs
----

docs/contributor/validator.md's chainsaw section now explains the
two-surface model (make check-health vs aicr validate), the runtime
binary path, source-tagged CLI output, and the read-only allowlist.

Live-cluster validation
-----------------------

Required by #660 acceptance criteria; deferred to mchmarny per
out-of-band agreement (no live cluster in PR author's environment).

Closes #1220. Refs #660, #1219, #1234.
mchmarny added a commit that referenced this pull request Jun 9, 2026
Development

Successfully merging this pull request may close these issues.

Hydrate healthCheck.assertFile + suppression sentinel (#660 PR 1)

4 participants