Skip to content

feat(validator): ship chainsaw binary; activate deployment-phase runner#1235

Merged
mchmarny merged 1 commit into
mainfrom
feat/1220-chainsaw-deployment-runner
Jun 9, 2026
Merged

feat(validator): ship chainsaw binary; activate deployment-phase runner#1235
mchmarny merged 1 commit into
mainfrom
feat/1220-chainsaw-deployment-runner

Conversation

@mchmarny

@mchmarny mchmarny commented Jun 8, 2026

Copy link
Copy Markdown
Member

Summary

Activate the dormant Chainsaw runner so the HealthCheckAsserts
content hydrated by #1219 / #1234 actually executes during
aicr validate --phase deployment. Ships the chainsaw binary in the
deployment validator image, codifies the Job activeDeadlineSeconds
envelope, enforces a runtime read-only operation allowlist, and lifts
the previous mutex between ExpectedResources and registry-declared
HealthCheckAsserts.

Motivation / Context

Predecessor PR #1219 hydrated registry-declared
healthCheck.assertFile content onto ComponentRef.HealthCheckAsserts,
but execution was gated behind ChainsawBinary.Available() because the
deployment validator image had not yet shipped the binary. This PR
ships the binary, removes the gate, and codifies the surrounding
contracts (timeouts, read-only allowlist, dedup output) so the
deployment phase is now driven by registry-declared Chainsaw content.

Fixes: #1220
Related: #660, #1219, #1234, #1223

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • Recipe engine / data (pkg/recipe)
  • Validator (validators/chainsaw, validators/deployment,
    pkg/validator/catalog)
  • Build/CI/tooling (Makefile, Dockerfile, .settings.yaml consumption)
  • Docs (docs/contributor/validator.md)

Implementation Notes

Image

  • validators/deployment/Dockerfile: new chainsaw-fetch multi-stage
    downloads chainsaw_linux_amd64.tar.gz from the kyverno/chainsaw
    release, verifies the sha256 against .settings.yaml, and COPYs
    the binary into the distroless final stage at
    /usr/local/bin/chainsaw.
  • Makefile image-validators target reads CHAINSAW_VERSION and
    CHAINSAW_SHA256_LINUX_AMD64 from .settings.yaml via yq and
    passes them as --build-arg only to the deployment-phase build
    (conformance + performance images don't use chainsaw).
  • Single-arch (linux/amd64). The linux_arm64 checksum is already in
    .settings.yaml so the multi-arch upgrade path is clear.
  • go.mod's github.com/kyverno/chainsaw v0.2.15 is already in
    lockstep with the .settings.yaml pin (no bump needed).

Envelope

  • New defaults.JobEnvelopeMargin = 60s. Catalog timeout for
    expected-resources becomes the Job's activeDeadlineSeconds;
    both must exceed ChainsawAssertTimeout so chainsaw has headroom
    to terminate, clean up its temp dir, and flush logs before SIGKILL.
  • recipes/validators/catalog.yaml: expected-resources timeout
    bumped 5m → 7m (= 6m + 60s). TestExpectedResourcesCatalogEnvelope
    in pkg/validator/catalog asserts the invariant — raising
    ChainsawAssertTimeout later without bumping the catalog will fail
    the test.

Read-only allowlist

  • New validators/chainsaw.ValidateTestReadOnly parses each
    registry-declared Chainsaw Test YAML and rejects any operation
    outside {assert, error}. State-changing (apply, create, delete,
    patch, update) and side-effecting (script, command, wait, sleep,
    podLogs, events, describe, get) operations are rejected with
    ErrCodeInvalidRequest.
  • Catch / finally / cleanup blocks equivalently restricted.
  • TestValidateTestReadOnly_RegistryContent sweeps every
    recipes/checks/*/health-check.yaml — guarantees no registry-
    declared check violates the contract.
  • Bounds the cluster-admin RBAC posture of the deployment validator
    Job (see pkg/validator/job/rbac.go). PR Registry lint guard for healthCheck.assertFile + allowlist (#660 PR 5) #1223 will add the same
    enforcement at lint time.

Validator runtime

  • Dropped the && len(ref.ExpectedResources) == 0 gate in
    validators/deployment/expected_resources.go. Both paths now run
    per component. Output is source-tagged [expectedResources] /
    [chainsaw].
  • Hard-removed ChainsawBinary.Available() and the associated PR
    feat(recipe): hydrate healthCheck.assertFile + suppression sentinel #1231 skip path (per pre-PR decision). The binary is now a hard
    requirement of the deployment image; a regression that drops it
    surfaces as a clear RunTest error rather than a silent skip.
  • pkg/recipe.hydrateHealthCheckAsserts no longer skips when
    ExpectedResources is set. The transitional skip added in fix(validator): address PR #1231 review on chainsaw hydration runtime #1234 is
    reverted in lockstep with the validator-side gate drop so the
    artifact matches what runs.

GPU deep-check migration (deferred)

  • clusterPolicyReady (issue enhance(validator): add targeted post-deployment GPU readiness checks #611 deep check) migration to
    registry-declared Chainsaw YAML is deferred to a focused
    follow-up. The migration must prove assertion equivalence against
    expected_resources_test.go's heavily-mocked ClusterPolicy state
    coverage; bundling that semantic argument here would bloat review.
    Disposition documented in the new doc on verifyGPUReadinessSignals.
  • verifyNodewrightReady (formerly skyhookReady) stays in Go:
    names are derived dynamically from recipe ManifestFiles at
    validate-time. Static Chainsaw YAML cannot express the
    dynamic-name selector.
  • verifyDRAKubeletPluginReady stays in Go: chart's full DaemonSet
    name is release-derived. Encoding it would violate the
    deployer-neutrality constraint ([EPIC] Reuse Chainsaw health checks in aicr validate --phase deployment #660 issue body).

Testing

go test -race -count=1 ./pkg/recipe/ ./pkg/defaults/ ./pkg/validator/catalog/ ./validators/...
golangci-lint run -c .golangci.yaml ./pkg/recipe/... ./pkg/defaults/... ./pkg/validator/catalog/... ./validators/...

New tests:

  • pkg/defaults.TestJobEnvelopeMarginInvariant — guards the
    envelope contract.
  • pkg/validator/catalog.TestExpectedResourcesCatalogEnvelope
    guards the catalog timeout is >= ChainsawAssertTimeout + JobEnvelopeMargin.
  • validators/chainsaw.TestValidateTestReadOnly (9 sub-cases) +
    TestValidateTestReadOnly_RegistryContent (sweeps in-tree checks).
  • validators/chainsaw.TestNewChainsawBinary (PATH hit + miss).
  • pkg/recipe.TestHydrateHealthCheckAsserts_RunsAlongsideExpectedResources
    — replaces fix(validator): address PR #1231 review on chainsaw hydration runtime #1234's transitional _ExpectedResourcesWins test.

Coverage delta on changed packages:

  • validators/chainsaw: 14.4% → 22.1% (+7.7%).
  • pkg/defaults: 100% (unchanged).
  • pkg/validator/catalog: 97.5%.
  • validators/deployment: 44.1%.

Live-cluster validation required by #660 acceptance criteria —
deferred to @mchmarny per out-of-band agreement (no live cluster in
PR author's environment).

Risk Assessment

  • Medium — Touches the validator image build, drops a runtime
    gate, and adds a runtime allowlist. Easy to revert per-component
    (Dockerfile, catalog.yaml, runtime branches all isolated).

Rollout notes: No CLI-flag changes. Output gains source-tag
prefixes ([expectedResources] / [chainsaw]). Resolved
recipe.yaml artifacts for components with both assertFile AND
ExpectedResources now carry the hydrated chainsaw content
(previously suppressed transitionally). Recipe authors who declared
expectedResources to suppress hydrated content must add
healthCheckSkip: true instead (existing field from #1219). The
chainsaw binary is now required in the deployment validator image —
operators running custom images must rebuild.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (changed paths)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs for user-facing behavior changes
    (docs/contributor/validator.md)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@coderabbitai

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Coverage Report ✅

Metric Value
Coverage 76.2%
Threshold 75%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-76.2%25-green)

Merging this branch will decrease overall coverage

Impacted Packages Coverage Δ 🤖
github.com/NVIDIA/aicr/pkg/defaults 100.00% (ø)
github.com/NVIDIA/aicr/pkg/recipe 86.16% (-0.01%) 👎
github.com/NVIDIA/aicr/validators/chainsaw 0.00% (ø)
github.com/NVIDIA/aicr/validators/deployment 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/NVIDIA/aicr/pkg/defaults/timeouts.go 0.00% (ø) 0 0 0
github.com/NVIDIA/aicr/pkg/recipe/metadata_store.go 86.13% (-0.07%) 382 (-2) 329 (-2) 53 👎
github.com/NVIDIA/aicr/validators/chainsaw/allowlist.go 0.00% (ø) 0 0 0
github.com/NVIDIA/aicr/validators/chainsaw/binary.go 0.00% (ø) 0 0 0
github.com/NVIDIA/aicr/validators/chainsaw/runner.go 0.00% (ø) 0 0 0
github.com/NVIDIA/aicr/validators/deployment/expected_resources.go 0.00% (ø) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@mchmarny mchmarny force-pushed the feat/1220-chainsaw-deployment-runner branch from 9579659 to c6e3949 Compare June 8, 2026 22:42
yuanchen8911

This comment was marked as resolved.

@yuanchen8911 yuanchen8911 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left comments. PTAL.

coderabbitai[bot]

This comment was marked as resolved.

@mchmarny mchmarny force-pushed the feat/1220-chainsaw-deployment-runner branch 2 times, most recently from 601c9a8 to 25af1bc Compare June 8, 2026 23:02
coderabbitai[bot]

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

@mchmarny mchmarny force-pushed the feat/1220-chainsaw-deployment-runner branch from 4d802f6 to 0ff057f Compare June 8, 2026 23:35
@mchmarny mchmarny requested a review from yuanchen8911 June 8, 2026 23:41

@yuanchen8911 yuanchen8911 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge-blocking CI build-arg gaps remain, and the Chainsaw read-only guard still misses top-level/future catch surfaces. README timeout drift is minor.

@yuanchen8911

Copy link
Copy Markdown
Contributor

The three GPU check failures look like an infrastructure/runner flake. Re-running the GPU jobs should clear it.

@mchmarny

mchmarny commented Jun 9, 2026

Copy link
Copy Markdown
Member Author

Tracking the GPU CI flake separately in #1237 — environmental, all 3 failing GPU jobs hit the same nvidia-container-toolkit 1.19.1-1 + containerd-restart fingerprint inside the kind worker. Not in #1235's blast radius (failure occurs in nvkind cluster create's internal apt-install on the worker, before any AICR validator code runs). Same trace appears on main since ~2026-06-07 19:00 UTC.

Live-cluster validation by @njhensley is the load-bearing GPU signal for this PR; the GPU-CI failures don't block that.

yuanchen8911
yuanchen8911 previously approved these changes Jun 9, 2026

@yuanchen8911 yuanchen8911 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the issues are fully resolved. Thanks. /lgtm

The only remaining red checks are the three GPU tests (Inference/Smoke/Training). I re-triggered the failed jobs — hopefully they pass this time.

PR #1220 of epic #660. Activates the dormant Chainsaw runner so the
HealthCheckAsserts content hydrated by #1219 / #1234 actually executes
during aicr validate --phase deployment.

Image
-----

- validators/deployment/Dockerfile: new chainsaw-fetch multi-stage
  downloads the pinned chainsaw release tarball, verifies sha256
  against the value in .settings.yaml, and COPYs the binary into the
  distroless final stage at /usr/local/bin/chainsaw. Single-arch
  (linux/amd64); the linux_arm64 checksum is already carried in
  .settings.yaml so the multi-arch upgrade path is clear.
- Makefile image-validators target reads CHAINSAW_VERSION and
  CHAINSAW_SHA256_LINUX_AMD64 from .settings.yaml via yq and passes
  them as --build-arg only to the deployment phase build (conformance
  and performance images do not use chainsaw).
- go.mod's kyverno/chainsaw v0.2.15 is already in lockstep with the
  .settings.yaml pin.

Envelope
--------

- defaults.JobEnvelopeMargin (60s) added on top of
  defaults.ChainsawAssertTimeout (6m). The catalog timeout for the
  expected-resources validator becomes the Job's activeDeadlineSeconds
  — both must exceed the chainsaw inner timeout so the binary has
  headroom to terminate, clean up its temp dir, and flush logs before
  SIGKILL.
- recipes/validators/catalog.yaml's expected-resources timeout bumped
  5m → 7m. TestExpectedResourcesCatalogEnvelope in
  pkg/validator/catalog asserts the invariant.

Read-only allowlist
-------------------

- validators/chainsaw/ValidateTestReadOnly parses each registry-
  declared Chainsaw Test YAML and rejects any operation outside the
  {assert, error} allowlist. Bounds the cluster-admin RBAC posture of
  the deployment validator Job: registry content cannot invoke
  state-changing (apply/create/delete/patch/update) or side-effecting
  (script/command/wait/sleep/podLogs/events/describe/get) operations.
- Catch / finally / cleanup blocks are equivalently restricted; only
  assert/error appear in any pure read-only Chainsaw Test.
- Test sweeps every recipes/checks/*/health-check.yaml to verify
  compliance. PR #1223 will add the same enforcement at lint time.

Validator runtime
-----------------

- Drop the &&len(ref.ExpectedResources)==0 gate in
  validators/deployment/expected_resources.go. Both
  ExpectedResources and HealthCheckAsserts paths now run per
  component. CLI output is source-tagged [expectedResources] /
  [chainsaw] so operators can disambiguate when both paths report on
  the same component.
- Hard-remove ChainsawBinary.Available() and the associated PR #1231
  skip path. The binary is now a hard requirement of the deployment
  validator image; a regression that drops it surfaces as a clear
  RunTest error rather than a silent skip.
- pkg/recipe.hydrateHealthCheckAsserts no longer skips when
  ExpectedResources is set — the transitional skip added in #1234 is
  reverted in lockstep with the validator-side gate drop, so the
  artifact matches what runs.

GPU deep-check migration (deferred)
-----------------------------------

- clusterPolicyReady migration to registry-declared Chainsaw YAML is
  deferred to a focused follow-up. The migration must prove assertion
  equivalence against expected_resources_test.go's heavily-mocked
  ClusterPolicy state coverage; bundling that semantic argument here
  would bloat review. Documented in verifyGPUReadinessSignals doc.
- verifyNodewrightReady (formerly skyhookReady) and
  verifyDRAKubeletPluginReady stay in Go: both rely on dynamic names
  (recipe ManifestFiles / release-derived chart fullname) that static
  Chainsaw YAML cannot express without a chart-shape label upstream
  does not currently apply. Issue #660's deployer-neutrality
  constraint prohibits encoding the release-derived form.

Docs
----

docs/contributor/validator.md's chainsaw section now explains the
two-surface model (make check-health vs aicr validate), the runtime
binary path, source-tagged CLI output, and the read-only allowlist.

Live-cluster validation
-----------------------

Required by #660 acceptance criteria; deferred to mchmarny per
out-of-band agreement (no live cluster in PR author's environment).

Closes #1220. Refs #660, #1219, #1234.
@mchmarny mchmarny force-pushed the feat/1220-chainsaw-deployment-runner branch from 01091a0 to d7e68a2 Compare June 9, 2026 09:20
@mchmarny mchmarny enabled auto-merge (squash) June 9, 2026 09:22
@mchmarny mchmarny disabled auto-merge June 9, 2026 09:26
@mchmarny mchmarny merged commit 25a6cdd into main Jun 9, 2026
116 of 119 checks passed
@mchmarny mchmarny deleted the feat/1220-chainsaw-deployment-runner branch June 9, 2026 09:33
mchmarny added a commit that referenced this pull request Jun 9, 2026
Closes #1222. With this merged, epic #660 is fully done.

The PR-time runtime contract from #1219/#1220/#1223 is in place; this
PR closes the depth-of-coverage gap on the 21 health checks that
landed before the epic and were shallow by today's standards. Per the
audit in the PR body, three classes of enhancement:

1. DaemonSet partial-readiness fix (2 files):
   aws-efa, nvidia-dra-driver-gpu were gating on `numberReady > 0`,
   which passed on partial failures. Switched to
   `numberReady == desiredNumberScheduled` with
   `desiredNumberScheduled > 0` as the vacuous-pass guard — same
   pattern that fixed NFD in #1234 and the new gke-nccl-tcpxo in
   #1243.

2. Owned-CRD Established=True assertions (4 components):
   - gpu-operator: clusterpolicies.nvidia.com + the singleton
     `cluster-policy` CR with status.state == "ready". This
     MIGRATES the verifyClusterPolicyReady Go check that PR #1235
     task 7 deferred for assertion-equivalence proof; the static
     Chainsaw assertion is now the canonical readiness signal.
   - dynamo-platform: dynamocomponentdeployments.nvidia.com.
   - nodewright-operator: skyhooks.skyhook.nvidia.com.
   - cert-manager: certificates / issuers / clusterissuers
     .cert-manager.io.
   CRD names verified against
   tests/chainsaw/ai-conformance/cluster/assert-crds.yaml and the
   Go verifyClusterPolicyReady check.

3. cert-manager scope fix:
   Previously asserted only the controller Deployment, missing
   cainjector + webhook. The webhook in particular is load-bearing
   for admission of any Certificate/Issuer CR; a healthy controller
   with a dead webhook would silently pass while the cluster rejects
   every cert-manager CR. Now asserts all three Deployments.

4. Universal container-state error coverage (21 files):
   Every check that previously only gated on pod phase (Pending /
   Failed / Unknown) now also rejects pods stuck in Running phase
   with a known-bad container-waiting reason (CrashLoopBackOff,
   ImagePullBackOff, ErrImagePull, CreateContainerConfigError). The
   phase-only filter missed CrashLoopBackOff entirely because such
   pods typically remain in Running phase with unhealthy
   containers. Same pattern NFD adopted via #1234. Label-scoped in
   shared namespaces (kube-system, monitoring) to avoid sibling
   false-matches; namespace-only otherwise.

Also addresses two post-merge review comments on PR #1244:

- validators/chainsaw/allowlist_test.go: provider prefix changed
  from "." to "" to match the canonical defaultEmbeddedProvider
  used at runtime (head-scratch elimination, no behavior change —
  filepath.Join cleans both forms equivalently).
- Same file: added an explicit CROSS-TEST DEPENDENCY comment block
  noting that empty-assertFile and unreadable-path failure modes
  are owned by pkg/recipe.TestComponentRegistry_RequiresHealthCheck.
  If that test is ever removed or weakened, this test must be
  strengthened in lockstep so a broken registry can't silently
  green-light here.

Out of scope (deferred to follow-up work):

- CRD Established=True for the remaining 6 operators that own CRDs
  but where I lack high-confidence enumeration without pulling the
  charts: gatekeeper, kueue, kubeflow-trainer, network-operator,
  k8s-nim-operator, slinky-slurm-operator. Each needs verification
  against its pinned chart version; bundling them blindly would
  risk introducing typo'd names that fail on live clusters.
  Tracking as a possible follow-up after this PR proves the
  audit-and-enhance pattern works end-to-end.
- StatefulSet readiness for kube-prometheus-stack's Prometheus /
  Alertmanager: similar reasoning — the chart's exact StatefulSet
  names and label conventions need verification.

Allowlist sweep (TestValidateTestReadOnly_RegistryContent) +
registry lint guard (TestComponentRegistry_RequiresHealthCheck)
both pass: 27/27 components valid. Live-cluster validation
deferred to mchmarny per the same agreement as #1235 / #1243.

Refs #660 (epic), #1219, #1220, #1221, #1223, #1234.
mchmarny added a commit that referenced this pull request Jun 9, 2026
Closes #1222. With this merged, epic #660 is fully done.

The PR-time runtime contract from #1219/#1220/#1223 is in place; this
PR closes the depth-of-coverage gap on the 21 health checks that
landed before the epic and were shallow by today's standards. Per the
audit in the PR body, three classes of enhancement:

1. DaemonSet partial-readiness fix (2 files):
   aws-efa, nvidia-dra-driver-gpu were gating on `numberReady > 0`,
   which passed on partial failures. Switched to
   `numberReady == desiredNumberScheduled` with
   `desiredNumberScheduled > 0` as the vacuous-pass guard — same
   pattern that fixed NFD in #1234 and the new gke-nccl-tcpxo in
   #1243.

2. Owned-CRD Established=True assertions (4 components):
   - gpu-operator: clusterpolicies.nvidia.com + the singleton
     `cluster-policy` CR with status.state == "ready". This
     MIGRATES the verifyClusterPolicyReady Go check that PR #1235
     task 7 deferred for assertion-equivalence proof; the static
     Chainsaw assertion is now the canonical readiness signal.
   - dynamo-platform: dynamocomponentdeployments.nvidia.com.
   - nodewright-operator: skyhooks.skyhook.nvidia.com.
   - cert-manager: certificates / issuers / clusterissuers
     .cert-manager.io.
   CRD names verified against
   tests/chainsaw/ai-conformance/cluster/assert-crds.yaml and the
   Go verifyClusterPolicyReady check.

3. cert-manager scope fix:
   Previously asserted only the controller Deployment, missing
   cainjector + webhook. The webhook in particular is load-bearing
   for admission of any Certificate/Issuer CR; a healthy controller
   with a dead webhook would silently pass while the cluster rejects
   every cert-manager CR. Now asserts all three Deployments.

4. Universal container-state error coverage (21 files):
   Every check that previously only gated on pod phase (Pending /
   Failed / Unknown) now also rejects pods stuck in Running phase
   with a known-bad container-waiting reason (CrashLoopBackOff,
   ImagePullBackOff, ErrImagePull, CreateContainerConfigError). The
   phase-only filter missed CrashLoopBackOff entirely because such
   pods typically remain in Running phase with unhealthy
   containers. Same pattern NFD adopted via #1234. Label-scoped in
   shared namespaces (kube-system, monitoring) to avoid sibling
   false-matches; namespace-only otherwise.

Also addresses two post-merge review comments on PR #1244:

- validators/chainsaw/allowlist_test.go: provider prefix changed
  from "." to "" to match the canonical defaultEmbeddedProvider
  used at runtime (head-scratch elimination, no behavior change —
  filepath.Join cleans both forms equivalently).
- Same file: added an explicit CROSS-TEST DEPENDENCY comment block
  noting that empty-assertFile and unreadable-path failure modes
  are owned by pkg/recipe.TestComponentRegistry_RequiresHealthCheck.
  If that test is ever removed or weakened, this test must be
  strengthened in lockstep so a broken registry can't silently
  green-light here.

Out of scope (deferred to follow-up work):

- CRD Established=True for the remaining 6 operators that own CRDs
  but where I lack high-confidence enumeration without pulling the
  charts: gatekeeper, kueue, kubeflow-trainer, network-operator,
  k8s-nim-operator, slinky-slurm-operator. Each needs verification
  against its pinned chart version; bundling them blindly would
  risk introducing typo'd names that fail on live clusters.
  Tracking as a possible follow-up after this PR proves the
  audit-and-enhance pattern works end-to-end.
- StatefulSet readiness for kube-prometheus-stack's Prometheus /
  Alertmanager: similar reasoning — the chart's exact StatefulSet
  names and label conventions need verification.

Allowlist sweep (TestValidateTestReadOnly_RegistryContent) +
registry lint guard (TestComponentRegistry_RequiresHealthCheck)
both pass: 27/27 components valid. Live-cluster validation
deferred to mchmarny per the same agreement as #1235 / #1243.

Refs #660 (epic), #1219, #1220, #1221, #1223, #1234.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 9, 2026
Deployment-phase health-check validation regressed in the recent
in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252):

1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY
   non-nil error, so a permanent JMESPath evaluation error was retried
   for the entire assert window (6m) instead of failing fast like the
   old chainsaw binary did. Classify assertion-engine errors as
   ErrCodeInvalidRequest (terminal) and short-circuit the retry loops
   on them. Adds a regression test asserting terminal eval errors
   fail fast (< AssertRetryInterval) rather than retrying to deadline.

2. recipes/checks/*: the (init)containerStatuses[?...] | length(@)
   projection throws 'invalid type for: <nil>' on any pod without
   (init) containers (the common case), feeding defect #1. Coalesce to
   an empty array across all 22 affected component health checks.

3. recipes/checks/dynamo-platform: validate-deployment-exists asserted
   a Deployment named 'dynamo-operator', but the chart prefixes it with
   the release name (dynamo-platform-dynamo-operator-controller-manager),
   so the assert never matched and retried to the deadline. Match by the
   stable app.kubernetes.io/name=dynamo-operator label instead.

Verified on an EKS H100 cluster: deployment phase goes from an 8m
timeout (status=other) to PASSED in ~21s; full deployment 4/4 and
conformance 11/11 green.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 9, 2026
Deployment-phase health-check validation regressed in the recent
in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252):

1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY
   non-nil error, so a permanent JMESPath evaluation error was retried
   for the entire assert window (6m) instead of failing fast like the
   old chainsaw binary did. Classify assertion-engine errors as
   ErrCodeInvalidRequest (terminal) and short-circuit the retry loops
   on them. Adds a regression test asserting terminal eval errors
   fail fast (< AssertRetryInterval) rather than retrying to deadline.

2. recipes/checks/*: the (init)containerStatuses[?...] | length(@)
   projection throws 'invalid type for: <nil>' on any pod without
   (init) containers (the common case), feeding defect #1. Coalesce to
   an empty array across all 22 affected component health checks.

3. recipes/checks/dynamo-platform: validate-deployment-exists asserted
   a Deployment named 'dynamo-operator', but the chart prefixes it with
   the release name (dynamo-platform-dynamo-operator-controller-manager),
   so the assert never matched and retried to the deadline. Match by the
   stable app.kubernetes.io/name=dynamo-operator label instead.

Verified on an EKS H100 cluster: deployment phase goes from an 8m
timeout (status=other) to PASSED in ~21s; full deployment 4/4 and
conformance 11/11 green.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 9, 2026
Deployment-phase health-check validation regressed in the recent
in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252):

1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY
   non-nil error, so a permanent JMESPath evaluation error was retried
   for the entire assert window (6m) instead of failing fast like the
   old chainsaw binary did. Classify assertion-engine errors as
   ErrCodeInvalidRequest (terminal) and short-circuit the retry loops
   on them. Adds a regression test asserting terminal eval errors
   fail fast (< AssertRetryInterval) rather than retrying to deadline.

2. recipes/checks/*: the (init)containerStatuses[?...] | length(@)
   projection throws 'invalid type for: <nil>' on any pod without
   (init) containers (the common case), feeding defect #1. Coalesce to
   an empty array across all 22 affected component health checks.

3. recipes/checks/dynamo-platform: validate-deployment-exists asserted
   a Deployment named 'dynamo-operator', but the chart prefixes it with
   the release name (dynamo-platform-dynamo-operator-controller-manager),
   so the assert never matched and retried to the deadline. Match by the
   stable app.kubernetes.io/name=dynamo-operator label instead.

Verified on an EKS H100 cluster: deployment phase goes from an 8m
timeout (status=other) to PASSED in ~21s; full deployment 4/4 and
conformance 11/11 green.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 9, 2026
Deployment-phase health-check validation regressed in the recent
in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252):

1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY
   non-nil error, so a permanent JMESPath evaluation error was retried
   for the entire assert window (6m) instead of failing fast like the
   old chainsaw binary did. Classify assertion-engine errors as
   ErrCodeInvalidRequest (terminal) and short-circuit the retry loops
   on them. Adds a regression test asserting terminal eval errors
   fail fast (< AssertRetryInterval) rather than retrying to deadline.

2. recipes/checks/*: the (init)containerStatuses[?...] | length(@)
   projection throws 'invalid type for: <nil>' on any pod without
   (init) containers (the common case), feeding defect #1. Coalesce to
   an empty array across all 22 affected component health checks.

3. recipes/checks/dynamo-platform: validate-deployment-exists asserted
   a Deployment named 'dynamo-operator', but the chart prefixes it with
   the release name (dynamo-platform-dynamo-operator-controller-manager),
   so the assert never matched and retried to the deadline. Match by the
   stable app.kubernetes.io/name=dynamo-operator label instead.

Verified on an EKS H100 cluster: deployment phase goes from an 8m
timeout (status=other) to PASSED in ~21s; full deployment 4/4 and
conformance 11/11 green.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 9, 2026
Deployment-phase health-check validation regressed in the recent
in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252):

1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY
   non-nil error, so a permanent JMESPath evaluation error was retried
   for the entire assert window (6m) instead of failing fast like the
   old chainsaw binary did. Classify assertion-engine errors as
   ErrCodeInvalidRequest (terminal) and short-circuit the retry loops
   on them. Adds a regression test asserting terminal eval errors
   fail fast (< AssertRetryInterval) rather than retrying to deadline.

2. recipes/checks/*: the (init)containerStatuses[?...] | length(@)
   projection throws 'invalid type for: <nil>' on any pod without
   (init) containers (the common case), feeding defect #1. Coalesce to
   an empty array across all 22 affected component health checks.

3. recipes/checks/dynamo-platform: validate-deployment-exists asserted
   a Deployment named 'dynamo-operator', but the chart prefixes it with
   the release name (dynamo-platform-dynamo-operator-controller-manager),
   so the assert never matched and retried to the deadline. Match by the
   stable app.kubernetes.io/name=dynamo-operator label instead.

Verified on an EKS H100 cluster: deployment phase goes from an 8m
timeout (status=other) to PASSED in ~21s; full deployment 4/4 and
conformance 11/11 green.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Ship chainsaw binary + wire deployment-phase runner (#660 PR 2)

3 participants