Skip to content

feat(validators): replace chainsaw binary with in-process executor#1252

Merged
mchmarny merged 6 commits into
mainfrom
feat/1236-pure-go-chainsaw
Jun 9, 2026
Merged

feat(validators): replace chainsaw binary with in-process executor#1252
mchmarny merged 6 commits into
mainfrom
feat/1236-pure-go-chainsaw

Conversation

@mchmarny

@mchmarny mchmarny commented Jun 9, 2026

Copy link
Copy Markdown
Member

Summary

The deployment validator no longer exec's /usr/local/bin/chainsaw to run registry-declared health checks. A pure-Go executor walks the same chainsaw.kyverno.io/v1alpha1 Test AST and dispatches assert / error to kyverno-json's checks.Check engine — the exact subset the read-only allowlist already restricts content to.

Motivation / Context

The deployment validator image shipped a pinned chainsaw binary and a Dockerfile chainsaw-fetch multi-stage just to invoke two operations (assert / error) that the in-process kyverno-json engine already implements. Dropping the binary eliminates a CHAINSAW_* build-arg matrix across five workflows, removes a release-tarball download from the image build, and shrinks the validator's supply-chain footprint to a single static Go binary.

Fixes: #1236
Related: #1219, #1220, #1222, #1223

Type of Change

  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • Validator (pkg/validator)
  • Build/CI

Implementation Notes

  • validators/chainsaw/inprocess.go: unmarshals a v1alpha1 Test, walks spec.steps[].try[], honors per-test spec.timeouts.assert, retries until deadline, dispatches Assert / Error to checks.Check with the project's apis.DefaultCompilers.
  • validators/chainsaw/fetch.go: clusterFetcher gains List() via the dynamic client + label-selector composition so Error operations with label selectors work.
  • validators/chainsaw/runner.go: package doc rewritten for the in-process model; runConfig / RunOption / WithChainsawBinary removed.
  • validators/chainsaw/binary.go + binary_test.go: deleted.
  • validators/chainsaw/inprocess_test.go: parity sweep across all 27 in-tree recipes/checks/*/health-check.yaml files plus unit cases for happy / assert-fails / error-fires.
  • validators/deployment/Dockerfile: drops the chainsaw-fetch multi-stage.
  • Makefile + five workflows: drop CHAINSAW_VERSION / CHAINSAW_SHA256_* build-args. cmd/gate/Dockerfile and .github/actions/e2e still consume them for their own chainsaw runtime — unchanged.
  • docs/contributor/validator.md: documents the in-process executor.

Testing

make qualify  # passed

Parity tests in validators/chainsaw/inprocess_test.go cover all 27 in-tree health-check YAMLs (each wrapped in a 200ms context to short-circuit retry loops without a live cluster).

Risk Assessment

  • Medium — Touches the deployment validator's execution model; image build matrix changes ripple across 5 workflows.

Rollout notes: No CLI / API surface change. Allowlist already restricted registry content to assert/error before this PR, so existing recipes/checks/*/health-check.yaml content is byte-for-byte compatible with the in-process executor. The build-gate image still bundles chainsaw for its own runtime (separate code path).

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

Implements issue #1236. The deployment validator no longer exec's
/usr/local/bin/chainsaw to run registry-declared health checks; the
allowlist already restricts content to read-only `assert` / `error`
operations, which is exactly the subset kyverno-json's
`checks.Check` engine implements.

- validators/chainsaw/inprocess.go: pure-Go executor that unmarshals
  a chainsaw.kyverno.io/v1alpha1 Test, walks spec.steps[].try[],
  and dispatches assert/error to checks.Check. Honors per-test
  spec.timeouts.assert override.
- validators/chainsaw/fetch.go: clusterFetcher gains List() via the
  dynamic client + label-selector composition.
- validators/chainsaw/binary.go + binary_test.go: deleted.
- validators/chainsaw/runner.go: drops runConfig / RunOption /
  WithChainsawBinary; package doc rewritten for the in-process model.
- validators/chainsaw/inprocess_test.go: parity against the 27 in-tree
  health-check YAMLs; happy / assert-fails / error-fires unit cases.
- validators/deployment/Dockerfile: drops the chainsaw-fetch
  multi-stage; final image no longer ships the chainsaw binary.
- Makefile + .github/workflows/{on-push,on-tag,uat-aws,uat-gcp,
  vuln-scan-images}.yaml: drop CHAINSAW_VERSION / CHAINSAW_SHA256_*
  build-args wired into the deployment image. build-gate still
  consumes them for its own chainsaw runtime — unchanged.
- docs/contributor/validator.md: documents the in-process executor.
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Coverage Report ✅

Metric Value
Coverage 76.3%
Threshold 75%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-76.3%25-green)

Coverage unchanged by this PR.

@coderabbitai

This comment was marked as resolved.

coderabbitai[bot]

This comment was marked as resolved.

mchmarny and others added 2 commits June 9, 2026 06:43
- inprocess.go: replace remaining fmt.Errorf returns with pkg/errors
  (ErrCodeInternal / ErrCodeNotFound as appropriate) so error codes
  flow through wrap chains instead of being flattened to internal at
  the top of runChainsawTestInProcess.
- inprocess.go: add ctx.Err() checks at the top of the non-retry
  loops in runChainsawTestInProcess, executeStepInProcess, and the
  list-iteration paths of evaluateAssert / evaluateError so
  cancellation short-circuits even when no retry select runs.
- inprocess_test.go: convert the three scenario tests into a single
  table-driven TestRunChainsawTestInProcess and add a fourth case
  exercising metadata.labels selector filtering.
- inprocess_test.go: fakeFetcher.List now honors the label selector
  (compares against metadata.labels on each item) so selector-path
  regressions are catchable.
- inprocess_test.go: corpus parity test now fails on any
  non-StructuredError surfaced by the executor (previously silently
  ignored), so unexpected error types can't mask regressions.
- docs/contributor/validator.md: clarify the runtime split between
  the two chainsaw surfaces — `make check-health` shells out to the
  developer-installed chainsaw CLI, while `aicr validate --phase
  deployment` executes in-process via inprocess.go.
ArangoGutierrez

This comment was marked as resolved.

- fetch.go: distinguish a true 404 from any other apiserver failure.
  Previously every Get error was wrapped as ErrCodeNotFound, so a
  transient 5xx / timeout / forbidden was indistinguishable from the
  resource being absent — a chainsaw `error:` block then treated it
  as the happy path and silently passed (fails open). Use
  apierrors.IsNotFound to gate; non-404 errors now wrap as
  ErrCodeUnavailable so negative health checks fail closed,
  matching the chainsaw binary's prior behavior.

- inprocess.go: evaluateError name-branch only short-circuits to nil
  on a true NotFound (checked via stderrors.Is against the
  StructuredError code); other errors propagate so the check fails.

- inprocess.go: stop double-wrapping structured errors from
  fetcher.List / fetcher.Fetch in evaluateAssert and evaluateError —
  the inner errors already carry the correct code, and re-wrapping
  with ErrCodeInternal flattens ErrCodeNotFound / ErrCodeUnavailable
  / ErrCodeInvalidRequest at the API surface. Per CLAUDE.md's
  "don't double-wrap errors that already have proper codes".

- inprocess.go: also remove the executeStepInProcess and
  runChainsawTestInProcess outer wraps that were rewriting the same
  inner code. Step / component context is captured in the slog line
  on the failure path.

- inprocess.go: cap the whole Test under a single
  context.WithTimeout(ctx, effectiveTimeout) so an N-step Test can
  no longer run N × effectiveTimeout in the unhealthy / retrying
  case — restores the outer cap the chainsaw binary path had via
  ChainsawAssertTimeout.

- inprocess.go: fix the executeStepInProcess doc — ops in a step
  share one deadline (computed at step entry), not a per-op clock,
  and only timeouts.assert is read (timeouts.error is ignored;
  none of the in-tree checks set it).

- fetch_test.go (new): direct unit coverage for clusterFetcher
  against k8s.io/client-go/dynamic/fake. Cases: namespaced + cluster-
  scoped Get; NotFound, invalid apiVersion, unknown kind error codes;
  transient apiserver 503 → ErrCodeUnavailable (the failing-closed
  regression guard for the chainsaw `error:` happy path); namespaced
  List with no selector, with single-key selector, with multi-key
  selector; cluster-scoped List; List error-code paths.

- go.mod / vendor: add k8s.io/client-go/dynamic/fake for the tests.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
validators/chainsaw/inprocess.go (1)

165-173: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast on non-retryable ErrCodeInvalidRequest errors.

Both retry loops keep retrying every failure until deadline. If evaluateAssert/evaluateError returns ErrCodeInvalidRequest (bad Test spec), the executor burns the full timeout instead of failing immediately.

Suggested fix
 func runAssertWithRetry(ctx context.Context, a *v1alpha1.Assert, fetcher ResourceFetcher, deadline time.Time) error {
 	var lastErr error
 	for {
 		lastErr = evaluateAssert(ctx, a, fetcher)
 		if lastErr == nil {
 			return nil
 		}
+		var se *errors.StructuredError
+		if stderrors.As(lastErr, &se) && se.Code == errors.ErrCodeInvalidRequest {
+			return lastErr
+		}
 		remaining := time.Until(deadline)
 		if remaining <= 0 {
 			return lastErr
 		}
@@
 func runErrorWithRetry(ctx context.Context, e *v1alpha1.Error, fetcher ResourceFetcher, deadline time.Time) error {
 	var lastErr error
 	for {
 		lastErr = evaluateError(ctx, e, fetcher)
 		if lastErr == nil {
 			return nil
 		}
+		var se *errors.StructuredError
+		if stderrors.As(lastErr, &se) && se.Code == errors.ErrCodeInvalidRequest {
+			return lastErr
+		}
 		remaining := time.Until(deadline)
 		if remaining <= 0 {
 			return lastErr
 		}

Also applies to: 193-200

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@validators/chainsaw/inprocess.go` around lines 165 - 173, The retry loops
that call evaluateAssert and evaluateError must fail fast when the returned
error indicates a non-retryable bad test spec (ErrCodeInvalidRequest); update
both loops (the one around evaluateAssert and the one around evaluateError) to
check the returned error immediately after the call and if it represents
ErrCodeInvalidRequest (use the existing error sentinel/type check your codebase
uses, e.g. errors.Is(lastErr, ErrCodeInvalidRequest) or inspecting the error
code) return that error immediately instead of continuing to retry until the
deadline.
validators/chainsaw/fetch.go (1)

115-119: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use ErrCodeUnavailable for apiserver List failures.

At Line 117, resource.List transport/apiserver failures are wrapped as ErrCodeInternal, while the analogous Fetch path already classifies non-404 API failures as ErrCodeUnavailable (Lines 72-73). This can misclassify transient cluster outages upstream.

Suggested fix
-	return nil, errors.Wrap(errors.ErrCodeInternal,
+	return nil, errors.Wrap(errors.ErrCodeUnavailable,
 		fmt.Sprintf("failed to list %s in namespace %q", gvk, namespace), err)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@validators/chainsaw/fetch.go` around lines 115 - 119, The List call error is
being wrapped with errors.ErrCodeInternal; change it to use
errors.ErrCodeUnavailable for apiserver/transport failures to match the Fetch
path behavior. Update the error wrapping around resource.List(ctx, opts) so that
failures return errors.Wrap(errors.ErrCodeUnavailable, ...) instead of
ErrCodeInternal, keeping the same message ("failed to list %s in namespace %q")
and preserving the wrapped err; reference symbols: resource.List,
ErrCodeInternal, ErrCodeUnavailable, Fetch, gvk, namespace.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@validators/chainsaw/fetch.go`:
- Around line 115-119: The List call error is being wrapped with
errors.ErrCodeInternal; change it to use errors.ErrCodeUnavailable for
apiserver/transport failures to match the Fetch path behavior. Update the error
wrapping around resource.List(ctx, opts) so that failures return
errors.Wrap(errors.ErrCodeUnavailable, ...) instead of ErrCodeInternal, keeping
the same message ("failed to list %s in namespace %q") and preserving the
wrapped err; reference symbols: resource.List, ErrCodeInternal,
ErrCodeUnavailable, Fetch, gvk, namespace.

In `@validators/chainsaw/inprocess.go`:
- Around line 165-173: The retry loops that call evaluateAssert and
evaluateError must fail fast when the returned error indicates a non-retryable
bad test spec (ErrCodeInvalidRequest); update both loops (the one around
evaluateAssert and the one around evaluateError) to check the returned error
immediately after the call and if it represents ErrCodeInvalidRequest (use the
existing error sentinel/type check your codebase uses, e.g. errors.Is(lastErr,
ErrCodeInvalidRequest) or inspecting the error code) return that error
immediately instead of continuing to retry until the deadline.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 9262e585-929c-4279-9875-1494ceaa6129

📥 Commits

Reviewing files that changed from the base of the PR and between c94f051 and 8e1e714.

⛔ Files ignored due to path filters (2)
  • vendor/k8s.io/client-go/dynamic/fake/simple.go is excluded by !vendor/**
  • vendor/modules.txt is excluded by !vendor/**
📒 Files selected for processing (3)
  • validators/chainsaw/fetch.go
  • validators/chainsaw/fetch_test.go
  • validators/chainsaw/inprocess.go

@ArangoGutierrez ArangoGutierrez left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed at 8e1e7143ca. All five points are addressed and the fixes hold up:

  • Named error fail-openFetch now splits apierrors.IsNotFoundErrCodeNotFound from everything else → ErrCodeUnavailable, and evaluateError only treats a true NotFound as the happy path (stderrors.Is), otherwise it propagates and retries. Fails closed on transient apiserver errors now, matching the old binary.
  • Timeout capcontext.WithTimeout(ctx, effectiveTimeout) bounds the whole Test, so the N × effectiveTimeout blowup in the retrying case is gone.
  • Error-code wrapping — the ErrCodeInternal wraps around the fetcher calls are all gone; codes propagate untouched through executeStepInProcess and runChainsawTestInProcess. Went a bit further than I'd flagged.
  • List coveragefetch_test.go is solid: table-driven over namespaced + cluster-scoped, label filtering, and the error-code mappings, plus a dedicated reactor test that ServiceUnavailable maps to ErrCodeUnavailable — that one pins the fail-closed fix directly. Real assertions against the real dynamic/fake client, no theater.
  • Per-op comment — now accurately describes the shared per-step deadline and the ignored timeouts.error.

One non-blocking leftover: pkg/defaults/timeouts.go still describes ChainsawAssertTimeout as the outer timeout for the chainsaw binary process that must exceed the 5m internal assert timeout. There's no binary now, and the outer cap is effectiveTimeout — which for every in-tree check is timeouts.assert (5m), so the 6m constant only kicks in as a fallback when a Test omits timeouts.assert. Net budget for a multi-step Test is ~5m vs the old ~6m, fine for these checks. Worth a one-line comment refresh next time that file is touched.

LGTM with that noted. Good turnaround.

mchmarny added 2 commits June 9, 2026 08:27
Unconditionally capture /etc/containerd/config.toml and
/etc/containerd/conf.d/99-nvidia.toml from every kind node container
into the debug artifact tarball, even when
COLLECT_NODE_RUNTIME_ARTIFACTS=false (the GPU jobs' current default).
Cheap (two small files per node) and targets the
nvidia-container-toolkit 1.19.1 vs kind worker containerd 2.3.1
schema-drift hypothesis under investigation in #1237 — the failing
restart writes the latter file via nvidia-ctk, and we currently have
no visibility into its contents post-mortem.
…runner

The constant is no longer the outer cap for an external chainsaw
binary process — #1236 replaced that path with the in-process
runner, which caps each Test under
context.WithTimeout(ctx, effectiveTimeout) where effectiveTimeout
is the YAML's spec.timeouts.assert if set, otherwise this 6m
fallback. Note also that every in-tree check sets timeouts.assert,
so this default only applies to Tests that omit it.
@mchmarny

mchmarny commented Jun 9, 2026

Copy link
Copy Markdown
Member Author

Thanks for the thorough re-review, @ArangoGutierrez — picked up the ChainsawAssertTimeout comment refresh in bb3fd047. Rewrote it to describe its actual role as the fallback per-Test budget when YAML omits spec.timeouts.assert, and called out that #1236 removed the binary the prior comment referenced.

Holding off on merging for one more cycle — added a small unconditional dump of /etc/containerd/config.toml and /etc/containerd/conf.d/99-nvidia.toml to collect-debug-artifacts.sh in 9ecbf064 to capture the toml that nvidia-ctk emits, since the next GPU run will help unblock #1237.

coderabbitai[bot]

This comment was marked as resolved.

@mchmarny mchmarny merged commit f6cb3cd into main Jun 9, 2026
31 of 34 checks passed
@mchmarny mchmarny deleted the feat/1236-pure-go-chainsaw branch June 9, 2026 17:13
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 9, 2026
Deployment-phase health-check validation regressed in the recent
in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252):

1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY
   non-nil error, so a permanent JMESPath evaluation error was retried
   for the entire assert window (6m) instead of failing fast like the
   old chainsaw binary did. Classify assertion-engine errors as
   ErrCodeInvalidRequest (terminal) and short-circuit the retry loops
   on them. Adds a regression test asserting terminal eval errors
   fail fast (< AssertRetryInterval) rather than retrying to deadline.

2. recipes/checks/*: the (init)containerStatuses[?...] | length(@)
   projection throws 'invalid type for: <nil>' on any pod without
   (init) containers (the common case), feeding defect #1. Coalesce to
   an empty array across all 22 affected component health checks.

3. recipes/checks/dynamo-platform: validate-deployment-exists asserted
   a Deployment named 'dynamo-operator', but the chart prefixes it with
   the release name (dynamo-platform-dynamo-operator-controller-manager),
   so the assert never matched and retried to the deadline. Match by the
   stable app.kubernetes.io/name=dynamo-operator label instead.

Verified on an EKS H100 cluster: deployment phase goes from an 8m
timeout (status=other) to PASSED in ~21s; full deployment 4/4 and
conformance 11/11 green.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 9, 2026
Deployment-phase health-check validation regressed in the recent
in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252):

1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY
   non-nil error, so a permanent JMESPath evaluation error was retried
   for the entire assert window (6m) instead of failing fast like the
   old chainsaw binary did. Classify assertion-engine errors as
   ErrCodeInvalidRequest (terminal) and short-circuit the retry loops
   on them. Adds a regression test asserting terminal eval errors
   fail fast (< AssertRetryInterval) rather than retrying to deadline.

2. recipes/checks/*: the (init)containerStatuses[?...] | length(@)
   projection throws 'invalid type for: <nil>' on any pod without
   (init) containers (the common case), feeding defect #1. Coalesce to
   an empty array across all 22 affected component health checks.

3. recipes/checks/dynamo-platform: validate-deployment-exists asserted
   a Deployment named 'dynamo-operator', but the chart prefixes it with
   the release name (dynamo-platform-dynamo-operator-controller-manager),
   so the assert never matched and retried to the deadline. Match by the
   stable app.kubernetes.io/name=dynamo-operator label instead.

Verified on an EKS H100 cluster: deployment phase goes from an 8m
timeout (status=other) to PASSED in ~21s; full deployment 4/4 and
conformance 11/11 green.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 9, 2026
Deployment-phase health-check validation regressed in the recent
in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252):

1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY
   non-nil error, so a permanent JMESPath evaluation error was retried
   for the entire assert window (6m) instead of failing fast like the
   old chainsaw binary did. Classify assertion-engine errors as
   ErrCodeInvalidRequest (terminal) and short-circuit the retry loops
   on them. Adds a regression test asserting terminal eval errors
   fail fast (< AssertRetryInterval) rather than retrying to deadline.

2. recipes/checks/*: the (init)containerStatuses[?...] | length(@)
   projection throws 'invalid type for: <nil>' on any pod without
   (init) containers (the common case), feeding defect #1. Coalesce to
   an empty array across all 22 affected component health checks.

3. recipes/checks/dynamo-platform: validate-deployment-exists asserted
   a Deployment named 'dynamo-operator', but the chart prefixes it with
   the release name (dynamo-platform-dynamo-operator-controller-manager),
   so the assert never matched and retried to the deadline. Match by the
   stable app.kubernetes.io/name=dynamo-operator label instead.

Verified on an EKS H100 cluster: deployment phase goes from an 8m
timeout (status=other) to PASSED in ~21s; full deployment 4/4 and
conformance 11/11 green.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 9, 2026
Deployment-phase health-check validation regressed in the recent
in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252):

1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY
   non-nil error, so a permanent JMESPath evaluation error was retried
   for the entire assert window (6m) instead of failing fast like the
   old chainsaw binary did. Classify assertion-engine errors as
   ErrCodeInvalidRequest (terminal) and short-circuit the retry loops
   on them. Adds a regression test asserting terminal eval errors
   fail fast (< AssertRetryInterval) rather than retrying to deadline.

2. recipes/checks/*: the (init)containerStatuses[?...] | length(@)
   projection throws 'invalid type for: <nil>' on any pod without
   (init) containers (the common case), feeding defect #1. Coalesce to
   an empty array across all 22 affected component health checks.

3. recipes/checks/dynamo-platform: validate-deployment-exists asserted
   a Deployment named 'dynamo-operator', but the chart prefixes it with
   the release name (dynamo-platform-dynamo-operator-controller-manager),
   so the assert never matched and retried to the deadline. Match by the
   stable app.kubernetes.io/name=dynamo-operator label instead.

Verified on an EKS H100 cluster: deployment phase goes from an 8m
timeout (status=other) to PASSED in ~21s; full deployment 4/4 and
conformance 11/11 green.
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Jun 9, 2026
Deployment-phase health-check validation regressed in the recent
in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252):

1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY
   non-nil error, so a permanent JMESPath evaluation error was retried
   for the entire assert window (6m) instead of failing fast like the
   old chainsaw binary did. Classify assertion-engine errors as
   ErrCodeInvalidRequest (terminal) and short-circuit the retry loops
   on them. Adds a regression test asserting terminal eval errors
   fail fast (< AssertRetryInterval) rather than retrying to deadline.

2. recipes/checks/*: the (init)containerStatuses[?...] | length(@)
   projection throws 'invalid type for: <nil>' on any pod without
   (init) containers (the common case), feeding defect #1. Coalesce to
   an empty array across all 22 affected component health checks.

3. recipes/checks/dynamo-platform: validate-deployment-exists asserted
   a Deployment named 'dynamo-operator', but the chart prefixes it with
   the release name (dynamo-platform-dynamo-operator-controller-manager),
   so the assert never matched and retried to the deadline. Match by the
   stable app.kubernetes.io/name=dynamo-operator label instead.

Verified on an EKS H100 cluster: deployment phase goes from an 8m
timeout (status=other) to PASSED in ~21s; full deployment 4/4 and
conformance 11/11 green.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Replace chainsaw binary with in-process executor for assert/error allowlist subset

3 participants