feat(validators): replace chainsaw binary with in-process executor#1252
Conversation
Implements issue #1236. The deployment validator no longer exec's /usr/local/bin/chainsaw to run registry-declared health checks; the allowlist already restricts content to read-only `assert` / `error` operations, which is exactly the subset kyverno-json's `checks.Check` engine implements. - validators/chainsaw/inprocess.go: pure-Go executor that unmarshals a chainsaw.kyverno.io/v1alpha1 Test, walks spec.steps[].try[], and dispatches assert/error to checks.Check. Honors per-test spec.timeouts.assert override. - validators/chainsaw/fetch.go: clusterFetcher gains List() via the dynamic client + label-selector composition. - validators/chainsaw/binary.go + binary_test.go: deleted. - validators/chainsaw/runner.go: drops runConfig / RunOption / WithChainsawBinary; package doc rewritten for the in-process model. - validators/chainsaw/inprocess_test.go: parity against the 27 in-tree health-check YAMLs; happy / assert-fails / error-fires unit cases. - validators/deployment/Dockerfile: drops the chainsaw-fetch multi-stage; final image no longer ships the chainsaw binary. - Makefile + .github/workflows/{on-push,on-tag,uat-aws,uat-gcp, vuln-scan-images}.yaml: drop CHAINSAW_VERSION / CHAINSAW_SHA256_* build-args wired into the deployment image. build-gate still consumes them for its own chainsaw runtime — unchanged. - docs/contributor/validator.md: documents the in-process executor.
Coverage Report ✅
Coverage BadgeCoverage unchanged by this PR. |
This comment was marked as resolved.
This comment was marked as resolved.
- inprocess.go: replace remaining fmt.Errorf returns with pkg/errors (ErrCodeInternal / ErrCodeNotFound as appropriate) so error codes flow through wrap chains instead of being flattened to internal at the top of runChainsawTestInProcess. - inprocess.go: add ctx.Err() checks at the top of the non-retry loops in runChainsawTestInProcess, executeStepInProcess, and the list-iteration paths of evaluateAssert / evaluateError so cancellation short-circuits even when no retry select runs. - inprocess_test.go: convert the three scenario tests into a single table-driven TestRunChainsawTestInProcess and add a fourth case exercising metadata.labels selector filtering. - inprocess_test.go: fakeFetcher.List now honors the label selector (compares against metadata.labels on each item) so selector-path regressions are catchable. - inprocess_test.go: corpus parity test now fails on any non-StructuredError surfaced by the executor (previously silently ignored), so unexpected error types can't mask regressions. - docs/contributor/validator.md: clarify the runtime split between the two chainsaw surfaces — `make check-health` shells out to the developer-installed chainsaw CLI, while `aicr validate --phase deployment` executes in-process via inprocess.go.
- fetch.go: distinguish a true 404 from any other apiserver failure. Previously every Get error was wrapped as ErrCodeNotFound, so a transient 5xx / timeout / forbidden was indistinguishable from the resource being absent — a chainsaw `error:` block then treated it as the happy path and silently passed (fails open). Use apierrors.IsNotFound to gate; non-404 errors now wrap as ErrCodeUnavailable so negative health checks fail closed, matching the chainsaw binary's prior behavior. - inprocess.go: evaluateError name-branch only short-circuits to nil on a true NotFound (checked via stderrors.Is against the StructuredError code); other errors propagate so the check fails. - inprocess.go: stop double-wrapping structured errors from fetcher.List / fetcher.Fetch in evaluateAssert and evaluateError — the inner errors already carry the correct code, and re-wrapping with ErrCodeInternal flattens ErrCodeNotFound / ErrCodeUnavailable / ErrCodeInvalidRequest at the API surface. Per CLAUDE.md's "don't double-wrap errors that already have proper codes". - inprocess.go: also remove the executeStepInProcess and runChainsawTestInProcess outer wraps that were rewriting the same inner code. Step / component context is captured in the slog line on the failure path. - inprocess.go: cap the whole Test under a single context.WithTimeout(ctx, effectiveTimeout) so an N-step Test can no longer run N × effectiveTimeout in the unhealthy / retrying case — restores the outer cap the chainsaw binary path had via ChainsawAssertTimeout. - inprocess.go: fix the executeStepInProcess doc — ops in a step share one deadline (computed at step entry), not a per-op clock, and only timeouts.assert is read (timeouts.error is ignored; none of the in-tree checks set it). - fetch_test.go (new): direct unit coverage for clusterFetcher against k8s.io/client-go/dynamic/fake. Cases: namespaced + cluster- scoped Get; NotFound, invalid apiVersion, unknown kind error codes; transient apiserver 503 → ErrCodeUnavailable (the failing-closed regression guard for the chainsaw `error:` happy path); namespaced List with no selector, with single-key selector, with multi-key selector; cluster-scoped List; List error-code paths. - go.mod / vendor: add k8s.io/client-go/dynamic/fake for the tests.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
validators/chainsaw/inprocess.go (1)
165-173:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFail fast on non-retryable
ErrCodeInvalidRequesterrors.Both retry loops keep retrying every failure until deadline. If
evaluateAssert/evaluateErrorreturnsErrCodeInvalidRequest(bad Test spec), the executor burns the full timeout instead of failing immediately.Suggested fix
func runAssertWithRetry(ctx context.Context, a *v1alpha1.Assert, fetcher ResourceFetcher, deadline time.Time) error { var lastErr error for { lastErr = evaluateAssert(ctx, a, fetcher) if lastErr == nil { return nil } + var se *errors.StructuredError + if stderrors.As(lastErr, &se) && se.Code == errors.ErrCodeInvalidRequest { + return lastErr + } remaining := time.Until(deadline) if remaining <= 0 { return lastErr } @@ func runErrorWithRetry(ctx context.Context, e *v1alpha1.Error, fetcher ResourceFetcher, deadline time.Time) error { var lastErr error for { lastErr = evaluateError(ctx, e, fetcher) if lastErr == nil { return nil } + var se *errors.StructuredError + if stderrors.As(lastErr, &se) && se.Code == errors.ErrCodeInvalidRequest { + return lastErr + } remaining := time.Until(deadline) if remaining <= 0 { return lastErr }Also applies to: 193-200
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@validators/chainsaw/inprocess.go` around lines 165 - 173, The retry loops that call evaluateAssert and evaluateError must fail fast when the returned error indicates a non-retryable bad test spec (ErrCodeInvalidRequest); update both loops (the one around evaluateAssert and the one around evaluateError) to check the returned error immediately after the call and if it represents ErrCodeInvalidRequest (use the existing error sentinel/type check your codebase uses, e.g. errors.Is(lastErr, ErrCodeInvalidRequest) or inspecting the error code) return that error immediately instead of continuing to retry until the deadline.validators/chainsaw/fetch.go (1)
115-119:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUse
ErrCodeUnavailablefor apiserver List failures.At Line 117,
resource.Listtransport/apiserver failures are wrapped asErrCodeInternal, while the analogousFetchpath already classifies non-404 API failures asErrCodeUnavailable(Lines 72-73). This can misclassify transient cluster outages upstream.Suggested fix
- return nil, errors.Wrap(errors.ErrCodeInternal, + return nil, errors.Wrap(errors.ErrCodeUnavailable, fmt.Sprintf("failed to list %s in namespace %q", gvk, namespace), err)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@validators/chainsaw/fetch.go` around lines 115 - 119, The List call error is being wrapped with errors.ErrCodeInternal; change it to use errors.ErrCodeUnavailable for apiserver/transport failures to match the Fetch path behavior. Update the error wrapping around resource.List(ctx, opts) so that failures return errors.Wrap(errors.ErrCodeUnavailable, ...) instead of ErrCodeInternal, keeping the same message ("failed to list %s in namespace %q") and preserving the wrapped err; reference symbols: resource.List, ErrCodeInternal, ErrCodeUnavailable, Fetch, gvk, namespace.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@validators/chainsaw/fetch.go`:
- Around line 115-119: The List call error is being wrapped with
errors.ErrCodeInternal; change it to use errors.ErrCodeUnavailable for
apiserver/transport failures to match the Fetch path behavior. Update the error
wrapping around resource.List(ctx, opts) so that failures return
errors.Wrap(errors.ErrCodeUnavailable, ...) instead of ErrCodeInternal, keeping
the same message ("failed to list %s in namespace %q") and preserving the
wrapped err; reference symbols: resource.List, ErrCodeInternal,
ErrCodeUnavailable, Fetch, gvk, namespace.
In `@validators/chainsaw/inprocess.go`:
- Around line 165-173: The retry loops that call evaluateAssert and
evaluateError must fail fast when the returned error indicates a non-retryable
bad test spec (ErrCodeInvalidRequest); update both loops (the one around
evaluateAssert and the one around evaluateError) to check the returned error
immediately after the call and if it represents ErrCodeInvalidRequest (use the
existing error sentinel/type check your codebase uses, e.g. errors.Is(lastErr,
ErrCodeInvalidRequest) or inspecting the error code) return that error
immediately instead of continuing to retry until the deadline.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 9262e585-929c-4279-9875-1494ceaa6129
⛔ Files ignored due to path filters (2)
vendor/k8s.io/client-go/dynamic/fake/simple.gois excluded by!vendor/**vendor/modules.txtis excluded by!vendor/**
📒 Files selected for processing (3)
validators/chainsaw/fetch.govalidators/chainsaw/fetch_test.govalidators/chainsaw/inprocess.go
ArangoGutierrez
left a comment
There was a problem hiding this comment.
Re-reviewed at 8e1e7143ca. All five points are addressed and the fixes hold up:
- Named
errorfail-open —Fetchnow splitsapierrors.IsNotFound→ErrCodeNotFoundfrom everything else →ErrCodeUnavailable, andevaluateErroronly treats a true NotFound as the happy path (stderrors.Is), otherwise it propagates and retries. Fails closed on transient apiserver errors now, matching the old binary. - Timeout cap —
context.WithTimeout(ctx, effectiveTimeout)bounds the whole Test, so the N × effectiveTimeout blowup in the retrying case is gone. - Error-code wrapping — the
ErrCodeInternalwraps around the fetcher calls are all gone; codes propagate untouched throughexecuteStepInProcessandrunChainsawTestInProcess. Went a bit further than I'd flagged. Listcoverage —fetch_test.gois solid: table-driven over namespaced + cluster-scoped, label filtering, and the error-code mappings, plus a dedicated reactor test thatServiceUnavailablemaps toErrCodeUnavailable— that one pins the fail-closed fix directly. Real assertions against the real dynamic/fake client, no theater.- Per-op comment — now accurately describes the shared per-step deadline and the ignored
timeouts.error.
One non-blocking leftover: pkg/defaults/timeouts.go still describes ChainsawAssertTimeout as the outer timeout for the chainsaw binary process that must exceed the 5m internal assert timeout. There's no binary now, and the outer cap is effectiveTimeout — which for every in-tree check is timeouts.assert (5m), so the 6m constant only kicks in as a fallback when a Test omits timeouts.assert. Net budget for a multi-step Test is ~5m vs the old ~6m, fine for these checks. Worth a one-line comment refresh next time that file is touched.
LGTM with that noted. Good turnaround.
Unconditionally capture /etc/containerd/config.toml and /etc/containerd/conf.d/99-nvidia.toml from every kind node container into the debug artifact tarball, even when COLLECT_NODE_RUNTIME_ARTIFACTS=false (the GPU jobs' current default). Cheap (two small files per node) and targets the nvidia-container-toolkit 1.19.1 vs kind worker containerd 2.3.1 schema-drift hypothesis under investigation in #1237 — the failing restart writes the latter file via nvidia-ctk, and we currently have no visibility into its contents post-mortem.
…runner The constant is no longer the outer cap for an external chainsaw binary process — #1236 replaced that path with the in-process runner, which caps each Test under context.WithTimeout(ctx, effectiveTimeout) where effectiveTimeout is the YAML's spec.timeouts.assert if set, otherwise this 6m fallback. Note also that every in-tree check sets timeouts.assert, so this default only applies to Tests that omit it.
|
Thanks for the thorough re-review, @ArangoGutierrez — picked up the Holding off on merging for one more cycle — added a small unconditional dump of |
Deployment-phase health-check validation regressed in the recent in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252): 1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY non-nil error, so a permanent JMESPath evaluation error was retried for the entire assert window (6m) instead of failing fast like the old chainsaw binary did. Classify assertion-engine errors as ErrCodeInvalidRequest (terminal) and short-circuit the retry loops on them. Adds a regression test asserting terminal eval errors fail fast (< AssertRetryInterval) rather than retrying to deadline. 2. recipes/checks/*: the (init)containerStatuses[?...] | length(@) projection throws 'invalid type for: <nil>' on any pod without (init) containers (the common case), feeding defect #1. Coalesce to an empty array across all 22 affected component health checks. 3. recipes/checks/dynamo-platform: validate-deployment-exists asserted a Deployment named 'dynamo-operator', but the chart prefixes it with the release name (dynamo-platform-dynamo-operator-controller-manager), so the assert never matched and retried to the deadline. Match by the stable app.kubernetes.io/name=dynamo-operator label instead. Verified on an EKS H100 cluster: deployment phase goes from an 8m timeout (status=other) to PASSED in ~21s; full deployment 4/4 and conformance 11/11 green.
Deployment-phase health-check validation regressed in the recent in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252): 1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY non-nil error, so a permanent JMESPath evaluation error was retried for the entire assert window (6m) instead of failing fast like the old chainsaw binary did. Classify assertion-engine errors as ErrCodeInvalidRequest (terminal) and short-circuit the retry loops on them. Adds a regression test asserting terminal eval errors fail fast (< AssertRetryInterval) rather than retrying to deadline. 2. recipes/checks/*: the (init)containerStatuses[?...] | length(@) projection throws 'invalid type for: <nil>' on any pod without (init) containers (the common case), feeding defect #1. Coalesce to an empty array across all 22 affected component health checks. 3. recipes/checks/dynamo-platform: validate-deployment-exists asserted a Deployment named 'dynamo-operator', but the chart prefixes it with the release name (dynamo-platform-dynamo-operator-controller-manager), so the assert never matched and retried to the deadline. Match by the stable app.kubernetes.io/name=dynamo-operator label instead. Verified on an EKS H100 cluster: deployment phase goes from an 8m timeout (status=other) to PASSED in ~21s; full deployment 4/4 and conformance 11/11 green.
Deployment-phase health-check validation regressed in the recent in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252): 1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY non-nil error, so a permanent JMESPath evaluation error was retried for the entire assert window (6m) instead of failing fast like the old chainsaw binary did. Classify assertion-engine errors as ErrCodeInvalidRequest (terminal) and short-circuit the retry loops on them. Adds a regression test asserting terminal eval errors fail fast (< AssertRetryInterval) rather than retrying to deadline. 2. recipes/checks/*: the (init)containerStatuses[?...] | length(@) projection throws 'invalid type for: <nil>' on any pod without (init) containers (the common case), feeding defect #1. Coalesce to an empty array across all 22 affected component health checks. 3. recipes/checks/dynamo-platform: validate-deployment-exists asserted a Deployment named 'dynamo-operator', but the chart prefixes it with the release name (dynamo-platform-dynamo-operator-controller-manager), so the assert never matched and retried to the deadline. Match by the stable app.kubernetes.io/name=dynamo-operator label instead. Verified on an EKS H100 cluster: deployment phase goes from an 8m timeout (status=other) to PASSED in ~21s; full deployment 4/4 and conformance 11/11 green.
Deployment-phase health-check validation regressed in the recent in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252): 1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY non-nil error, so a permanent JMESPath evaluation error was retried for the entire assert window (6m) instead of failing fast like the old chainsaw binary did. Classify assertion-engine errors as ErrCodeInvalidRequest (terminal) and short-circuit the retry loops on them. Adds a regression test asserting terminal eval errors fail fast (< AssertRetryInterval) rather than retrying to deadline. 2. recipes/checks/*: the (init)containerStatuses[?...] | length(@) projection throws 'invalid type for: <nil>' on any pod without (init) containers (the common case), feeding defect #1. Coalesce to an empty array across all 22 affected component health checks. 3. recipes/checks/dynamo-platform: validate-deployment-exists asserted a Deployment named 'dynamo-operator', but the chart prefixes it with the release name (dynamo-platform-dynamo-operator-controller-manager), so the assert never matched and retried to the deadline. Match by the stable app.kubernetes.io/name=dynamo-operator label instead. Verified on an EKS H100 cluster: deployment phase goes from an 8m timeout (status=other) to PASSED in ~21s; full deployment 4/4 and conformance 11/11 green.
Deployment-phase health-check validation regressed in the recent in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252): 1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY non-nil error, so a permanent JMESPath evaluation error was retried for the entire assert window (6m) instead of failing fast like the old chainsaw binary did. Classify assertion-engine errors as ErrCodeInvalidRequest (terminal) and short-circuit the retry loops on them. Adds a regression test asserting terminal eval errors fail fast (< AssertRetryInterval) rather than retrying to deadline. 2. recipes/checks/*: the (init)containerStatuses[?...] | length(@) projection throws 'invalid type for: <nil>' on any pod without (init) containers (the common case), feeding defect #1. Coalesce to an empty array across all 22 affected component health checks. 3. recipes/checks/dynamo-platform: validate-deployment-exists asserted a Deployment named 'dynamo-operator', but the chart prefixes it with the release name (dynamo-platform-dynamo-operator-controller-manager), so the assert never matched and retried to the deadline. Match by the stable app.kubernetes.io/name=dynamo-operator label instead. Verified on an EKS H100 cluster: deployment phase goes from an 8m timeout (status=other) to PASSED in ~21s; full deployment 4/4 and conformance 11/11 green.
Summary
The deployment validator no longer exec's
/usr/local/bin/chainsawto run registry-declared health checks. A pure-Go executor walks the samechainsaw.kyverno.io/v1alpha1Test AST and dispatchesassert/errorto kyverno-json'schecks.Checkengine — the exact subset the read-only allowlist already restricts content to.Motivation / Context
The deployment validator image shipped a pinned chainsaw binary and a Dockerfile
chainsaw-fetchmulti-stage just to invoke two operations (assert/error) that the in-process kyverno-json engine already implements. Dropping the binary eliminates a CHAINSAW_* build-arg matrix across five workflows, removes a release-tarball download from the image build, and shrinks the validator's supply-chain footprint to a single static Go binary.Fixes: #1236
Related: #1219, #1220, #1222, #1223
Type of Change
Component(s) Affected
pkg/validator)Implementation Notes
validators/chainsaw/inprocess.go: unmarshals a v1alpha1 Test, walksspec.steps[].try[], honors per-testspec.timeouts.assert, retries until deadline, dispatchesAssert/Errortochecks.Checkwith the project'sapis.DefaultCompilers.validators/chainsaw/fetch.go:clusterFetchergainsList()via the dynamic client + label-selector composition soErroroperations with label selectors work.validators/chainsaw/runner.go: package doc rewritten for the in-process model;runConfig/RunOption/WithChainsawBinaryremoved.validators/chainsaw/binary.go+binary_test.go: deleted.validators/chainsaw/inprocess_test.go: parity sweep across all 27 in-treerecipes/checks/*/health-check.yamlfiles plus unit cases for happy / assert-fails / error-fires.validators/deployment/Dockerfile: drops thechainsaw-fetchmulti-stage.Makefile+ five workflows: dropCHAINSAW_VERSION/CHAINSAW_SHA256_*build-args.cmd/gate/Dockerfileand.github/actions/e2estill consume them for their own chainsaw runtime — unchanged.docs/contributor/validator.md: documents the in-process executor.Testing
make qualify # passedParity tests in
validators/chainsaw/inprocess_test.gocover all 27 in-tree health-check YAMLs (each wrapped in a 200ms context to short-circuit retry loops without a live cluster).Risk Assessment
Rollout notes: No CLI / API surface change. Allowlist already restricted registry content to assert/error before this PR, so existing
recipes/checks/*/health-check.yamlcontent is byte-for-byte compatible with the in-process executor. The build-gate image still bundles chainsaw for its own runtime (separate code path).Checklist
make testwith-race)make lint)git commit -S)