feat(validators): replace chainsaw binary with in-process executor by mchmarny · Pull Request #1252 · NVIDIA/aicr

mchmarny · 2026-06-09T13:16:31Z

Summary

The deployment validator no longer exec's /usr/local/bin/chainsaw to run registry-declared health checks. A pure-Go executor walks the same chainsaw.kyverno.io/v1alpha1 Test AST and dispatches assert / error to kyverno-json's checks.Check engine — the exact subset the read-only allowlist already restricts content to.

Motivation / Context

The deployment validator image shipped a pinned chainsaw binary and a Dockerfile chainsaw-fetch multi-stage just to invoke two operations (assert / error) that the in-process kyverno-json engine already implements. Dropping the binary eliminates a CHAINSAW_* build-arg matrix across five workflows, removes a release-tarball download from the image build, and shrinks the validator's supply-chain footprint to a single static Go binary.

Fixes: #1236
Related: #1219, #1220, #1222, #1223

Type of Change

Refactoring (no functional changes)
Build/CI/tooling

Component(s) Affected

Validator (pkg/validator)
Build/CI

Implementation Notes

validators/chainsaw/inprocess.go: unmarshals a v1alpha1 Test, walks spec.steps[].try[], honors per-test spec.timeouts.assert, retries until deadline, dispatches Assert / Error to checks.Check with the project's apis.DefaultCompilers.
validators/chainsaw/fetch.go: clusterFetcher gains List() via the dynamic client + label-selector composition so Error operations with label selectors work.
validators/chainsaw/runner.go: package doc rewritten for the in-process model; runConfig / RunOption / WithChainsawBinary removed.
validators/chainsaw/binary.go + binary_test.go: deleted.
validators/chainsaw/inprocess_test.go: parity sweep across all 27 in-tree recipes/checks/*/health-check.yaml files plus unit cases for happy / assert-fails / error-fires.
validators/deployment/Dockerfile: drops the chainsaw-fetch multi-stage.
Makefile + five workflows: drop CHAINSAW_VERSION / CHAINSAW_SHA256_* build-args. cmd/gate/Dockerfile and .github/actions/e2e still consume them for their own chainsaw runtime — unchanged.
docs/contributor/validator.md: documents the in-process executor.

Testing

make qualify  # passed

Parity tests in validators/chainsaw/inprocess_test.go cover all 27 in-tree health-check YAMLs (each wrapped in a 200ms context to short-circuit retry loops without a live cluster).

Risk Assessment

Medium — Touches the deployment validator's execution model; image build matrix changes ripple across 5 workflows.

Rollout notes: No CLI / API surface change. Allowlist already restricted registry content to assert/error before this PR, so existing recipes/checks/*/health-check.yaml content is byte-for-byte compatible with the in-process executor. The build-gate image still bundles chainsaw for its own runtime (separate code path).

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs if user-facing behavior changed
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

Implements issue #1236. The deployment validator no longer exec's /usr/local/bin/chainsaw to run registry-declared health checks; the allowlist already restricts content to read-only `assert` / `error` operations, which is exactly the subset kyverno-json's `checks.Check` engine implements. - validators/chainsaw/inprocess.go: pure-Go executor that unmarshals a chainsaw.kyverno.io/v1alpha1 Test, walks spec.steps[].try[], and dispatches assert/error to checks.Check. Honors per-test spec.timeouts.assert override. - validators/chainsaw/fetch.go: clusterFetcher gains List() via the dynamic client + label-selector composition. - validators/chainsaw/binary.go + binary_test.go: deleted. - validators/chainsaw/runner.go: drops runConfig / RunOption / WithChainsawBinary; package doc rewritten for the in-process model. - validators/chainsaw/inprocess_test.go: parity against the 27 in-tree health-check YAMLs; happy / assert-fails / error-fires unit cases. - validators/deployment/Dockerfile: drops the chainsaw-fetch multi-stage; final image no longer ships the chainsaw binary. - Makefile + .github/workflows/{on-push,on-tag,uat-aws,uat-gcp, vuln-scan-images}.yaml: drop CHAINSAW_VERSION / CHAINSAW_SHA256_* build-args wired into the deployment image. build-gate still consumes them for its own chainsaw runtime — unchanged. - docs/contributor/validator.md: documents the in-process executor.

github-actions · 2026-06-09T13:24:53Z

Coverage Report ✅

Metric	Value
Coverage	76.3%
Threshold	75%
Status	Pass

Coverage Badge

![Coverage](https://img.shields.io/badge/coverage-76.3%25-green)

Coverage unchanged by this PR.

- inprocess.go: replace remaining fmt.Errorf returns with pkg/errors (ErrCodeInternal / ErrCodeNotFound as appropriate) so error codes flow through wrap chains instead of being flattened to internal at the top of runChainsawTestInProcess. - inprocess.go: add ctx.Err() checks at the top of the non-retry loops in runChainsawTestInProcess, executeStepInProcess, and the list-iteration paths of evaluateAssert / evaluateError so cancellation short-circuits even when no retry select runs. - inprocess_test.go: convert the three scenario tests into a single table-driven TestRunChainsawTestInProcess and add a fourth case exercising metadata.labels selector filtering. - inprocess_test.go: fakeFetcher.List now honors the label selector (compares against metadata.labels on each item) so selector-path regressions are catchable. - inprocess_test.go: corpus parity test now fails on any non-StructuredError surfaced by the executor (previously silently ignored), so unexpected error types can't mask regressions. - docs/contributor/validator.md: clarify the runtime split between the two chainsaw surfaces — `make check-health` shells out to the developer-installed chainsaw CLI, while `aicr validate --phase deployment` executes in-process via inprocess.go.

- fetch.go: distinguish a true 404 from any other apiserver failure. Previously every Get error was wrapped as ErrCodeNotFound, so a transient 5xx / timeout / forbidden was indistinguishable from the resource being absent — a chainsaw `error:` block then treated it as the happy path and silently passed (fails open). Use apierrors.IsNotFound to gate; non-404 errors now wrap as ErrCodeUnavailable so negative health checks fail closed, matching the chainsaw binary's prior behavior. - inprocess.go: evaluateError name-branch only short-circuits to nil on a true NotFound (checked via stderrors.Is against the StructuredError code); other errors propagate so the check fails. - inprocess.go: stop double-wrapping structured errors from fetcher.List / fetcher.Fetch in evaluateAssert and evaluateError — the inner errors already carry the correct code, and re-wrapping with ErrCodeInternal flattens ErrCodeNotFound / ErrCodeUnavailable / ErrCodeInvalidRequest at the API surface. Per CLAUDE.md's "don't double-wrap errors that already have proper codes". - inprocess.go: also remove the executeStepInProcess and runChainsawTestInProcess outer wraps that were rewriting the same inner code. Step / component context is captured in the slog line on the failure path. - inprocess.go: cap the whole Test under a single context.WithTimeout(ctx, effectiveTimeout) so an N-step Test can no longer run N × effectiveTimeout in the unhealthy / retrying case — restores the outer cap the chainsaw binary path had via ChainsawAssertTimeout. - inprocess.go: fix the executeStepInProcess doc — ops in a step share one deadline (computed at step entry), not a per-op clock, and only timeouts.assert is read (timeouts.error is ignored; none of the in-tree checks set it). - fetch_test.go (new): direct unit coverage for clusterFetcher against k8s.io/client-go/dynamic/fake. Cases: namespaced + cluster- scoped Get; NotFound, invalid apiVersion, unknown kind error codes; transient apiserver 503 → ErrCodeUnavailable (the failing-closed regression guard for the chainsaw `error:` happy path); namespaced List with no selector, with single-key selector, with multi-key selector; cluster-scoped List; List error-code paths. - go.mod / vendor: add k8s.io/client-go/dynamic/fake for the tests.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

validators/chainsaw/inprocess.go (1)

165-173: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast on non-retryable ErrCodeInvalidRequest errors.

Both retry loops keep retrying every failure until deadline. If evaluateAssert/evaluateError returns ErrCodeInvalidRequest (bad Test spec), the executor burns the full timeout instead of failing immediately.

Suggested fix

 func runAssertWithRetry(ctx context.Context, a *v1alpha1.Assert, fetcher ResourceFetcher, deadline time.Time) error {
 	var lastErr error
 	for {
 		lastErr = evaluateAssert(ctx, a, fetcher)
 		if lastErr == nil {
 			return nil
 		}
+		var se *errors.StructuredError
+		if stderrors.As(lastErr, &se) && se.Code == errors.ErrCodeInvalidRequest {
+			return lastErr
+		}
 		remaining := time.Until(deadline)
 		if remaining <= 0 {
 			return lastErr
 		}
@@
 func runErrorWithRetry(ctx context.Context, e *v1alpha1.Error, fetcher ResourceFetcher, deadline time.Time) error {
 	var lastErr error
 	for {
 		lastErr = evaluateError(ctx, e, fetcher)
 		if lastErr == nil {
 			return nil
 		}
+		var se *errors.StructuredError
+		if stderrors.As(lastErr, &se) && se.Code == errors.ErrCodeInvalidRequest {
+			return lastErr
+		}
 		remaining := time.Until(deadline)
 		if remaining <= 0 {
 			return lastErr
 		}

Also applies to: 193-200

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@validators/chainsaw/inprocess.go` around lines 165 - 173, The retry loops
that call evaluateAssert and evaluateError must fail fast when the returned
error indicates a non-retryable bad test spec (ErrCodeInvalidRequest); update
both loops (the one around evaluateAssert and the one around evaluateError) to
check the returned error immediately after the call and if it represents
ErrCodeInvalidRequest (use the existing error sentinel/type check your codebase
uses, e.g. errors.Is(lastErr, ErrCodeInvalidRequest) or inspecting the error
code) return that error immediately instead of continuing to retry until the
deadline.

validators/chainsaw/fetch.go (1)

115-119: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use ErrCodeUnavailable for apiserver List failures.

At Line 117, resource.List transport/apiserver failures are wrapped as ErrCodeInternal, while the analogous Fetch path already classifies non-404 API failures as ErrCodeUnavailable (Lines 72-73). This can misclassify transient cluster outages upstream.

Suggested fix

-	return nil, errors.Wrap(errors.ErrCodeInternal,
+	return nil, errors.Wrap(errors.ErrCodeUnavailable,
 		fmt.Sprintf("failed to list %s in namespace %q", gvk, namespace), err)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@validators/chainsaw/fetch.go` around lines 115 - 119, The List call error is
being wrapped with errors.ErrCodeInternal; change it to use
errors.ErrCodeUnavailable for apiserver/transport failures to match the Fetch
path behavior. Update the error wrapping around resource.List(ctx, opts) so that
failures return errors.Wrap(errors.ErrCodeUnavailable, ...) instead of
ErrCodeInternal, keeping the same message ("failed to list %s in namespace %q")
and preserving the wrapped err; reference symbols: resource.List,
ErrCodeInternal, ErrCodeUnavailable, Fetch, gvk, namespace.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@validators/chainsaw/fetch.go`:
- Around line 115-119: The List call error is being wrapped with
errors.ErrCodeInternal; change it to use errors.ErrCodeUnavailable for
apiserver/transport failures to match the Fetch path behavior. Update the error
wrapping around resource.List(ctx, opts) so that failures return
errors.Wrap(errors.ErrCodeUnavailable, ...) instead of ErrCodeInternal, keeping
the same message ("failed to list %s in namespace %q") and preserving the
wrapped err; reference symbols: resource.List, ErrCodeInternal,
ErrCodeUnavailable, Fetch, gvk, namespace.

In `@validators/chainsaw/inprocess.go`:
- Around line 165-173: The retry loops that call evaluateAssert and
evaluateError must fail fast when the returned error indicates a non-retryable
bad test spec (ErrCodeInvalidRequest); update both loops (the one around
evaluateAssert and the one around evaluateError) to check the returned error
immediately after the call and if it represents ErrCodeInvalidRequest (use the
existing error sentinel/type check your codebase uses, e.g. errors.Is(lastErr,
ErrCodeInvalidRequest) or inspecting the error code) return that error
immediately instead of continuing to retry until the deadline.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 9262e585-929c-4279-9875-1494ceaa6129

📥 Commits

Reviewing files that changed from the base of the PR and between c94f051 and 8e1e714.

⛔ Files ignored due to path filters (2)

vendor/k8s.io/client-go/dynamic/fake/simple.go is excluded by !vendor/**
vendor/modules.txt is excluded by !vendor/**

📒 Files selected for processing (3)

validators/chainsaw/fetch.go
validators/chainsaw/fetch_test.go
validators/chainsaw/inprocess.go

ArangoGutierrez

Re-reviewed at 8e1e7143ca. All five points are addressed and the fixes hold up:

Named error fail-open — Fetch now splits apierrors.IsNotFound → ErrCodeNotFound from everything else → ErrCodeUnavailable, and evaluateError only treats a true NotFound as the happy path (stderrors.Is), otherwise it propagates and retries. Fails closed on transient apiserver errors now, matching the old binary.
Timeout cap — context.WithTimeout(ctx, effectiveTimeout) bounds the whole Test, so the N × effectiveTimeout blowup in the retrying case is gone.
Error-code wrapping — the ErrCodeInternal wraps around the fetcher calls are all gone; codes propagate untouched through executeStepInProcess and runChainsawTestInProcess. Went a bit further than I'd flagged.
List coverage — fetch_test.go is solid: table-driven over namespaced + cluster-scoped, label filtering, and the error-code mappings, plus a dedicated reactor test that ServiceUnavailable maps to ErrCodeUnavailable — that one pins the fail-closed fix directly. Real assertions against the real dynamic/fake client, no theater.
Per-op comment — now accurately describes the shared per-step deadline and the ignored timeouts.error.

One non-blocking leftover: pkg/defaults/timeouts.go still describes ChainsawAssertTimeout as the outer timeout for the chainsaw binary process that must exceed the 5m internal assert timeout. There's no binary now, and the outer cap is effectiveTimeout — which for every in-tree check is timeouts.assert (5m), so the 6m constant only kicks in as a fallback when a Test omits timeouts.assert. Net budget for a multi-step Test is ~5m vs the old ~6m, fine for these checks. Worth a one-line comment refresh next time that file is touched.

LGTM with that noted. Good turnaround.

Unconditionally capture /etc/containerd/config.toml and /etc/containerd/conf.d/99-nvidia.toml from every kind node container into the debug artifact tarball, even when COLLECT_NODE_RUNTIME_ARTIFACTS=false (the GPU jobs' current default). Cheap (two small files per node) and targets the nvidia-container-toolkit 1.19.1 vs kind worker containerd 2.3.1 schema-drift hypothesis under investigation in #1237 — the failing restart writes the latter file via nvidia-ctk, and we currently have no visibility into its contents post-mortem.

…runner The constant is no longer the outer cap for an external chainsaw binary process — #1236 replaced that path with the in-process runner, which caps each Test under context.WithTimeout(ctx, effectiveTimeout) where effectiveTimeout is the YAML's spec.timeouts.assert if set, otherwise this 6m fallback. Note also that every in-tree check sets timeouts.assert, so this default only applies to Tests that omit it.

mchmarny · 2026-06-09T15:56:46Z

Thanks for the thorough re-review, @ArangoGutierrez — picked up the ChainsawAssertTimeout comment refresh in bb3fd047. Rewrote it to describe its actual role as the fallback per-Test budget when YAML omits spec.timeouts.assert, and called out that #1236 removed the binary the prior comment referenced.

Holding off on merging for one more cycle — added a small unconditional dump of /etc/containerd/config.toml and /etc/containerd/conf.d/99-nvidia.toml to collect-debug-artifacts.sh in 9ecbf064 to capture the toml that nvidia-ctk emits, since the next GPU run will help unblock #1237.

Deployment-phase health-check validation regressed in the recent in-process executor + deepened-check stack (NVIDIA#1235/NVIDIA#1245/NVIDIA#1252): 1. inprocess.go: runAssertWithRetry/runErrorWithRetry retried ANY non-nil error, so a permanent JMESPath evaluation error was retried for the entire assert window (6m) instead of failing fast like the old chainsaw binary did. Classify assertion-engine errors as ErrCodeInvalidRequest (terminal) and short-circuit the retry loops on them. Adds a regression test asserting terminal eval errors fail fast (< AssertRetryInterval) rather than retrying to deadline. 2. recipes/checks/*: the (init)containerStatuses[?...] | length(@) projection throws 'invalid type for: <nil>' on any pod without (init) containers (the common case), feeding defect #1. Coalesce to an empty array across all 22 affected component health checks. 3. recipes/checks/dynamo-platform: validate-deployment-exists asserted a Deployment named 'dynamo-operator', but the chart prefixes it with the release name (dynamo-platform-dynamo-operator-controller-manager), so the assert never matched and retried to the deadline. Match by the stable app.kubernetes.io/name=dynamo-operator label instead. Verified on an EKS H100 cluster: deployment phase goes from an 8m timeout (status=other) to PASSED in ~21s; full deployment 4/4 and conformance 11/11 green.

mchmarny requested review from a team as code owners June 9, 2026 13:16

mchmarny added the enhancement label Jun 9, 2026

mchmarny self-assigned this Jun 9, 2026

github-actions Bot added area/ci area/docs size/XL labels Jun 9, 2026

This comment was marked as resolved.

Sign in to view

mchmarny and others added 2 commits June 9, 2026 06:43

Merge branch 'main' into feat/1236-pure-go-chainsaw

81d70ef

This comment was marked as resolved.

Sign in to view

coderabbitai Bot reviewed Jun 9, 2026

View reviewed changes

ArangoGutierrez reviewed Jun 9, 2026

View reviewed changes

mchmarny added 2 commits June 9, 2026 08:27

This comment was marked as resolved.

Sign in to view

lockwobr approved these changes Jun 9, 2026

View reviewed changes

mchmarny merged commit f6cb3cd into main Jun 9, 2026
31 of 34 checks passed

mchmarny deleted the feat/1236-pure-go-chainsaw branch June 9, 2026 17:13

mchmarny mentioned this pull request Jun 9, 2026

fix(ci): build patched nvkind with --config-source=file (#1237) #1258

Merged

11 tasks

yuanchen8911 mentioned this pull request Jun 9, 2026

fix(validator): fail fast on eval errors; nil-safe health checks #1261

Merged

11 tasks

This was referenced Jun 11, 2026

chore: trim agent rules under context limit, add review patterns #1322

Merged

[Tracker]: validate/recipe CLI UX papercuts (status color, output clarity) #1383

Closed

[Bug]: UX problems + actually fails to deploy eventually #1361

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(validators): replace chainsaw binary with in-process executor#1252

feat(validators): replace chainsaw binary with in-process executor#1252
mchmarny merged 6 commits into
mainfrom
feat/1236-pure-go-chainsaw

mchmarny commented Jun 9, 2026

Uh oh!

github-actions Bot commented Jun 9, 2026 •

edited

Loading

Uh oh!

This comment was marked as resolved.

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

coderabbitai Bot left a comment

Uh oh!

ArangoGutierrez left a comment

Uh oh!

mchmarny commented Jun 9, 2026

Uh oh!

This comment was marked as resolved.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

mchmarny commented Jun 9, 2026

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

github-actions Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Coverage Report ✅

Uh oh!

This comment was marked as resolved.

This comment was marked as resolved.

Uh oh!

This comment was marked as resolved.

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

ArangoGutierrez left a comment

Choose a reason for hiding this comment

Uh oh!

mchmarny commented Jun 9, 2026

Uh oh!

This comment was marked as resolved.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

github-actions Bot commented Jun 9, 2026 •

edited

Loading