refactor(kwok): migrate argocd-* and flux-oci sync gate from bash to chainsaw#1050
Conversation
…chainsaw
Replaces ~500 lines of bash + jq sync-wait machinery in
kwok/scripts/validate-scheduling.sh (wait_for_argocd_root_app,
wait_for_argocd_sync, dump_argocd_failures, and the symmetric flux-*
peers) with idiomatic Chainsaw scenarios under tests/chainsaw/kwok/.
What moved to Chainsaw:
- tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml: asserts the root
Application is present (30s grace) and every Application in the
argocd namespace reaches one of the 4 terminal-pass states (Synced+
Healthy, Synced+Progressing, OutOfSync+Healthy+Succeeded, Synced+
Degraded+Succeeded). 4-arm OR is a single JMESPath expression
nested under `status:`. catch: block dumps per-App status +
repo-server + application-controller logs (StatefulSet/Deployment
fallback preserved verbatim).
- tests/chainsaw/kwok/flux-sync/chainsaw-test.yaml: sequential
assertions on OCIRepository → Kustomization → HelmReleases →
ArtifactGenerators, each Ready=True. Per-step catch: blocks dump
the controller logs for the failing stage.
What stayed in bash:
- Per-deployer binding selection: ARGOCD_ROOT_APP and the Flux CR
names are still set in the matrix-shape preamble.
- Exit-code translation: Chainsaw collapses pass/fail to 0/1; the
bash shim re-emits EXIT_ARGOCD_SYNC_TIMEOUT (50) on chainsaw
failure so run-all-recipes.sh's 3-strike rule still fires.
Behavior difference vs. the old gate: hard errors (RBAC, CRD
missing) consume strike budget instead of failing fast. Accepted
tradeoff — 3 attempts then bail vs. 1.
What's deferred:
- Negative-path test (issue #962 acceptance criterion #2). Requires
fixture infrastructure to deploy a known-bad Application; out of
scope for this PR. Follow-up.
Validation status:
Cannot run KWOK + ArgoCD locally. Chainsaw `lint test` validates the
schema (passes for both files). yamllint and shellcheck pass. The
4-arm JMESPath expression syntax is structurally correct but its
runtime behavior is validated only by CI's argocd-* and flux-oci
matrix lanes. Marking PR as draft until CI confirms parity.
Fixes #962
Coverage Report ✅
Coverage BadgeNo Go source files changed in this PR. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThis PR replaces in-script Argo CD and Flux sync polling with Chainsaw Test manifests and updates validate-scheduling.sh to call those tests. It adds tests/tests/chainsaw/kwok/argocd-sync and tests/chainsaw/kwok/flux-sync, refactors wait_for_argocd_sync and wait_for_flux_sync to invoke chainsaw, and updates the kwok-test composite action and workflows to install and pass Chainsaw version/checksum inputs. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@kwok/scripts/validate-scheduling.sh`:
- Around line 1200-1203: The script is passing an inline key=value to Chainsaw's
--values (which expects YAML), so update the invocation in
validate-scheduling.sh where the variable chainsaw_bin is used: replace --values
"rootApp=${ARGOCD_ROOT_APP}" with either --set "rootApp=${ARGOCD_ROOT_APP}" (or
--set-string if you need to preserve it as a string) or feed proper YAML to
--values via a heredoc or temp file; change the call around the test invocation
that references chainsaw_bin and adjust quoting accordingly so the value is
passed with --set/--set-string or as a YAML file/stdin.
In `@tests/chainsaw/kwok/flux-sync/chainsaw-test.yaml`:
- Around line 105-117: The HelmRelease assertion "assert-all-helmreleases-ready"
is too strict—change the predicate to accept either Ready==True OR the terminal
fallback (Released==True with the first history entry marked deployed and
Stalled!=True); update the assert under resource HelmRelease to check
(conditions[?type=='Ready']|[0].status == 'True') OR (status.release.status ==
'Released' AND status.history[0].deployed == true AND
(conditions[?type=='Stalled']|[0].status != 'True')) so KWOK/Flux gate contract
semantics are preserved.
- Around line 142-145: The readiness assertion for kind ArtifactGenerator uses
the wrong apiVersion; update the selector in
tests/chainsaw/kwok/flux-sync/chainsaw-test.yaml so ArtifactGenerator is
referenced as apiVersion: source.extensions.fluxcd.io/v1beta1 (not
source.toolkit.fluxcd.io/v1beta2) so the readiness check
(conditions[?type=='Ready']|[0].status == 'True') can match actual CRs; locate
the ArtifactGenerator block in the YAML and replace the apiVersion value
accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 24293c8b-7720-4972-acc2-816cebd086c2
📒 Files selected for processing (3)
kwok/scripts/validate-scheduling.shtests/chainsaw/kwok/argocd-sync/chainsaw-test.yamltests/chainsaw/kwok/flux-sync/chainsaw-test.yaml
Three valid findings from the initial CodeRabbit review: 1. Chainsaw `--values` is YAML-only (file/stdin/heredoc); inline `key=value` was a silent no-op that left bindings undefined, causing every argocd-* lane to assert against the default `aicr-stack` rootApp regardless of deployer. Switched both shims to `--set`, which is chainsaw's inline override flag and populates the same $values binding. This is the root cause of the failing argocd-oci and argocd-helm-oci matrix jobs. 2. HelmRelease pass criteria narrowed from the bash gate. The old `wait_for_flux_sync` accepted Ready=True OR the terminal fallback (Released=True + history[0]==deployed + Stalled\!=True) to avoid false negatives on KWOK lanes where helm-controller's install succeeded but the Ready condition lagged. Re-added the fallback as a JMESPath disjunction. Stalled\!=True gate prevents masking a genuinely failed reconcile with a stale prior Released. 3. ArtifactGenerator apiVersion. CR lives under `source.extensions.fluxcd.io/v1beta1` (extension API), not `source.toolkit.fluxcd.io/v1beta2` (core source-controller group). Matches the apiVersion the bundler emits at pkg/bundler/deployer/flux/templates/artifactgenerator.yaml.tmpl; the prior incorrect group would have silently matched zero CRs and no-op'd the stage-4 check. Verification: - chainsaw lint test: both files valid - bash -n + shellcheck: clean (pre-existing SC1091 unrelated) - yamllint: clean Still draft — CI will tell us whether the JMESPath expressions behave correctly against live resources.
CI on PR #1050 surfaced "chainsaw not found on PATH" — the kwok-test composite action installs kind/helm/kubectl/yq/flux/goreleaser but did not install chainsaw. The bash sync gate that PR #1050 migrated now depends on chainsaw at runtime via kwok/scripts/validate-scheduling.sh. Three wiring changes: - .github/actions/kwok-test/action.yml: add chainsaw_version + chainsaw_sha256 inputs and pass them through to setup-build-tools with install_chainsaw=true. - Same file: cache /usr/local/bin/chainsaw alongside the other tool binaries in the kwok-tools cache. - .github/workflows/kwok-recipes.yaml: thread the chainsaw version and sha256 from steps.versions.outputs into all three kwok-test matrix calls (Tier 1, Tier 2, Tier 3).
CI surfaced: chainsaw's per-test temp namespace stalls in Terminating on KWOK clusters (no kube-controller-manager to finalize the `kubernetes` namespace finalizer), the 30s cleanup timeout fires, and the test is reported as failed even when every assertion passed inside 8s. Set spec.skipDelete: true on both chainsaw scenarios. Safe because neither test creates resources of its own — both only assert on existing Application / HelmRelease / OCIRepository CRs in argocd or flux-system namespaces. Verified locally with chainsaw lint test (both valid).
…piVersion Two issues surfaced by CI on the post-skipDelete run: 1. --set rootApp=nvidia-stack didn't override the binding. chainsaw's --set populates the implicit \$values binding, NOT the explicit spec.bindings — they're separate namespaces. Result: argocd-oci lane (rootApp=nvidia-stack) ran with the binding default (aicr-stack) and failed "actual resource not found" on every argocd-oci/flux-oci matrix job that didn't happen to match the default. Threaded the override through: bindings now read `(\$values.<key> || '<default>')` so the CLI --set value wins, with the literal fallback preserved for ad-hoc local runs. 2. OCIRepository apiVersion source.toolkit.fluxcd.io/v1beta2 — the stable group/version under Flux v2.x is /v1 (matches what the bundler emits at pkg/bundler/deployer/flux/templates/). The v1beta2 selector matched zero CRs and the assertion errored with "no matches for source.toolkit.fluxcd.io/v1beta2, Resource=". Switched to /v1. Pattern from the previous CI run that confirms the diagnosis: helm and argocd-helm-oci passed (argocd-helm-oci's rootApp default aicr-stack matched the actual root App), argocd-oci and flux-oci failed (rootApp/apiVersion mismatch).
Two flux-sync failures from CI run 26484518107:
1. HelmRelease assertion was implicitly scoped to chainsaw's per-test
namespace (chainsaw-main-goblin), where no HRs exist:
ASSERT | RUN | helm.toolkit.fluxcd.io/v2/HelmRelease @ chainsaw-main-goblin/*
ASSERT | ERROR | "no actual resource found"
The bundler places every HelmRelease in the Flux install namespace
(Generator.Namespace, default "flux-system" via
config.DefaultFluxNamespace). Added explicit metadata.namespace:
flux-system so the selector finds them.
2. ArtifactGenerator step was `assert:` which fails on zero-match.
Pure-Helm flux bundles don't emit ArtifactGenerators at all, so
the step would fail on every recipe without local charts. Flipped
to `error:` ("no AG in non-Ready state"), which:
- PASSES on zero AGs (vacuously true)
- PASSES when every AG is Ready=True
- FAILS when any AG is non-Ready
Same semantic as the pre-migration bash gate's optional check.
|
Heads up — looks like the new chainsaw sync gate races ahead of child Application materialization for App-of-Apps deployers, causing intermittent (now reliable) failures on Filed #1061 with the analysis and a proposed minimum-invasive fix (add Side-by-side pod-count evidence in the issue (35 pre-migration → 13 occasional pass → 5 reliable fail). No prior commit between 15:17 and 15:36 UTC triggered the transition, so it's a race condition that's now being lost more often, not a separate regression. Happy to put up a PR if you'd like. |
…IA#1061) The chainsaw assertion at `assert-all-applications-pass` could pass within ~4-5 seconds of helm install, before Argo CD had completed the initial App-of-Apps sync that creates the child Applications. With only the root Application present in the argocd namespace and the predicate satisfied for that root alone, the assertion returned PASS while the cluster was still half-deployed. The subsequent pod-scheduling check in `validate-scheduling.sh` then tripped on KIND-default kube-system pods (coredns, local-path-provisioner) that the AICR bundle's pods would otherwise have rendered insignificant. Same-SHA evidence: run 26523455927 on `8b5ef744` failed on first attempt and passed on retry of identical content — a textbook race-condition signature. The pre-NVIDIA#1050 bash gate avoided the race because its polling loop took long enough that children materialized before the next re-list. Tighten the JMESPath predicate so all four arms require `operationState.phase == 'Succeeded'`. An Application can only count as passing after Argo CD has completed at least one sync operation; for the App-of-Apps that's the gate ensuring the children exist and join the selector's match set on subsequent polls. The four sync/health combinations remain (canonical pass + the three KWOK/operator-mutation relaxations the old bash gate accepted), so legitimate Application states still resolve to PASS once Argo CD has actually run. The sibling flux-sync chainsaw test does not need a parallel change: it asserts `Ready=True` conditions on HelmRelease and Kustomization, which already require their controllers to have completed reconciliation. Also add `tests/chainsaw/kwok/**` to the `KWOK Cluster Validation` workflow path filters. PR NVIDIA#1050 placed the new chainsaw scenarios under `tests/chainsaw/kwok/`, but the workflow filters were not updated to match — so changes to the scenarios silently skipped KWOK CI. Now any edit to the chainsaw KWOK scenarios re-runs the matrix automatically. Fixes NVIDIA#1061 Related NVIDIA#1050 (the chainsaw migration that introduced the race)
Summary
Migrates the KWOK
argocd-*andflux-ocisync gate from ~500 lines of bash + jq (wait_for_argocd_sync,wait_for_argocd_root_app,dump_argocd_failures, and the symmetric flux-* peers) to idiomatic Chainsaw scenarios undertests/chainsaw/kwok/.Final diff: 5 files, -520 / +362 LOC (net -158).
Fixes #962
Motivation / Context
#962 proposed moving the argocd-* sync gate to Chainsaw to (a) shorten the bash surface, (b) get structured per-resource failure output instead of opaque "GitOps sync timeout strike 1/3", and (c) align with the rest of the test surface. The flux-oci sync gate was added symmetrically after the issue was filed; this PR migrates both in one shot to avoid a half-bash / half-chainsaw state.
Type of Change
Component(s) Affected
kwok/scripts/validate-scheduling.sh)tests/chainsaw/kwok/(new).github/actions/kwok-test/action.yml+.github/workflows/kwok-recipes.yaml(chainsaw install wiring)Implementation Notes
Chainsaw scenarios:
tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml— asserts root Application is present (30s grace) and every Application inargocdnamespace reaches one of the 4 terminal-pass states. The 4-arm OR is a single JMESPath expression nested understatus:.catch:block dumps per-App status + repo-server + application-controller logs (StatefulSet/Deployment fallback preserved for Argo CD chart 9.x vs. older).tests/chainsaw/kwok/flux-sync/chainsaw-test.yaml— sequential per-stage assertions: OCIRepository → Kustomization → HelmReleases → ArtifactGenerators, eachReady=True. Per-stepcatch:blocks dump the controller logs for the failing stage. The HelmRelease predicate accepts bothReady=TrueAND the terminal fallback (Released=True + history[0]=='deployed' + Stalled!=True), mirroring the pre-migration bash contract so KWOK pod-readiness gaps don't false-negative.Bash shim (thin):
wait_for_argocd_sync()andwait_for_flux_sync()survive as ~30-line shims that invokechainsaw testwith the right--setbindings and translate exit codes.wait_for_argocd_root_app,dump_argocd_failures,wait_for_flux_root_app,dump_flux_failures) are removed.What bash still owns:
ARGOCD_ROOT_APP, Flux CR names) — still set in the matrix-shape preamble, passed to chainsaw via--set.EXIT_ARGOCD_SYNC_TIMEOUT=50on chainsaw failure sorun-all-recipes.sh's 3-strike rule still fires. Behavior difference vs. the old gate: hard errors (RBAC, CRD missing) now consume strike budget instead of failing fast (3 attempts then bail vs. 1). Acceptable tradeoff for the LOC reduction; documented in the shim comment.CI wiring:
.github/actions/kwok-test/action.yml: addedchainsaw_version+chainsaw_sha256inputs, threaded through tosetup-build-toolswithinstall_chainsaw: 'true', added/usr/local/bin/chainsawto the kwok-tools cache key..github/workflows/kwok-recipes.yaml: passes the versions fromsteps.versions.outputs.chainsaw/chainsaw_sha256_linux_amd64into all three kwok-test matrix calls (Tier 1 / 2 / 3).Diagnostic parity: every kubectl-log target in the prior
dump_*_failuresis reproduced in a chainsawcatch:script block, including the StatefulSet/Deployment fallback forargocd-application-controller(Argo chart 9.x).What's deferred
Iterations during review
The migration required several CI cycles to converge on the correct chainsaw idioms. Each fix is its own commit for bisectability:
refactor(kwok): migrate argocd-* and flux-oci sync gate from bash to chainsaw— initial migration.fix(kwok): address CodeRabbit review on chainsaw sync gate— three pre-CI findings:--values key=value→--set(the former is YAML-only; the inline override was a silent no-op).Released+deployed+!Stalledterminal-fallback arm.apiVersioncorrected tosource.extensions.fluxcd.io/v1beta1(extension API).fix(ci): install chainsaw on KWOK matrix runners— first CI run surfaced "chainsaw not found on PATH". Addedinstall_chainsaw: 'true'+ version+sha256 inputs to the kwok-test composite action and threaded the values from the workflow.fix(kwok): skip chainsaw cleanup-delete on KWOK clusters— second run surfacedcleanup ERROR: context deadline exceeded. Chainsaw auto-creates a per-test temp namespace and deletes it during cleanup; on KWOK there's no kube-controller-manager to finalize thekubernetesnamespace finalizer, so the namespace stays in Terminating indefinitely.spec.skipDelete: trueresolves it (the tests create no resources of their own).fix(kwok): pull --set overrides through bindings; fix OCIRepository apiVersion— third run: argocd-oci failedactual resource not foundonargocd/aicr-stack(rootApp default) because chainsaw's--setpopulates$values, notbindings. Routed bindings through($values.<key> || '<default>'). Same run: flux-oci failedno matches for source.toolkit.fluxcd.io/v1beta2— Flux v2.x serves OCIRepository at/v1, not v1beta2.fix(kwok): scope HelmRelease + ArtifactGenerator asserts to flux-system— fourth run: HelmRelease assertion was implicitly scoped to chainsaw's per-test temp namespace (chainsaw-main-goblin/*) where no HRs exist. Added explicitmetadata.namespace: flux-system. Also flipped ArtifactGenerator fromassert:(which fails on zero-match) toerror:(which fires only when a non-Ready AG exists) so pure-Helm flux bundles don't false-negative.Testing
Final CI run on
5c00091a: all 70 KWOK matrix jobs green (helm × 14 + argocd-oci × 14 + argocd-helm-oci × 14 + flux-oci × 14 across aks/eks/gke-cos/kind/lke/oke at base/inference/training), plus the broader PR status (Qualification, ClamAV, grype, etc.) — 89 success, 0 failures, 12 skipped (Tier 2/3 lanes that only run on schedule).Local checks:
Risk Assessment
Rollout notes: no runtime impact. Affects only the KWOK CI matrix driver. After merge, open PRs rebasing onto main will see the new sync-gate path; if anything regresses, revert is one git command.
Checklist
make lint; chainsawlint test; shellcheck; yamllint)recipes/checks/*/health-check.yamlchainsaw idiom)git commit -S)