fix(kwok): require Succeeded operationState in argocd sync gate#1062
Conversation
📝 WalkthroughWalkthroughThis PR gates all four Chainsaw "pass" arms for Argo CD Applications on Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
…IA#1061) The chainsaw assertion at `assert-all-applications-pass` could pass within ~4-5 seconds of helm install, before Argo CD had completed the initial App-of-Apps sync that creates the child Applications. With only the root Application present in the argocd namespace and the predicate satisfied for that root alone, the assertion returned PASS while the cluster was still half-deployed. The subsequent pod-scheduling check in `validate-scheduling.sh` then tripped on KIND-default kube-system pods (coredns, local-path-provisioner) that the AICR bundle's pods would otherwise have rendered insignificant. Same-SHA evidence: run 26523455927 on `8b5ef744` failed on first attempt and passed on retry of identical content — a textbook race-condition signature. The pre-NVIDIA#1050 bash gate avoided the race because its polling loop took long enough that children materialized before the next re-list. Tighten the JMESPath predicate so all four arms require `operationState.phase == 'Succeeded'`. An Application can only count as passing after Argo CD has completed at least one sync operation; for the App-of-Apps that's the gate ensuring the children exist and join the selector's match set on subsequent polls. The four sync/health combinations remain (canonical pass + the three KWOK/operator-mutation relaxations the old bash gate accepted), so legitimate Application states still resolve to PASS once Argo CD has actually run. The sibling flux-sync chainsaw test does not need a parallel change: it asserts `Ready=True` conditions on HelmRelease and Kustomization, which already require their controllers to have completed reconciliation. Also add `tests/chainsaw/kwok/**` to the `KWOK Cluster Validation` workflow path filters. PR NVIDIA#1050 placed the new chainsaw scenarios under `tests/chainsaw/kwok/`, but the workflow filters were not updated to match — so changes to the scenarios silently skipped KWOK CI. Now any edit to the chainsaw KWOK scenarios re-runs the matrix automatically. Fixes NVIDIA#1061 Related NVIDIA#1050 (the chainsaw migration that introduced the race)
93eb4d6 to
569d514
Compare
mchmarny
left a comment
There was a problem hiding this comment.
Tight, focused fix. The predicate change is the right shape: factoring operationState.phase == 'Succeeded' out as a top-level conjunct makes all four arms require a completed sync operation, which rejects the pre-sync window where the App-of-Apps root can briefly satisfy Synced before children are materialized. Boolean check holds (A ∨ B ∨ (C∧S) ∨ (D∧S) → S ∧ (A ∨ B ∨ C ∨ D) is strictly stricter), and JMESPath fails closed on missing operationState so there's no surprise behavior on the cold path.
Path-filter addition is the other half of the fix — caught by Codex review on the first push, now correctly added to both push and pull_request paths so future chainsaw scenario edits actually re-trigger KWOK CI.
Comment block is also a notable upgrade — explains the race, references #1061 and #1050, and keeps the arms table readable after pulling the per-arm op.phase notes into the unified conjunct.
CI green (101 checks). "blocked" mergeable state is required-reviewer, not a CI signal. The status.resources non-empty follow-up the author flagged is a sensible belt-and-suspenders that doesn't block this PR.
verify_pods() in kwok/scripts/validate-scheduling.sh issued five
separate `kubectl get pods` calls — total / pending / failed / running /
unscheduled — over several seconds (the unscheduled query alone takes
hundreds of ms due to `-o json | jq`). Between those calls, operator
controllers (gpu-operator, dynamo, dra-driver, cert-manager,
prometheus, …) kept reconciling and creating new pods, so the
snapshots disagreed with each other.
Symptom: arithmetic that cannot come from any single Kubernetes state:
Pod status: 20 total, 20 running, 0 pending (1 unscheduled), 0 failed
ERROR: Unscheduled pods:
(kubectl prints "No resources found")
ERROR: Events for unscheduled pods:
kube-system coredns-… FailedScheduling untolerated taint(s)
...
`unscheduled` is a subset of `pending`, so pending=0 + unscheduled=1
can only happen across two different state samples. By the time the
diagnostic re-query ran, the bundle pod had already been bound by the
scheduler, so the listing came up empty; the unfiltered event dump
then surfaced kube-system FailedScheduling noise from the intentional
`nfd-excluded=true:NoSchedule` taint on the kind control-plane node,
misleading at least two prior root-cause analyses (issue NVIDIA#1090 has
the full evidence trail across jobs 78343519974 and 78341110233).
This change makes verify_pods deterministic by:
1. Taking ONE `kubectl get pods --all-namespaces -o json` snapshot
at the top of the function and deriving all five counts plus the
pod distribution from it via jq.
2. Rendering the "Unscheduled pods:" diagnostic from the same
snapshot — never prints "No resources found" when the count is
positive, because they're computed from one source of truth.
3. Filtering the FailedScheduling event dump with the same namespace
exclusions used by the count (kube-system, kube-node-lease,
kube-public, local-path-storage, kwok-system, argocd, flux-system,
aicr-registry). Persistent harness noise from the intentional
nfd-excluded taint no longer surfaces as if it were the cause.
4. (Drive-by, same shape) Rendering the failed-pod diagnostic from
the same snapshot for consistency.
5. Failing closed when `kubectl get pods` itself errors. The pre-
existing `2>/dev/null | wc -l` chain silently produced
total_pods=0 in that case and flowed through the "bundle may be
empty" PASS branch — so an apiserver outage, auth/RBAC failure,
or bad kube context would be reported as a scheduling pass. The
rewrite captures stderr so the failure mode is visible in the
job log, then returns 1.
Race classes NVIDIA#1090 explicitly does NOT solve:
- The Argo CD sync-gate race in NVIDIA#1061 (fixed separately by NVIDIA#1062).
- The scheduler genuinely not having bound a pod yet at the moment
the snapshot was taken. The single-snapshot view will still flag
that as a failure, which is correct. An optional re-snapshot-once
retry was considered and deferred to keep this change minimal —
see NVIDIA#1090 acceptance criteria.
Smoke-tested locally with a synthetic snapshot covering excluded
namespaces, Job-owned helm hooks, and Failed pods. `make qualify`
clean (22/22 chainsaw, no vulns, lint 0 issues).
Fixes: NVIDIA#1090
NVIDIA#1090) verify_pods() in kwok/scripts/validate-scheduling.sh had two flake classes that produced false failures on the KWOK Tier-1 lanes: 1. TOCTOU race across multiple `kubectl get pods` calls. The previous implementation issued five separate queries — total / pending / failed / running / unscheduled — over several seconds (the unscheduled query alone takes hundreds of ms due to `-o json | jq`). Between calls, operator controllers (gpu-operator, dynamo, dra- driver, cert-manager, prometheus, …) kept reconciling and creating new pods, so the snapshots disagreed with each other. Symptom: Pod status: 20 total, 20 running, 0 pending (1 unscheduled), 0 failed ERROR: Unscheduled pods: (kubectl returned "No resources found") ERROR: Events for unscheduled pods: kube-system coredns-… FailedScheduling untolerated taint(s) `unscheduled` is a subset of `pending`, so pending=0 + unscheduled=1 is impossible in a single state sample. The transient pod had been bound by the time the diagnostic re-queried, and the unfiltered event dump then surfaced kube-system noise from the intentional `nfd-excluded=true:NoSchedule` taint on the kind control-plane node — misleading at least two prior root-cause analyses. 2. Post-sync-gate scheduler latency. After Argo CD / Flux reports Synced+Healthy, controllers continue creating pods that have not yet been bound. A single coherent snapshot taken in that window correctly reports the pre-bind transient as a failure (verified on the eks-inference/argocd-oci and oke-training/argocd-oci lanes — cert-manager-786cc54854-kngvm and skyhook-operator-controller- manager-596b99b857-8shfm respectively, both with zero FailedScheduling events, meaning the scheduler simply had not gotten to them yet). This change makes verify_pods deterministic by: 1. Taking one `kubectl get pods --all-namespaces -o json` snapshot per attempt and deriving all five counts plus the pod distribution from it via jq. Within an attempt, the math is guaranteed coherent — the impossible-pending/unscheduled signature cannot reappear. 2. Re-snapshotting on a bounded poll loop (default 60s deadline at 5s intervals, both tunable via KWOK_VERIFY_PODS_DEADLINE / KWOK_VERIFY_PODS_INTERVAL) when unscheduled_pods > 0. The loop exits early as soon as a snapshot shows unscheduled == 0 (or any Failed pod, which is terminal). Only declares failure when the deadline is reached AND unscheduled is still positive. 3. Rendering the "Unscheduled pods:" diagnostic and pod distribution from the same (final) snapshot — never prints "No resources found" when the count is positive. 4. Filtering the FailedScheduling event dump with the same namespace exclusions used by the count. Persistent harness noise from the intentional nfd-excluded taint no longer surfaces as if it were the cause. 5. Failing closed when `kubectl get pods` itself errors. The pre- existing `2>/dev/null | wc -l` chain silently produced total_pods=0 on apiserver outage / auth failure / bad context and flowed through the "bundle may be empty" PASS branch. The rewrite captures kubectl's stderr so the failure mode appears in the job log, then returns 1. 6. (Drive-by, same shape) Rendering the failed-pod diagnostic from the same snapshot for consistency. Race classes NVIDIA#1090 explicitly does NOT solve: - The Argo CD sync-gate race in NVIDIA#1061 (fixed separately by NVIDIA#1062). - A genuinely unschedulable pod (missing toleration, no matching nodeSelector, insufficient resources). Those persist through the full deadline and are reported with the same diagnostic as before — now correctly filtered for namespace noise. Locally smoke-tested with a synthetic snapshot covering excluded namespaces, Job-owned helm hooks, and Failed pods. `make qualify` clean (22/22 chainsaw, no vulns, lint 0 issues). Fixes: NVIDIA#1090
NVIDIA#1090) verify_pods() in kwok/scripts/validate-scheduling.sh had two flake classes that produced false failures on the KWOK Tier-1 lanes: 1. TOCTOU race across multiple `kubectl get pods` calls. The previous implementation issued five separate queries — total / pending / failed / running / unscheduled — over several seconds (the unscheduled query alone takes hundreds of ms due to `-o json | jq`). Between calls, operator controllers (gpu-operator, dynamo, dra- driver, cert-manager, prometheus, …) kept reconciling and creating new pods, so the snapshots disagreed with each other. Symptom: Pod status: 20 total, 20 running, 0 pending (1 unscheduled), 0 failed ERROR: Unscheduled pods: (kubectl returned "No resources found") ERROR: Events for unscheduled pods: kube-system coredns-… FailedScheduling untolerated taint(s) `unscheduled` is a subset of `pending`, so pending=0 + unscheduled=1 is impossible in a single state sample. The transient pod had been bound by the time the diagnostic re-queried, and the unfiltered event dump then surfaced kube-system noise from the intentional `nfd-excluded=true:NoSchedule` taint on the kind control-plane node — misleading at least two prior root-cause analyses. 2. Post-sync-gate scheduler latency. After Argo CD / Flux reports Synced+Healthy, controllers continue creating pods that have not yet been bound. A single coherent snapshot taken in that window correctly reports the pre-bind transient as a failure (verified on the eks-inference/argocd-oci and oke-training/argocd-oci lanes — cert-manager-786cc54854-kngvm and skyhook-operator-controller- manager-596b99b857-8shfm respectively, both with zero FailedScheduling events, meaning the scheduler simply had not gotten to them yet). This change makes verify_pods deterministic by: 1. Taking one `kubectl get pods --all-namespaces -o json` snapshot per attempt and deriving all five counts plus the pod distribution from it via jq. Within an attempt, the math is guaranteed coherent — the impossible-pending/unscheduled signature cannot reappear. 2. Re-snapshotting on a bounded poll loop (default 60s deadline at 5s intervals, both tunable via KWOK_VERIFY_PODS_DEADLINE / KWOK_VERIFY_PODS_INTERVAL) when unscheduled_pods > 0. The loop exits early as soon as a snapshot shows unscheduled == 0 (or any Failed pod, which is terminal). Only declares failure when the deadline is reached AND unscheduled is still positive. 3. Rendering the "Unscheduled pods:" diagnostic and pod distribution from the same (final) snapshot — never prints "No resources found" when the count is positive. 4. Filtering the FailedScheduling event dump with the same namespace exclusions used by the count. Persistent harness noise from the intentional nfd-excluded taint no longer surfaces as if it were the cause. 5. Failing closed when `kubectl get pods` itself errors. The pre- existing `2>/dev/null | wc -l` chain silently produced total_pods=0 on apiserver outage / auth failure / bad context and flowed through the "bundle may be empty" PASS branch. The rewrite captures kubectl's stderr so the failure mode appears in the job log, then returns 1. 6. (Drive-by, same shape) Rendering the failed-pod diagnostic from the same snapshot for consistency. Race classes NVIDIA#1090 explicitly does NOT solve: - The Argo CD sync-gate race in NVIDIA#1061 (fixed separately by NVIDIA#1062). - A genuinely unschedulable pod (missing toleration, no matching nodeSelector, insufficient resources). Those persist through the full deadline and are reported with the same diagnostic as before — now correctly filtered for namespace noise. Locally smoke-tested with a synthetic snapshot covering excluded namespaces, Job-owned helm hooks, and Failed pods. `make qualify` clean (22/22 chainsaw, no vulns, lint 0 issues). Fixes: NVIDIA#1090
NVIDIA#1090) verify_pods() in kwok/scripts/validate-scheduling.sh had two flake classes that produced false failures on the KWOK Tier-1 lanes: 1. TOCTOU race across multiple `kubectl get pods` calls. The previous implementation issued five separate queries — total / pending / failed / running / unscheduled — over several seconds (the unscheduled query alone takes hundreds of ms due to `-o json | jq`). Between calls, operator controllers (gpu-operator, dynamo, dra- driver, cert-manager, prometheus, …) kept reconciling and creating new pods, so the snapshots disagreed with each other. Symptom: Pod status: 20 total, 20 running, 0 pending (1 unscheduled), 0 failed ERROR: Unscheduled pods: (kubectl returned "No resources found") ERROR: Events for unscheduled pods: kube-system coredns-… FailedScheduling untolerated taint(s) `unscheduled` is a subset of `pending`, so pending=0 + unscheduled=1 is impossible in a single state sample. The transient pod had been bound by the time the diagnostic re-queried, and the unfiltered event dump then surfaced kube-system noise from the intentional `nfd-excluded=true:NoSchedule` taint on the kind control-plane node — misleading at least two prior root-cause analyses. 2. Post-sync-gate scheduler latency. After Argo CD / Flux reports Synced+Healthy, controllers continue creating pods that have not yet been bound. A single coherent snapshot taken in that window correctly reports the pre-bind transient as a failure (verified on the eks-inference/argocd-oci and oke-training/argocd-oci lanes — cert-manager-786cc54854-kngvm and skyhook-operator-controller- manager-596b99b857-8shfm respectively, both with zero FailedScheduling events, meaning the scheduler simply had not gotten to them yet). This change makes verify_pods deterministic by: 1. Taking one `kubectl get pods --all-namespaces -o json` snapshot per attempt and deriving all five counts plus the pod distribution from it via jq. Within an attempt, the math is guaranteed coherent — the impossible-pending/unscheduled signature cannot reappear. 2. Re-snapshotting on a bounded poll loop (default 60s deadline at 5s intervals, both tunable via KWOK_VERIFY_PODS_DEADLINE / KWOK_VERIFY_PODS_INTERVAL) when unscheduled_pods > 0. The loop exits early as soon as a snapshot shows unscheduled == 0 (or any Failed pod, which is terminal). Only declares failure when the deadline is reached AND unscheduled is still positive. 3. Rendering the "Unscheduled pods:" diagnostic and pod distribution from the same (final) snapshot — never prints "No resources found" when the count is positive. 4. Filtering the FailedScheduling event dump with the same namespace exclusions used by the count. Persistent harness noise from the intentional nfd-excluded taint no longer surfaces as if it were the cause. 5. Failing closed when `kubectl get pods` itself errors. The pre- existing `2>/dev/null | wc -l` chain silently produced total_pods=0 on apiserver outage / auth failure / bad context and flowed through the "bundle may be empty" PASS branch. The rewrite captures kubectl's stderr so the failure mode appears in the job log, then returns 1. 6. (Drive-by, same shape) Rendering the failed-pod diagnostic from the same snapshot for consistency. Race classes NVIDIA#1090 explicitly does NOT solve: - The Argo CD sync-gate race in NVIDIA#1061 (fixed separately by NVIDIA#1062). - A genuinely unschedulable pod (missing toleration, no matching nodeSelector, insufficient resources). Those persist through the full deadline and are reported with the same diagnostic as before — now correctly filtered for namespace noise. Locally smoke-tested with a synthetic snapshot covering excluded namespaces, Job-owned helm hooks, and Failed pods. `make qualify` clean (22/22 chainsaw, no vulns, lint 0 issues). Fixes: NVIDIA#1090
Summary
Tighten the argocd-* KWOK chainsaw sync gate so all four arms of the pass-state predicate require
operationState.phase == 'Succeeded'. Removes a race in which the assertion could PASS within ~4-5 s ofhelm install— before Argo CD had completed its initial App-of-Apps sync and created the child Applications.Also add
tests/chainsaw/kwok/**to theKWOK Cluster Validationworkflow path filters so the scenarios actually re-run on edit (per Codex review of the first push).Motivation / Context
The chainsaw assertion at
tests/chainsaw/kwok/argocd-sync/chainsaw-test.yamlsays "everyApplicationin the argocd namespace must satisfy the pass predicate." Right afterhelm installof an argocd-helm-oci bundle, the only Application present is the root App-of-Apps. The controller can briefly set the root'ssync.status=Synced(manifest matches cluster) before running the sync operation that creates the children. In that window the assertion sees a single matching Application that satisfies arm 1 or arm 2 and returns PASS — children haven't been created yet, so they aren't in the match set.Pre-#1050, the bash
wait_for_argocd_syncloop avoided the race accidentally: its polling overhead took long enough that children materialized between iterations and joined the pass-check by the time it re-listed.Evidence the race is real (issue #1061 has the full table):
26523455927(initial)8b5ef74426523455927(re-run)8b5ef744(same)Same commit, different result — textbook race-condition signature. Per-job failure rate is low (~1-3%), but with 68 parallel Tier 1 jobs the workflow surfaces it on a meaningful fraction of runs, costing CI minutes and re-runs.
Fixes: #1061
Related: #1050 (the chainsaw migration that introduced the race), #962 (the parent tracking issue)
Type of Change
.github/workflows/kwok-recipes.yamlpath filter expansion)Component(s) Affected
tests/chainsaw/kwok/argocd-sync/) and workflow path filtersImplementation Notes
Predicate change. The JMESPath OR previously required
operationState.phase == 'Succeeded'only on arms 3 and 4 (the relaxations for operator mutation and health-controller divergence). Arms 1 and 2 (Synced + Healthy/Synced + Progressing) could match purely onsync.status+health.status, which is what the App-of-Apps root briefly satisfies before its sync operation runs. Moving theoperationState.phaseguard out as a conjunct that gates all four arms ensures an Application only passes after Argo CD has completed at least one sync operation.Why this isn't too strict. Within the existing 5-minute assertion timeout, Argo CD's automated sync completes well within the budget (typical KWOK convergence: 30-90 s including child sync). The four sync/health combinations the old bash gate accepted remain valid; only the "pre-sync transient" window is now rejected.
Workflow path filter.
.github/workflows/kwok-recipes.yamlpreviously gated KWOK onrecipes/**,kwok/**, the workflow file, and.github/actions/kwok-test/**. PR #1050 introducedtests/chainsaw/kwok/as the new home for the migrated argocd / flux chainsaw scenarios, but didn't expand the filter — so edits to the scenarios silently skipped KWOK CI. The first push of this PR exhibited the symptom (0 KWOK jobs scheduled, caught by Codex review). Addedtests/chainsaw/kwok/**to bothpush.pathsandpull_request.paths.Flux sibling not affected.
tests/chainsaw/kwok/flux-sync/chainsaw-test.yamlassertsReady=Trueconditions onHelmReleaseandKustomization, which already require their controllers to have completed reconciliation. No parallel change needed.Comment block updated to explain why all four arms now share the
operationState.phaserequirement.Testing
yamllint tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml— cleanchainsaw lint test -f tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml— "The document is valid"yamllint .github/workflows/kwok-recipes.yaml— clean5 pods total, 3 unscheduledsignature reappears; if the fix lands, the assertion converges normally or times out at 5 m with a clear "child apps not Succeeded" diagnostic.make qualifywas not run end-to-end: change is test-infrastructure-only (no Go, no recipes, no docs-sidebar impact), and the relevant gates (yamllint, chainsaw lint) pass.Risk Assessment
Rollout notes: No flag, no opt-in. The next KWOK run after merge exercises the new predicate. Existing passing PRs are unaffected. The path-filter expansion is purely additive.
Follow-up
Codex review noted that the old bash gate had an explicit child-Application count guard, while this fix uses root-sync success as the materialization proxy. The proxy is sufficient for the reported flake (root cannot reach
Succeededwithout having created children), but a follow-up "wait forroot.status.resourcesnon-empty" step would restore full pre-#1050 parity for cases where the count itself is the invariant. Tracked for a follow-up if reviewers want belt-and-suspenders.Checklist
make testwith-race) — N/A (no Go change)make lint) —yamllintclean on both touched files,chainsaw lintvalidgit commit -S)