fix(kwok): require Succeeded operationState in argocd sync gate by yuanchen8911 · Pull Request #1062 · NVIDIA/aicr

yuanchen8911 · 2026-05-27T16:45:44Z

Summary

Tighten the argocd-* KWOK chainsaw sync gate so all four arms of the pass-state predicate require operationState.phase == 'Succeeded'. Removes a race in which the assertion could PASS within ~4-5 s of helm install — before Argo CD had completed its initial App-of-Apps sync and created the child Applications.

Also add tests/chainsaw/kwok/** to the KWOK Cluster Validation workflow path filters so the scenarios actually re-run on edit (per Codex review of the first push).

Motivation / Context

The chainsaw assertion at tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml says "every Application in the argocd namespace must satisfy the pass predicate." Right after helm install of an argocd-helm-oci bundle, the only Application present is the root App-of-Apps. The controller can briefly set the root's sync.status=Synced (manifest matches cluster) before running the sync operation that creates the children. In that window the assertion sees a single matching Application that satisfies arm 1 or arm 2 and returns PASS — children haven't been created yet, so they aren't in the match set.

Pre-#1050, the bash wait_for_argocd_sync loop avoided the race accidentally: its polling overhead took long enough that children materialized between iterations and joined the pass-check by the time it re-listed.

Evidence the race is real (issue #1061 has the full table):

Run	SHA	Outcome
`26523455927` (initial)	`8b5ef744`	❌
`26523455927` (re-run)	`8b5ef744` (same)	✅

Same commit, different result — textbook race-condition signature. Per-job failure rate is low (~1-3%), but with 68 parallel Tier 1 jobs the workflow surfaces it on a meaningful fraction of runs, costing CI minutes and re-runs.

Fixes: #1061
Related: #1050 (the chainsaw migration that introduced the race), #962 (the parent tracking issue)

Type of Change

Bug fix (non-breaking change that fixes an issue)
Build/CI/tooling (.github/workflows/kwok-recipes.yaml path filter expansion)

Component(s) Affected

Other: KWOK CI scenarios (tests/chainsaw/kwok/argocd-sync/) and workflow path filters

Implementation Notes

Predicate change. The JMESPath OR previously required operationState.phase == 'Succeeded' only on arms 3 and 4 (the relaxations for operator mutation and health-controller divergence). Arms 1 and 2 (Synced + Healthy / Synced + Progressing) could match purely on sync.status + health.status, which is what the App-of-Apps root briefly satisfies before its sync operation runs. Moving the operationState.phase guard out as a conjunct that gates all four arms ensures an Application only passes after Argo CD has completed at least one sync operation.

# Before — only arms 3, 4 required Succeeded
((Synced && Healthy) || (Synced && Progressing) || (OutOfSync && Healthy && Succeeded) || (Synced && Degraded && Succeeded)): true

# After — all four arms require Succeeded
(Succeeded && ((Synced && Healthy) || (Synced && Progressing) || (OutOfSync && Healthy) || (Synced && Degraded))): true

Why this isn't too strict. Within the existing 5-minute assertion timeout, Argo CD's automated sync completes well within the budget (typical KWOK convergence: 30-90 s including child sync). The four sync/health combinations the old bash gate accepted remain valid; only the "pre-sync transient" window is now rejected.

Workflow path filter. .github/workflows/kwok-recipes.yaml previously gated KWOK on recipes/**, kwok/**, the workflow file, and .github/actions/kwok-test/**. PR #1050 introduced tests/chainsaw/kwok/ as the new home for the migrated argocd / flux chainsaw scenarios, but didn't expand the filter — so edits to the scenarios silently skipped KWOK CI. The first push of this PR exhibited the symptom (0 KWOK jobs scheduled, caught by Codex review). Added tests/chainsaw/kwok/** to both push.paths and pull_request.paths.

Flux sibling not affected. tests/chainsaw/kwok/flux-sync/chainsaw-test.yaml asserts Ready=True conditions on HelmRelease and Kustomization, which already require their controllers to have completed reconciliation. No parallel change needed.

Comment block updated to explain why all four arms now share the operationState.phase requirement.

Testing

yamllint tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml — clean
chainsaw lint test -f tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml — "The document is valid"
yamllint .github/workflows/kwok-recipes.yaml — clean
KWOK validation in CI is the canonical proof: with the path-filter fix in place, this PR now triggers the full 68-job Tier 1 matrix. If the original race survives, the same 5 pods total, 3 unscheduled signature reappears; if the fix lands, the assertion converges normally or times out at 5 m with a clear "child apps not Succeeded" diagnostic.
make qualify was not run end-to-end: change is test-infrastructure-only (no Go, no recipes, no docs-sidebar impact), and the relevant gates (yamllint, chainsaw lint) pass.

Risk Assessment

Low — Single-file YAML predicate change + workflow path-filter expansion. The assertion still accepts all the same sync/health combinations as before, just gated behind a real terminal state. Worst case if the predicate is too strict: assertion times out at 5 m on a real-but-slow convergence rather than racing past it. Easy revert.

Rollout notes: No flag, no opt-in. The next KWOK run after merge exercises the new predicate. Existing passing PRs are unaffected. The path-filter expansion is purely additive.

Follow-up

Codex review noted that the old bash gate had an explicit child-Application count guard, while this fix uses root-sync success as the materialization proxy. The proxy is sufficient for the reported flake (root cannot reach Succeeded without having created children), but a follow-up "wait for root.status.resources non-empty" step would restore full pre-#1050 parity for cases where the count itself is the invariant. Tracked for a follow-up if reviewers want belt-and-suspenders.

Checklist

Tests pass locally (make test with -race) — N/A (no Go change)
Linter passes (make lint) — yamllint clean on both touched files, chainsaw lint valid
I did not skip/disable tests to make CI green
I added/updated tests for new functionality — N/A (the change IS in the test infrastructure)
I updated docs if user-facing behavior changed — N/A (no user-facing surface)
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

coderabbitai · 2026-05-27T16:48:30Z

📝 Walkthrough

Walkthrough

This PR gates all four Chainsaw "pass" arms for Argo CD Applications on operationState.phase == 'Succeeded' by moving that condition to an outer conjunction in the JMESPath predicate, and updates inline comments to match. It also expands the .github/workflows/kwok-recipes.yaml paths filters so changes under tests/chainsaw/kwok/** will trigger the KWOK Cluster Validation workflow on push and pull_request.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

NVIDIA/aicr#1050: Introduced the Chainsaw-based kwok-argocd-sync gate that this PR adjusts.

Suggested labels

area/ci

Suggested reviewers

lalitadithya

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: requiring Succeeded operationState in the argocd sync gate predicate to fix a race condition.
Linked Issues check	✅ Passed	The PR implements the minimum-invasive fix proposed in issue `#1061` by requiring operationState.phase == 'Succeeded' for all four arms of the pass predicate, which directly addresses the race condition.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to fixing the race: the chainsaw-test.yaml predicate fix and the workflow paths filter update are both necessary for the fix.
Description check	✅ Passed	The pull request description clearly relates to the changeset: it explains the JMESPath predicate tightening in chainsaw-test.yaml and the workflow path-filter expansion, with detailed motivation (race condition evidence), implementation details, and testing performed.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…IA#1061) The chainsaw assertion at `assert-all-applications-pass` could pass within ~4-5 seconds of helm install, before Argo CD had completed the initial App-of-Apps sync that creates the child Applications. With only the root Application present in the argocd namespace and the predicate satisfied for that root alone, the assertion returned PASS while the cluster was still half-deployed. The subsequent pod-scheduling check in `validate-scheduling.sh` then tripped on KIND-default kube-system pods (coredns, local-path-provisioner) that the AICR bundle's pods would otherwise have rendered insignificant. Same-SHA evidence: run 26523455927 on `8b5ef744` failed on first attempt and passed on retry of identical content — a textbook race-condition signature. The pre-NVIDIA#1050 bash gate avoided the race because its polling loop took long enough that children materialized before the next re-list. Tighten the JMESPath predicate so all four arms require `operationState.phase == 'Succeeded'`. An Application can only count as passing after Argo CD has completed at least one sync operation; for the App-of-Apps that's the gate ensuring the children exist and join the selector's match set on subsequent polls. The four sync/health combinations remain (canonical pass + the three KWOK/operator-mutation relaxations the old bash gate accepted), so legitimate Application states still resolve to PASS once Argo CD has actually run. The sibling flux-sync chainsaw test does not need a parallel change: it asserts `Ready=True` conditions on HelmRelease and Kustomization, which already require their controllers to have completed reconciliation. Also add `tests/chainsaw/kwok/**` to the `KWOK Cluster Validation` workflow path filters. PR NVIDIA#1050 placed the new chainsaw scenarios under `tests/chainsaw/kwok/`, but the workflow filters were not updated to match — so changes to the scenarios silently skipped KWOK CI. Now any edit to the chainsaw KWOK scenarios re-runs the matrix automatically. Fixes NVIDIA#1061 Related NVIDIA#1050 (the chainsaw migration that introduced the race)

mchmarny

Tight, focused fix. The predicate change is the right shape: factoring operationState.phase == 'Succeeded' out as a top-level conjunct makes all four arms require a completed sync operation, which rejects the pre-sync window where the App-of-Apps root can briefly satisfy Synced before children are materialized. Boolean check holds (A ∨ B ∨ (C∧S) ∨ (D∧S) → S ∧ (A ∨ B ∨ C ∨ D) is strictly stricter), and JMESPath fails closed on missing operationState so there's no surprise behavior on the cold path.

Path-filter addition is the other half of the fix — caught by Codex review on the first push, now correctly added to both push and pull_request paths so future chainsaw scenario edits actually re-trigger KWOK CI.

Comment block is also a notable upgrade — explains the race, references #1061 and #1050, and keeps the arms table readable after pulling the per-arm op.phase notes into the unified conjunct.

CI green (101 checks). "blocked" mergeable state is required-reviewer, not a CI signal. The status.resources non-empty follow-up the author flagged is a sensible belt-and-suspenders that doesn't block this PR.

verify_pods() in kwok/scripts/validate-scheduling.sh issued five separate `kubectl get pods` calls — total / pending / failed / running / unscheduled — over several seconds (the unscheduled query alone takes hundreds of ms due to `-o json | jq`). Between those calls, operator controllers (gpu-operator, dynamo, dra-driver, cert-manager, prometheus, …) kept reconciling and creating new pods, so the snapshots disagreed with each other. Symptom: arithmetic that cannot come from any single Kubernetes state: Pod status: 20 total, 20 running, 0 pending (1 unscheduled), 0 failed ERROR: Unscheduled pods: (kubectl prints "No resources found") ERROR: Events for unscheduled pods: kube-system coredns-… FailedScheduling untolerated taint(s) ... `unscheduled` is a subset of `pending`, so pending=0 + unscheduled=1 can only happen across two different state samples. By the time the diagnostic re-query ran, the bundle pod had already been bound by the scheduler, so the listing came up empty; the unfiltered event dump then surfaced kube-system FailedScheduling noise from the intentional `nfd-excluded=true:NoSchedule` taint on the kind control-plane node, misleading at least two prior root-cause analyses (issue NVIDIA#1090 has the full evidence trail across jobs 78343519974 and 78341110233). This change makes verify_pods deterministic by: 1. Taking ONE `kubectl get pods --all-namespaces -o json` snapshot at the top of the function and deriving all five counts plus the pod distribution from it via jq. 2. Rendering the "Unscheduled pods:" diagnostic from the same snapshot — never prints "No resources found" when the count is positive, because they're computed from one source of truth. 3. Filtering the FailedScheduling event dump with the same namespace exclusions used by the count (kube-system, kube-node-lease, kube-public, local-path-storage, kwok-system, argocd, flux-system, aicr-registry). Persistent harness noise from the intentional nfd-excluded taint no longer surfaces as if it were the cause. 4. (Drive-by, same shape) Rendering the failed-pod diagnostic from the same snapshot for consistency. 5. Failing closed when `kubectl get pods` itself errors. The pre- existing `2>/dev/null | wc -l` chain silently produced total_pods=0 in that case and flowed through the "bundle may be empty" PASS branch — so an apiserver outage, auth/RBAC failure, or bad kube context would be reported as a scheduling pass. The rewrite captures stderr so the failure mode is visible in the job log, then returns 1. Race classes NVIDIA#1090 explicitly does NOT solve: - The Argo CD sync-gate race in NVIDIA#1061 (fixed separately by NVIDIA#1062). - The scheduler genuinely not having bound a pod yet at the moment the snapshot was taken. The single-snapshot view will still flag that as a failure, which is correct. An optional re-snapshot-once retry was considered and deferred to keep this change minimal — see NVIDIA#1090 acceptance criteria. Smoke-tested locally with a synthetic snapshot covering excluded namespaces, Job-owned helm hooks, and Failed pods. `make qualify` clean (22/22 chainsaw, no vulns, lint 0 issues). Fixes: NVIDIA#1090

NVIDIA#1090) verify_pods() in kwok/scripts/validate-scheduling.sh had two flake classes that produced false failures on the KWOK Tier-1 lanes: 1. TOCTOU race across multiple `kubectl get pods` calls. The previous implementation issued five separate queries — total / pending / failed / running / unscheduled — over several seconds (the unscheduled query alone takes hundreds of ms due to `-o json | jq`). Between calls, operator controllers (gpu-operator, dynamo, dra- driver, cert-manager, prometheus, …) kept reconciling and creating new pods, so the snapshots disagreed with each other. Symptom: Pod status: 20 total, 20 running, 0 pending (1 unscheduled), 0 failed ERROR: Unscheduled pods: (kubectl returned "No resources found") ERROR: Events for unscheduled pods: kube-system coredns-… FailedScheduling untolerated taint(s) `unscheduled` is a subset of `pending`, so pending=0 + unscheduled=1 is impossible in a single state sample. The transient pod had been bound by the time the diagnostic re-queried, and the unfiltered event dump then surfaced kube-system noise from the intentional `nfd-excluded=true:NoSchedule` taint on the kind control-plane node — misleading at least two prior root-cause analyses. 2. Post-sync-gate scheduler latency. After Argo CD / Flux reports Synced+Healthy, controllers continue creating pods that have not yet been bound. A single coherent snapshot taken in that window correctly reports the pre-bind transient as a failure (verified on the eks-inference/argocd-oci and oke-training/argocd-oci lanes — cert-manager-786cc54854-kngvm and skyhook-operator-controller- manager-596b99b857-8shfm respectively, both with zero FailedScheduling events, meaning the scheduler simply had not gotten to them yet). This change makes verify_pods deterministic by: 1. Taking one `kubectl get pods --all-namespaces -o json` snapshot per attempt and deriving all five counts plus the pod distribution from it via jq. Within an attempt, the math is guaranteed coherent — the impossible-pending/unscheduled signature cannot reappear. 2. Re-snapshotting on a bounded poll loop (default 60s deadline at 5s intervals, both tunable via KWOK_VERIFY_PODS_DEADLINE / KWOK_VERIFY_PODS_INTERVAL) when unscheduled_pods > 0. The loop exits early as soon as a snapshot shows unscheduled == 0 (or any Failed pod, which is terminal). Only declares failure when the deadline is reached AND unscheduled is still positive. 3. Rendering the "Unscheduled pods:" diagnostic and pod distribution from the same (final) snapshot — never prints "No resources found" when the count is positive. 4. Filtering the FailedScheduling event dump with the same namespace exclusions used by the count. Persistent harness noise from the intentional nfd-excluded taint no longer surfaces as if it were the cause. 5. Failing closed when `kubectl get pods` itself errors. The pre- existing `2>/dev/null | wc -l` chain silently produced total_pods=0 on apiserver outage / auth failure / bad context and flowed through the "bundle may be empty" PASS branch. The rewrite captures kubectl's stderr so the failure mode appears in the job log, then returns 1. 6. (Drive-by, same shape) Rendering the failed-pod diagnostic from the same snapshot for consistency. Race classes NVIDIA#1090 explicitly does NOT solve: - The Argo CD sync-gate race in NVIDIA#1061 (fixed separately by NVIDIA#1062). - A genuinely unschedulable pod (missing toleration, no matching nodeSelector, insufficient resources). Those persist through the full deadline and are reported with the same diagnostic as before — now correctly filtered for namespace noise. Locally smoke-tested with a synthetic snapshot covering excluded namespaces, Job-owned helm hooks, and Failed pods. `make qualify` clean (22/22 chainsaw, no vulns, lint 0 issues). Fixes: NVIDIA#1090

yuanchen8911 requested a review from a team as a code owner May 27, 2026 16:45

yuanchen8911 added bug area/ci area/tests labels May 27, 2026

github-actions Bot added size/S and removed area/ci labels May 27, 2026

yuanchen8911 force-pushed the fix/kwok-argocd-sync-gate-race-1061 branch from 93eb4d6 to 569d514 Compare May 27, 2026 16:52

yuanchen8911 requested a review from a team as a code owner May 27, 2026 16:52

github-actions Bot added the area/ci label May 27, 2026

yuanchen8911 requested a review from mchmarny May 27, 2026 16:55

mchmarny approved these changes May 27, 2026

View reviewed changes

mchmarny merged commit 5f72700 into NVIDIA:main May 27, 2026
102 checks passed

This was referenced May 28, 2026

bug(kwok): verify_pods samples pod state across 5 kubectl calls — TOCTOU race produces impossible math + misleading diagnostic #1090

Closed

fix(kwok): single-snapshot verify_pods + settle retry (#1090) #1092

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(kwok): require Succeeded operationState in argocd sync gate#1062

fix(kwok): require Succeeded operationState in argocd sync gate#1062
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/kwok-argocd-sync-gate-race-1061

yuanchen8911 commented May 27, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 27, 2026 •

edited

Loading

Walkthrough

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

mchmarny left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

yuanchen8911 commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Follow-up

Checklist

Uh oh!

coderabbitai Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Uh oh!

mchmarny left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yuanchen8911 commented May 27, 2026 •

edited

Loading

coderabbitai Bot commented May 27, 2026 •

edited

Loading