Skip to content

fix(kwok): require Succeeded operationState in argocd sync gate#1062

Merged
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/kwok-argocd-sync-gate-race-1061
May 27, 2026
Merged

fix(kwok): require Succeeded operationState in argocd sync gate#1062
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/kwok-argocd-sync-gate-race-1061

Conversation

@yuanchen8911

@yuanchen8911 yuanchen8911 commented May 27, 2026

Copy link
Copy Markdown
Contributor

Summary

Tighten the argocd-* KWOK chainsaw sync gate so all four arms of the pass-state predicate require operationState.phase == 'Succeeded'. Removes a race in which the assertion could PASS within ~4-5 s of helm install — before Argo CD had completed its initial App-of-Apps sync and created the child Applications.

Also add tests/chainsaw/kwok/** to the KWOK Cluster Validation workflow path filters so the scenarios actually re-run on edit (per Codex review of the first push).

Motivation / Context

The chainsaw assertion at tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml says "every Application in the argocd namespace must satisfy the pass predicate." Right after helm install of an argocd-helm-oci bundle, the only Application present is the root App-of-Apps. The controller can briefly set the root's sync.status=Synced (manifest matches cluster) before running the sync operation that creates the children. In that window the assertion sees a single matching Application that satisfies arm 1 or arm 2 and returns PASS — children haven't been created yet, so they aren't in the match set.

Pre-#1050, the bash wait_for_argocd_sync loop avoided the race accidentally: its polling overhead took long enough that children materialized between iterations and joined the pass-check by the time it re-listed.

Evidence the race is real (issue #1061 has the full table):

Run SHA Outcome
26523455927 (initial) 8b5ef744
26523455927 (re-run) 8b5ef744 (same)

Same commit, different result — textbook race-condition signature. Per-job failure rate is low (~1-3%), but with 68 parallel Tier 1 jobs the workflow surfaces it on a meaningful fraction of runs, costing CI minutes and re-runs.

Fixes: #1061
Related: #1050 (the chainsaw migration that introduced the race), #962 (the parent tracking issue)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • Build/CI/tooling (.github/workflows/kwok-recipes.yaml path filter expansion)

Component(s) Affected

  • Other: KWOK CI scenarios (tests/chainsaw/kwok/argocd-sync/) and workflow path filters

Implementation Notes

Predicate change. The JMESPath OR previously required operationState.phase == 'Succeeded' only on arms 3 and 4 (the relaxations for operator mutation and health-controller divergence). Arms 1 and 2 (Synced + Healthy / Synced + Progressing) could match purely on sync.status + health.status, which is what the App-of-Apps root briefly satisfies before its sync operation runs. Moving the operationState.phase guard out as a conjunct that gates all four arms ensures an Application only passes after Argo CD has completed at least one sync operation.

# Before — only arms 3, 4 required Succeeded
((Synced && Healthy) || (Synced && Progressing) || (OutOfSync && Healthy && Succeeded) || (Synced && Degraded && Succeeded)): true

# After — all four arms require Succeeded
(Succeeded && ((Synced && Healthy) || (Synced && Progressing) || (OutOfSync && Healthy) || (Synced && Degraded))): true

Why this isn't too strict. Within the existing 5-minute assertion timeout, Argo CD's automated sync completes well within the budget (typical KWOK convergence: 30-90 s including child sync). The four sync/health combinations the old bash gate accepted remain valid; only the "pre-sync transient" window is now rejected.

Workflow path filter. .github/workflows/kwok-recipes.yaml previously gated KWOK on recipes/**, kwok/**, the workflow file, and .github/actions/kwok-test/**. PR #1050 introduced tests/chainsaw/kwok/ as the new home for the migrated argocd / flux chainsaw scenarios, but didn't expand the filter — so edits to the scenarios silently skipped KWOK CI. The first push of this PR exhibited the symptom (0 KWOK jobs scheduled, caught by Codex review). Added tests/chainsaw/kwok/** to both push.paths and pull_request.paths.

Flux sibling not affected. tests/chainsaw/kwok/flux-sync/chainsaw-test.yaml asserts Ready=True conditions on HelmRelease and Kustomization, which already require their controllers to have completed reconciliation. No parallel change needed.

Comment block updated to explain why all four arms now share the operationState.phase requirement.

Testing

  • yamllint tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml — clean
  • chainsaw lint test -f tests/chainsaw/kwok/argocd-sync/chainsaw-test.yaml — "The document is valid"
  • yamllint .github/workflows/kwok-recipes.yaml — clean
  • KWOK validation in CI is the canonical proof: with the path-filter fix in place, this PR now triggers the full 68-job Tier 1 matrix. If the original race survives, the same 5 pods total, 3 unscheduled signature reappears; if the fix lands, the assertion converges normally or times out at 5 m with a clear "child apps not Succeeded" diagnostic.
  • make qualify was not run end-to-end: change is test-infrastructure-only (no Go, no recipes, no docs-sidebar impact), and the relevant gates (yamllint, chainsaw lint) pass.

Risk Assessment

  • Low — Single-file YAML predicate change + workflow path-filter expansion. The assertion still accepts all the same sync/health combinations as before, just gated behind a real terminal state. Worst case if the predicate is too strict: assertion times out at 5 m on a real-but-slow convergence rather than racing past it. Easy revert.

Rollout notes: No flag, no opt-in. The next KWOK run after merge exercises the new predicate. Existing passing PRs are unaffected. The path-filter expansion is purely additive.

Follow-up

Codex review noted that the old bash gate had an explicit child-Application count guard, while this fix uses root-sync success as the materialization proxy. The proxy is sufficient for the reported flake (root cannot reach Succeeded without having created children), but a follow-up "wait for root.status.resources non-empty" step would restore full pre-#1050 parity for cases where the count itself is the invariant. Tracked for a follow-up if reviewers want belt-and-suspenders.

Checklist

  • Tests pass locally (make test with -race) — N/A (no Go change)
  • Linter passes (make lint) — yamllint clean on both touched files, chainsaw lint valid
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — N/A (the change IS in the test infrastructure)
  • I updated docs if user-facing behavior changed — N/A (no user-facing surface)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@coderabbitai

coderabbitai Bot commented May 27, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR gates all four Chainsaw "pass" arms for Argo CD Applications on operationState.phase == 'Succeeded' by moving that condition to an outer conjunction in the JMESPath predicate, and updates inline comments to match. It also expands the .github/workflows/kwok-recipes.yaml paths filters so changes under tests/chainsaw/kwok/** will trigger the KWOK Cluster Validation workflow on push and pull_request.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/aicr#1050: Introduced the Chainsaw-based kwok-argocd-sync gate that this PR adjusts.

Suggested labels

area/ci

Suggested reviewers

  • lalitadithya
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: requiring Succeeded operationState in the argocd sync gate predicate to fix a race condition.
Linked Issues check ✅ Passed The PR implements the minimum-invasive fix proposed in issue #1061 by requiring operationState.phase == 'Succeeded' for all four arms of the pass predicate, which directly addresses the race condition.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the race: the chainsaw-test.yaml predicate fix and the workflow paths filter update are both necessary for the fix.
Description check ✅ Passed The pull request description clearly relates to the changeset: it explains the JMESPath predicate tightening in chainsaw-test.yaml and the workflow path-filter expansion, with detailed motivation (race condition evidence), implementation details, and testing performed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

…IA#1061)

The chainsaw assertion at `assert-all-applications-pass` could pass
within ~4-5 seconds of helm install, before Argo CD had completed the
initial App-of-Apps sync that creates the child Applications. With only
the root Application present in the argocd namespace and the predicate
satisfied for that root alone, the assertion returned PASS while the
cluster was still half-deployed. The subsequent pod-scheduling check in
`validate-scheduling.sh` then tripped on KIND-default kube-system pods
(coredns, local-path-provisioner) that the AICR bundle's pods would
otherwise have rendered insignificant.

Same-SHA evidence: run 26523455927 on `8b5ef744` failed on first attempt
and passed on retry of identical content — a textbook race-condition
signature. The pre-NVIDIA#1050 bash gate avoided the race because its polling
loop took long enough that children materialized before the next
re-list.

Tighten the JMESPath predicate so all four arms require
`operationState.phase == 'Succeeded'`. An Application can only count as
passing after Argo CD has completed at least one sync operation; for the
App-of-Apps that's the gate ensuring the children exist and join the
selector's match set on subsequent polls. The four sync/health
combinations remain (canonical pass + the three KWOK/operator-mutation
relaxations the old bash gate accepted), so legitimate Application
states still resolve to PASS once Argo CD has actually run.

The sibling flux-sync chainsaw test does not need a parallel change: it
asserts `Ready=True` conditions on HelmRelease and Kustomization, which
already require their controllers to have completed reconciliation.

Also add `tests/chainsaw/kwok/**` to the `KWOK Cluster Validation`
workflow path filters. PR NVIDIA#1050 placed the new chainsaw scenarios under
`tests/chainsaw/kwok/`, but the workflow filters were not updated to
match — so changes to the scenarios silently skipped KWOK CI. Now any
edit to the chainsaw KWOK scenarios re-runs the matrix automatically.

Fixes NVIDIA#1061
Related NVIDIA#1050 (the chainsaw migration that introduced the race)
@yuanchen8911 yuanchen8911 force-pushed the fix/kwok-argocd-sync-gate-race-1061 branch from 93eb4d6 to 569d514 Compare May 27, 2026 16:52
@yuanchen8911 yuanchen8911 requested a review from a team as a code owner May 27, 2026 16:52
@yuanchen8911 yuanchen8911 requested a review from mchmarny May 27, 2026 16:55

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tight, focused fix. The predicate change is the right shape: factoring operationState.phase == 'Succeeded' out as a top-level conjunct makes all four arms require a completed sync operation, which rejects the pre-sync window where the App-of-Apps root can briefly satisfy Synced before children are materialized. Boolean check holds (A ∨ B ∨ (C∧S) ∨ (D∧S)S ∧ (A ∨ B ∨ C ∨ D) is strictly stricter), and JMESPath fails closed on missing operationState so there's no surprise behavior on the cold path.

Path-filter addition is the other half of the fix — caught by Codex review on the first push, now correctly added to both push and pull_request paths so future chainsaw scenario edits actually re-trigger KWOK CI.

Comment block is also a notable upgrade — explains the race, references #1061 and #1050, and keeps the arms table readable after pulling the per-arm op.phase notes into the unified conjunct.

CI green (101 checks). "blocked" mergeable state is required-reviewer, not a CI signal. The status.resources non-empty follow-up the author flagged is a sensible belt-and-suspenders that doesn't block this PR.

@mchmarny mchmarny merged commit 5f72700 into NVIDIA:main May 27, 2026
102 checks passed
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 28, 2026
verify_pods() in kwok/scripts/validate-scheduling.sh issued five
separate `kubectl get pods` calls — total / pending / failed / running /
unscheduled — over several seconds (the unscheduled query alone takes
hundreds of ms due to `-o json | jq`). Between those calls, operator
controllers (gpu-operator, dynamo, dra-driver, cert-manager,
prometheus, …) kept reconciling and creating new pods, so the
snapshots disagreed with each other.

Symptom: arithmetic that cannot come from any single Kubernetes state:

  Pod status: 20 total, 20 running, 0 pending (1 unscheduled), 0 failed
  ERROR: Unscheduled pods:
         (kubectl prints "No resources found")
  ERROR: Events for unscheduled pods:
    kube-system  coredns-…  FailedScheduling  untolerated taint(s)
    ...

`unscheduled` is a subset of `pending`, so pending=0 + unscheduled=1
can only happen across two different state samples. By the time the
diagnostic re-query ran, the bundle pod had already been bound by the
scheduler, so the listing came up empty; the unfiltered event dump
then surfaced kube-system FailedScheduling noise from the intentional
`nfd-excluded=true:NoSchedule` taint on the kind control-plane node,
misleading at least two prior root-cause analyses (issue NVIDIA#1090 has
the full evidence trail across jobs 78343519974 and 78341110233).

This change makes verify_pods deterministic by:

  1. Taking ONE `kubectl get pods --all-namespaces -o json` snapshot
     at the top of the function and deriving all five counts plus the
     pod distribution from it via jq.
  2. Rendering the "Unscheduled pods:" diagnostic from the same
     snapshot — never prints "No resources found" when the count is
     positive, because they're computed from one source of truth.
  3. Filtering the FailedScheduling event dump with the same namespace
     exclusions used by the count (kube-system, kube-node-lease,
     kube-public, local-path-storage, kwok-system, argocd, flux-system,
     aicr-registry). Persistent harness noise from the intentional
     nfd-excluded taint no longer surfaces as if it were the cause.
  4. (Drive-by, same shape) Rendering the failed-pod diagnostic from
     the same snapshot for consistency.
  5. Failing closed when `kubectl get pods` itself errors. The pre-
     existing `2>/dev/null | wc -l` chain silently produced
     total_pods=0 in that case and flowed through the "bundle may be
     empty" PASS branch — so an apiserver outage, auth/RBAC failure,
     or bad kube context would be reported as a scheduling pass. The
     rewrite captures stderr so the failure mode is visible in the
     job log, then returns 1.

Race classes NVIDIA#1090 explicitly does NOT solve:
  - The Argo CD sync-gate race in NVIDIA#1061 (fixed separately by NVIDIA#1062).
  - The scheduler genuinely not having bound a pod yet at the moment
    the snapshot was taken. The single-snapshot view will still flag
    that as a failure, which is correct. An optional re-snapshot-once
    retry was considered and deferred to keep this change minimal —
    see NVIDIA#1090 acceptance criteria.

Smoke-tested locally with a synthetic snapshot covering excluded
namespaces, Job-owned helm hooks, and Failed pods. `make qualify`
clean (22/22 chainsaw, no vulns, lint 0 issues).

Fixes: NVIDIA#1090
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 28, 2026
NVIDIA#1090)

verify_pods() in kwok/scripts/validate-scheduling.sh had two flake
classes that produced false failures on the KWOK Tier-1 lanes:

1. TOCTOU race across multiple `kubectl get pods` calls. The previous
   implementation issued five separate queries — total / pending /
   failed / running / unscheduled — over several seconds (the
   unscheduled query alone takes hundreds of ms due to `-o json | jq`).
   Between calls, operator controllers (gpu-operator, dynamo, dra-
   driver, cert-manager, prometheus, …) kept reconciling and creating
   new pods, so the snapshots disagreed with each other. Symptom:

     Pod status: 20 total, 20 running, 0 pending (1 unscheduled), 0 failed
     ERROR: Unscheduled pods:
            (kubectl returned "No resources found")
     ERROR: Events for unscheduled pods:
       kube-system  coredns-…  FailedScheduling  untolerated taint(s)

   `unscheduled` is a subset of `pending`, so pending=0 + unscheduled=1
   is impossible in a single state sample. The transient pod had been
   bound by the time the diagnostic re-queried, and the unfiltered
   event dump then surfaced kube-system noise from the intentional
   `nfd-excluded=true:NoSchedule` taint on the kind control-plane node
   — misleading at least two prior root-cause analyses.

2. Post-sync-gate scheduler latency. After Argo CD / Flux reports
   Synced+Healthy, controllers continue creating pods that have not
   yet been bound. A single coherent snapshot taken in that window
   correctly reports the pre-bind transient as a failure (verified on
   the eks-inference/argocd-oci and oke-training/argocd-oci lanes —
   cert-manager-786cc54854-kngvm and skyhook-operator-controller-
   manager-596b99b857-8shfm respectively, both with zero
   FailedScheduling events, meaning the scheduler simply had not
   gotten to them yet).

This change makes verify_pods deterministic by:

  1. Taking one `kubectl get pods --all-namespaces -o json` snapshot
     per attempt and deriving all five counts plus the pod
     distribution from it via jq. Within an attempt, the math is
     guaranteed coherent — the impossible-pending/unscheduled
     signature cannot reappear.
  2. Re-snapshotting on a bounded poll loop (default 60s deadline at
     5s intervals, both tunable via KWOK_VERIFY_PODS_DEADLINE /
     KWOK_VERIFY_PODS_INTERVAL) when unscheduled_pods > 0. The loop
     exits early as soon as a snapshot shows unscheduled == 0 (or
     any Failed pod, which is terminal). Only declares failure when
     the deadline is reached AND unscheduled is still positive.
  3. Rendering the "Unscheduled pods:" diagnostic and pod distribution
     from the same (final) snapshot — never prints "No resources
     found" when the count is positive.
  4. Filtering the FailedScheduling event dump with the same
     namespace exclusions used by the count. Persistent harness noise
     from the intentional nfd-excluded taint no longer surfaces as if
     it were the cause.
  5. Failing closed when `kubectl get pods` itself errors. The pre-
     existing `2>/dev/null | wc -l` chain silently produced
     total_pods=0 on apiserver outage / auth failure / bad context
     and flowed through the "bundle may be empty" PASS branch. The
     rewrite captures kubectl's stderr so the failure mode appears
     in the job log, then returns 1.
  6. (Drive-by, same shape) Rendering the failed-pod diagnostic from
     the same snapshot for consistency.

Race classes NVIDIA#1090 explicitly does NOT solve:
  - The Argo CD sync-gate race in NVIDIA#1061 (fixed separately by NVIDIA#1062).
  - A genuinely unschedulable pod (missing toleration, no matching
    nodeSelector, insufficient resources). Those persist through the
    full deadline and are reported with the same diagnostic as before
    — now correctly filtered for namespace noise.

Locally smoke-tested with a synthetic snapshot covering excluded
namespaces, Job-owned helm hooks, and Failed pods. `make qualify`
clean (22/22 chainsaw, no vulns, lint 0 issues).

Fixes: NVIDIA#1090
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 28, 2026
NVIDIA#1090)

verify_pods() in kwok/scripts/validate-scheduling.sh had two flake
classes that produced false failures on the KWOK Tier-1 lanes:

1. TOCTOU race across multiple `kubectl get pods` calls. The previous
   implementation issued five separate queries — total / pending /
   failed / running / unscheduled — over several seconds (the
   unscheduled query alone takes hundreds of ms due to `-o json | jq`).
   Between calls, operator controllers (gpu-operator, dynamo, dra-
   driver, cert-manager, prometheus, …) kept reconciling and creating
   new pods, so the snapshots disagreed with each other. Symptom:

     Pod status: 20 total, 20 running, 0 pending (1 unscheduled), 0 failed
     ERROR: Unscheduled pods:
            (kubectl returned "No resources found")
     ERROR: Events for unscheduled pods:
       kube-system  coredns-…  FailedScheduling  untolerated taint(s)

   `unscheduled` is a subset of `pending`, so pending=0 + unscheduled=1
   is impossible in a single state sample. The transient pod had been
   bound by the time the diagnostic re-queried, and the unfiltered
   event dump then surfaced kube-system noise from the intentional
   `nfd-excluded=true:NoSchedule` taint on the kind control-plane node
   — misleading at least two prior root-cause analyses.

2. Post-sync-gate scheduler latency. After Argo CD / Flux reports
   Synced+Healthy, controllers continue creating pods that have not
   yet been bound. A single coherent snapshot taken in that window
   correctly reports the pre-bind transient as a failure (verified on
   the eks-inference/argocd-oci and oke-training/argocd-oci lanes —
   cert-manager-786cc54854-kngvm and skyhook-operator-controller-
   manager-596b99b857-8shfm respectively, both with zero
   FailedScheduling events, meaning the scheduler simply had not
   gotten to them yet).

This change makes verify_pods deterministic by:

  1. Taking one `kubectl get pods --all-namespaces -o json` snapshot
     per attempt and deriving all five counts plus the pod
     distribution from it via jq. Within an attempt, the math is
     guaranteed coherent — the impossible-pending/unscheduled
     signature cannot reappear.
  2. Re-snapshotting on a bounded poll loop (default 60s deadline at
     5s intervals, both tunable via KWOK_VERIFY_PODS_DEADLINE /
     KWOK_VERIFY_PODS_INTERVAL) when unscheduled_pods > 0. The loop
     exits early as soon as a snapshot shows unscheduled == 0 (or
     any Failed pod, which is terminal). Only declares failure when
     the deadline is reached AND unscheduled is still positive.
  3. Rendering the "Unscheduled pods:" diagnostic and pod distribution
     from the same (final) snapshot — never prints "No resources
     found" when the count is positive.
  4. Filtering the FailedScheduling event dump with the same
     namespace exclusions used by the count. Persistent harness noise
     from the intentional nfd-excluded taint no longer surfaces as if
     it were the cause.
  5. Failing closed when `kubectl get pods` itself errors. The pre-
     existing `2>/dev/null | wc -l` chain silently produced
     total_pods=0 on apiserver outage / auth failure / bad context
     and flowed through the "bundle may be empty" PASS branch. The
     rewrite captures kubectl's stderr so the failure mode appears
     in the job log, then returns 1.
  6. (Drive-by, same shape) Rendering the failed-pod diagnostic from
     the same snapshot for consistency.

Race classes NVIDIA#1090 explicitly does NOT solve:
  - The Argo CD sync-gate race in NVIDIA#1061 (fixed separately by NVIDIA#1062).
  - A genuinely unschedulable pod (missing toleration, no matching
    nodeSelector, insufficient resources). Those persist through the
    full deadline and are reported with the same diagnostic as before
    — now correctly filtered for namespace noise.

Locally smoke-tested with a synthetic snapshot covering excluded
namespaces, Job-owned helm hooks, and Failed pods. `make qualify`
clean (22/22 chainsaw, no vulns, lint 0 issues).

Fixes: NVIDIA#1090
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request May 28, 2026
NVIDIA#1090)

verify_pods() in kwok/scripts/validate-scheduling.sh had two flake
classes that produced false failures on the KWOK Tier-1 lanes:

1. TOCTOU race across multiple `kubectl get pods` calls. The previous
   implementation issued five separate queries — total / pending /
   failed / running / unscheduled — over several seconds (the
   unscheduled query alone takes hundreds of ms due to `-o json | jq`).
   Between calls, operator controllers (gpu-operator, dynamo, dra-
   driver, cert-manager, prometheus, …) kept reconciling and creating
   new pods, so the snapshots disagreed with each other. Symptom:

     Pod status: 20 total, 20 running, 0 pending (1 unscheduled), 0 failed
     ERROR: Unscheduled pods:
            (kubectl returned "No resources found")
     ERROR: Events for unscheduled pods:
       kube-system  coredns-…  FailedScheduling  untolerated taint(s)

   `unscheduled` is a subset of `pending`, so pending=0 + unscheduled=1
   is impossible in a single state sample. The transient pod had been
   bound by the time the diagnostic re-queried, and the unfiltered
   event dump then surfaced kube-system noise from the intentional
   `nfd-excluded=true:NoSchedule` taint on the kind control-plane node
   — misleading at least two prior root-cause analyses.

2. Post-sync-gate scheduler latency. After Argo CD / Flux reports
   Synced+Healthy, controllers continue creating pods that have not
   yet been bound. A single coherent snapshot taken in that window
   correctly reports the pre-bind transient as a failure (verified on
   the eks-inference/argocd-oci and oke-training/argocd-oci lanes —
   cert-manager-786cc54854-kngvm and skyhook-operator-controller-
   manager-596b99b857-8shfm respectively, both with zero
   FailedScheduling events, meaning the scheduler simply had not
   gotten to them yet).

This change makes verify_pods deterministic by:

  1. Taking one `kubectl get pods --all-namespaces -o json` snapshot
     per attempt and deriving all five counts plus the pod
     distribution from it via jq. Within an attempt, the math is
     guaranteed coherent — the impossible-pending/unscheduled
     signature cannot reappear.
  2. Re-snapshotting on a bounded poll loop (default 60s deadline at
     5s intervals, both tunable via KWOK_VERIFY_PODS_DEADLINE /
     KWOK_VERIFY_PODS_INTERVAL) when unscheduled_pods > 0. The loop
     exits early as soon as a snapshot shows unscheduled == 0 (or
     any Failed pod, which is terminal). Only declares failure when
     the deadline is reached AND unscheduled is still positive.
  3. Rendering the "Unscheduled pods:" diagnostic and pod distribution
     from the same (final) snapshot — never prints "No resources
     found" when the count is positive.
  4. Filtering the FailedScheduling event dump with the same
     namespace exclusions used by the count. Persistent harness noise
     from the intentional nfd-excluded taint no longer surfaces as if
     it were the cause.
  5. Failing closed when `kubectl get pods` itself errors. The pre-
     existing `2>/dev/null | wc -l` chain silently produced
     total_pods=0 on apiserver outage / auth failure / bad context
     and flowed through the "bundle may be empty" PASS branch. The
     rewrite captures kubectl's stderr so the failure mode appears
     in the job log, then returns 1.
  6. (Drive-by, same shape) Rendering the failed-pod diagnostic from
     the same snapshot for consistency.

Race classes NVIDIA#1090 explicitly does NOT solve:
  - The Argo CD sync-gate race in NVIDIA#1061 (fixed separately by NVIDIA#1062).
  - A genuinely unschedulable pod (missing toleration, no matching
    nodeSelector, insufficient resources). Those persist through the
    full deadline and are reported with the same diagnostic as before
    — now correctly filtered for namespace noise.

Locally smoke-tested with a synthetic snapshot covering excluded
namespaces, Job-owned helm hooks, and Failed pods. `make qualify`
clean (22/22 chainsaw, no vulns, lint 0 issues).

Fixes: NVIDIA#1090
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

KWOK argocd-* sync gate races ahead of child Application materialization (regression from #1050)

2 participants