fix(kwok): single-snapshot verify_pods + settle retry (#1090) by yuanchen8911 · Pull Request #1092 · NVIDIA/aicr

yuanchen8911 · 2026-05-28T17:40:44Z

Summary

Rewrites verify_pods() in kwok/scripts/validate-scheduling.sh to take one coherent kubectl get pods -o json snapshot per attempt, polls on a bounded scheduler-settle window when pods are still unscheduled, and derives every count, the pod distribution, and the failure diagnostic from the same snapshot — eliminating both the TOCTOU race (impossible math) and the post-sync-gate scheduler-latency flake.

Motivation / Context

verify_pods() had two flake classes producing false failures on KWOK Tier-1 lanes:

1. TOCTOU race across five separate kubectl get pods calls. total / pending / failed / running / unscheduled were sampled sequentially over several seconds (the unscheduled query alone takes hundreds of ms due to -o json | jq). Between calls, operator controllers (gpu-operator, dynamo, dra-driver, …) kept creating pods, so the snapshots disagreed:

Pod status: 20 total, 20 running, 0 pending (1 unscheduled), 0 failed
ERROR: Unscheduled pods:
       (kubectl returned "No resources found")
ERROR: Events for unscheduled pods:
  kube-system  coredns-…  FailedScheduling  untolerated taint(s)

unscheduled is a subset of pending, so pending=0 + unscheduled=1 cannot come from one state. The unfiltered event dump then surfaced kube-system noise from the intentional nfd-excluded=true:NoSchedule taint, misleading multiple prior reviews.

2. Post-sync-gate scheduler latency. Even after switching to a single coherent snapshot, two lanes failed with internally-consistent math but a real Pending pod that simply hadn't been bound yet:

eks-inference (argocd-oci): cert-manager/cert-manager-786cc54854-kngvm — Pod status: 2 total, 0 running, 2 pending (1 unscheduled), 0 failed, zero FailedScheduling events
oke-training (argocd-oci): skyhook/skyhook-operator-controller-manager-596b99b857-8shfm — Pod status: 12 total, 11 running, 1 pending (1 unscheduled), 0 failed, zero FailedScheduling events

Zero FailedScheduling events means the scheduler hadn't tried to schedule yet — pure latency between Argo CD reporting Synced+Healthy and the scheduler catching up. A single instantaneous snapshot would fail-fast on this transient.

Issue #1090 has the full evidence trail (jobs 78343519974, 78341110233 for the TOCTOU race; jobs 78353433292, 78353436030 for the post-sync latency).

Fixes: #1090
Related: #1061 (closed, sibling sync-gate race fixed by #1062 — distinct from this one)

Type of Change

Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

Other: KWOK CI harness (kwok/scripts/validate-scheduling.sh)

Implementation Notes

Six changes inside verify_pods():

Single snapshot per attempt. One kubectl get pods --all-namespaces -o json per loop iteration; all five counts (total / running / pending / failed / unscheduled) plus pod distribution derived from it via jq. Within one iteration the math is guaranteed coherent — impossible-math signatures cannot reappear.
Bounded scheduler-settle poll loop. When unscheduled_pods > 0, the loop re-snapshots up to KWOK_VERIFY_PODS_DEADLINE seconds (default 60) at KWOK_VERIFY_PODS_INTERVAL spacing (default 5). Early-exits as soon as a snapshot shows unscheduled == 0 (or any Failed > 0, which is terminal). Declares failure only if the deadline elapses with unscheduled still positive — uses the FINAL snapshot for diagnostics so the named pod is the one that was still unscheduled at the deadline.
Diagnostics from the same snapshot. "Unscheduled pods:" listing and pod distribution come from the final snapshot — never prints "No resources found" when the count is positive.
Filtered FailedScheduling event dump. Piped through grep -vE "^(${exclude_ns})\s" so kube-system / local-path-storage noise from the intentional nfd-excluded taint stops surfacing as if it were the cause.
Fail-closed on kubectl error. Pre-existing 2>/dev/null | wc -l silently produced total_pods=0 on apiserver outage / auth failure / bad context and flowed through the "bundle may be empty" PASS branch. Now captures kubectl's stderr into a mktemp file and surfaces it in log_error before return 1.
(Drive-by, same shape) Rendering the failed-pod diagnostic from the same snapshot for consistency.

Tunables

Env var	Default	Purpose
`KWOK_VERIFY_PODS_DEADLINE`	`60`	Total seconds to wait for the scheduler to bind remaining Pending pods before declaring failure.
`KWOK_VERIFY_PODS_INTERVAL`	`5`	Seconds between re-snapshots within the deadline window.

Per-iteration cost: 1 kubectl call + 5 jq calls (~0.5s total). With default deadline, max ~13 iterations. Loop exits early on success, so the additional cost is paid only when scheduler latency is real — for hung pods that would have failed anyway, this adds the deadline as a fixed cost (worth it for stability).

What this PR does NOT change

The intentional nfd-excluded=true:NoSchedule taint on the kind control-plane (still applied; only the diagnostic stopped surfacing its noise).
The Argo CD sync-gate behavior from fix(kwok): require Succeeded operationState in argocd sync gate #1062 (separate race, separately fixed).
Recipe overlays or any Go code.
Genuine scheduling failures (missing toleration, no matching nodeSelector, insufficient resources): those persist through the full deadline and are reported with the same diagnostic.

Testing

# Static checks
shellcheck -x kwok/scripts/validate-scheduling.sh   # 0 errors

# Local smoke test with synthetic snapshot
# Covered: excluded namespaces (kube-system, local-path-storage, argocd),
#          Job-owned helm hooks, Failed pods, ContainerCreating pods.

# Project gate
unset GITLAB_TOKEN
make qualify    # 0 issues lint, 22/22 chainsaw, no vulnerabilities, license OK

End-to-end verification: the prior CI run on this PR (commit d702b9e9) hit the post-sync-gate scheduler latency on two argocd-oci lanes — exactly the case the new poll loop addresses. The current revision (e4827765) is the response to that finding. The next KWOK run on this PR is the natural integration test; if any lane still fails, the failure will name a specific pod that remained unscheduled for the full 60s deadline, which is a real bug not a flake.

Coverage: script-only change (no Go), so the per-package coverage gate does not apply per CLAUDE.md.

Risk Assessment

Low — Single-function rewrite in a CI-only harness script. Success-path log unchanged on the happy path; failure-path output strictly more useful. The poll loop adds bounded latency (default 60s max) only when pods are not yet scheduled — already-scheduled bundles see zero added latency. Easy to revert (one file, one commit).

Rollout notes: None. Takes effect on first run after merge. No flag, no migration. The two new env vars default to the values used during development; operators only need to set them if they want a shorter/longer settle window.

Checklist

Tests pass locally (make test with -race) — covered by make qualify
Linter passes (make lint) — 0 issues
I did not skip/disable tests to make CI green
I added/updated tests for new functionality — N/A, no Go path; smoke-tested locally and end-to-end on the prior PR commit
I updated docs if user-facing behavior changed — N/A, CI internals only
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

coderabbitai · 2026-05-28T17:44:07Z

Actionable comments posted: 0

coderabbitai · 2026-05-28T17:44:10Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: cc8e5bf7-d511-45c4-abea-cf45f4a47cd8

📥 Commits

Reviewing files that changed from the base of the PR and between c1a1b56 and bae72e2.

📒 Files selected for processing (1)

kwok/scripts/validate-scheduling.sh

📝 Walkthrough

Walkthrough

The PR refactors kwok/scripts/validate-scheduling.sh::verify_pods() to snapshot all pods once per retry via kubectl get pods --all-namespaces -o json, validate KWOK_VERIFY_PODS_DEADLINE/INTERVAL, and compute total/running/pending/failed/truly-unscheduled counts plus per-node distribution from that snapshot using jq. Unscheduled detection is tightened to Pending pods with no spec.nodeName excluding Job-owned pods. On failure, unscheduled pod lists, a capped FailedScheduling event dump, and Failed-pod reason/messages are all produced from the same snapshot and use a shared namespace exclusion filter.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related issues

bug(kwok): verify_pods samples pod state across 5 kubectl calls — TOCTOU race produces impossible math + misleading diagnostic #1090: Same change to kwok/scripts/validate-scheduling.sh::verify_pods — consolidate pod sampling into a single JSON snapshot and use jq-driven diagnostics and filtered events.

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'fix(kwok): single-snapshot verify_pods + settle retry (`#1090`)' directly describes the main changes: a fix to use single snapshot per iteration and adding a settle/retry loop.
Description check	✅ Passed	The description comprehensively details the problem (TOCTOU race and scheduler latency), the solution (single snapshot per iteration, bounded retry loop), implementation notes, testing, and risk assessment—directly related to the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai · 2026-05-28T17:53:00Z

Actionable comments posted: 0

mchmarny

Solid fix. The single-snapshot pattern eliminates the impossible-math signature from #1090 by construction, and the bounded settle loop addresses the post-sync scheduler latency observed on eks-inference (argocd-oci) / oke-training (argocd-oci) without adding latency on the happy path. CI is green across all 22 KWOK Tier-1 lanes on this revision.

Two nits on the loop machinery (neither blocking), plus props on the comment density — the inline rationale is going to save the next person a lot of root-cause time.

coderabbitai · 2026-05-28T18:54:36Z

Actionable comments posted: 0

NVIDIA#1090) verify_pods() in kwok/scripts/validate-scheduling.sh had two flake classes that produced false failures on the KWOK Tier-1 lanes: 1. TOCTOU race across multiple `kubectl get pods` calls. The previous implementation issued five separate queries — total / pending / failed / running / unscheduled — over several seconds (the unscheduled query alone takes hundreds of ms due to `-o json | jq`). Between calls, operator controllers (gpu-operator, dynamo, dra- driver, cert-manager, prometheus, …) kept reconciling and creating new pods, so the snapshots disagreed with each other. Symptom: Pod status: 20 total, 20 running, 0 pending (1 unscheduled), 0 failed ERROR: Unscheduled pods: (kubectl returned "No resources found") ERROR: Events for unscheduled pods: kube-system coredns-… FailedScheduling untolerated taint(s) `unscheduled` is a subset of `pending`, so pending=0 + unscheduled=1 is impossible in a single state sample. The transient pod had been bound by the time the diagnostic re-queried, and the unfiltered event dump then surfaced kube-system noise from the intentional `nfd-excluded=true:NoSchedule` taint on the kind control-plane node — misleading at least two prior root-cause analyses. 2. Post-sync-gate scheduler latency. After Argo CD / Flux reports Synced+Healthy, controllers continue creating pods that have not yet been bound. A single coherent snapshot taken in that window correctly reports the pre-bind transient as a failure (verified on the eks-inference/argocd-oci and oke-training/argocd-oci lanes — cert-manager-786cc54854-kngvm and skyhook-operator-controller- manager-596b99b857-8shfm respectively, both with zero FailedScheduling events, meaning the scheduler simply had not gotten to them yet). This change makes verify_pods deterministic by: 1. Taking one `kubectl get pods --all-namespaces -o json` snapshot per attempt and deriving all five counts plus the pod distribution from it via jq. Within an attempt, the math is guaranteed coherent — the impossible-pending/unscheduled signature cannot reappear. 2. Re-snapshotting on a bounded poll loop (default 60s deadline at 5s intervals, both tunable via KWOK_VERIFY_PODS_DEADLINE / KWOK_VERIFY_PODS_INTERVAL) when unscheduled_pods > 0. The loop exits early as soon as a snapshot shows unscheduled == 0 (or any Failed pod, which is terminal). Only declares failure when the deadline is reached AND unscheduled is still positive. 3. Rendering the "Unscheduled pods:" diagnostic and pod distribution from the same (final) snapshot — never prints "No resources found" when the count is positive. 4. Filtering the FailedScheduling event dump with the same namespace exclusions used by the count. Persistent harness noise from the intentional nfd-excluded taint no longer surfaces as if it were the cause. 5. Failing closed when `kubectl get pods` itself errors. The pre- existing `2>/dev/null | wc -l` chain silently produced total_pods=0 on apiserver outage / auth failure / bad context and flowed through the "bundle may be empty" PASS branch. The rewrite captures kubectl's stderr so the failure mode appears in the job log, then returns 1. 6. (Drive-by, same shape) Rendering the failed-pod diagnostic from the same snapshot for consistency. Race classes NVIDIA#1090 explicitly does NOT solve: - The Argo CD sync-gate race in NVIDIA#1061 (fixed separately by NVIDIA#1062). - A genuinely unschedulable pod (missing toleration, no matching nodeSelector, insufficient resources). Those persist through the full deadline and are reported with the same diagnostic as before — now correctly filtered for namespace noise. Locally smoke-tested with a synthetic snapshot covering excluded namespaces, Job-owned helm hooks, and Failed pods. `make qualify` clean (22/22 chainsaw, no vulns, lint 0 issues). Fixes: NVIDIA#1090

coderabbitai · 2026-05-28T19:00:32Z

Actionable comments posted: 0

yuanchen8911 requested a review from a team as a code owner May 28, 2026 17:40

yuanchen8911 added area/ci area/tests bug labels May 28, 2026

github-actions Bot added size/M and removed area/ci area/tests labels May 28, 2026

yuanchen8911 force-pushed the fix/kwok-verify-pods-race-1090 branch from 223f847 to d702b9e Compare May 28, 2026 17:49

yuanchen8911 force-pushed the fix/kwok-verify-pods-race-1090 branch from d702b9e to e482776 Compare May 28, 2026 18:09

yuanchen8911 changed the title ~~fix(kwok): single-snapshot verify_pods to stop TOCTOU race (#1090)~~ fix(kwok): single-snapshot verify_pods + settle retry (#1090) May 28, 2026

yuanchen8911 requested review from lockwobr and mchmarny May 28, 2026 18:20

yuanchen8911 marked this pull request as draft May 28, 2026 18:46

mchmarny previously approved these changes May 28, 2026

View reviewed changes

Comment thread kwok/scripts/validate-scheduling.sh

Comment thread kwok/scripts/validate-scheduling.sh Outdated

yuanchen8911 dismissed mchmarny’s stale review via c1a1b56 May 28, 2026 18:50

yuanchen8911 force-pushed the fix/kwok-verify-pods-race-1090 branch from e482776 to c1a1b56 Compare May 28, 2026 18:50

github-actions Bot added size/L and removed size/M labels May 28, 2026

yuanchen8911 force-pushed the fix/kwok-verify-pods-race-1090 branch from c1a1b56 to bae72e2 Compare May 28, 2026 18:55

yuanchen8911 requested a review from mchmarny May 28, 2026 18:56

yuanchen8911 marked this pull request as ready for review May 28, 2026 18:58

mchmarny approved these changes May 28, 2026

View reviewed changes

mchmarny merged commit 216c76c into NVIDIA:main May 28, 2026
114 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(kwok): single-snapshot verify_pods + settle retry (#1090)#1092

fix(kwok): single-snapshot verify_pods + settle retry (#1090)#1092
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/kwok-verify-pods-race-1090

yuanchen8911 commented May 28, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 28, 2026

Uh oh!

coderabbitai Bot commented May 28, 2026 •

edited

Loading

Walkthrough

Estimated code review effort

Possibly related issues

Uh oh!

coderabbitai Bot commented May 28, 2026

Uh oh!

mchmarny left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot commented May 28, 2026

Uh oh!

coderabbitai Bot commented May 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

yuanchen8911 commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Tunables

What this PR does NOT change

Testing

Risk Assessment

Checklist

Uh oh!

coderabbitai Bot commented May 28, 2026

Uh oh!

coderabbitai Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

Possibly related issues

Uh oh!

coderabbitai Bot commented May 28, 2026

Uh oh!

mchmarny left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot commented May 28, 2026

Uh oh!

coderabbitai Bot commented May 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yuanchen8911 commented May 28, 2026 •

edited

Loading

coderabbitai Bot commented May 28, 2026 •

edited

Loading