Skip to content

fix(kwok): single-snapshot verify_pods + settle retry (#1090)#1092

Merged
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/kwok-verify-pods-race-1090
May 28, 2026
Merged

fix(kwok): single-snapshot verify_pods + settle retry (#1090)#1092
mchmarny merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/kwok-verify-pods-race-1090

Conversation

@yuanchen8911

@yuanchen8911 yuanchen8911 commented May 28, 2026

Copy link
Copy Markdown
Contributor

Summary

Rewrites verify_pods() in kwok/scripts/validate-scheduling.sh to take one coherent kubectl get pods -o json snapshot per attempt, polls on a bounded scheduler-settle window when pods are still unscheduled, and derives every count, the pod distribution, and the failure diagnostic from the same snapshot — eliminating both the TOCTOU race (impossible math) and the post-sync-gate scheduler-latency flake.

Motivation / Context

verify_pods() had two flake classes producing false failures on KWOK Tier-1 lanes:

1. TOCTOU race across five separate kubectl get pods calls. total / pending / failed / running / unscheduled were sampled sequentially over several seconds (the unscheduled query alone takes hundreds of ms due to -o json | jq). Between calls, operator controllers (gpu-operator, dynamo, dra-driver, …) kept creating pods, so the snapshots disagreed:

Pod status: 20 total, 20 running, 0 pending (1 unscheduled), 0 failed
ERROR: Unscheduled pods:
       (kubectl returned "No resources found")
ERROR: Events for unscheduled pods:
  kube-system  coredns-…  FailedScheduling  untolerated taint(s)

unscheduled is a subset of pending, so pending=0 + unscheduled=1 cannot come from one state. The unfiltered event dump then surfaced kube-system noise from the intentional nfd-excluded=true:NoSchedule taint, misleading multiple prior reviews.

2. Post-sync-gate scheduler latency. Even after switching to a single coherent snapshot, two lanes failed with internally-consistent math but a real Pending pod that simply hadn't been bound yet:

  • eks-inference (argocd-oci): cert-manager/cert-manager-786cc54854-kngvmPod status: 2 total, 0 running, 2 pending (1 unscheduled), 0 failed, zero FailedScheduling events
  • oke-training (argocd-oci): skyhook/skyhook-operator-controller-manager-596b99b857-8shfmPod status: 12 total, 11 running, 1 pending (1 unscheduled), 0 failed, zero FailedScheduling events

Zero FailedScheduling events means the scheduler hadn't tried to schedule yet — pure latency between Argo CD reporting Synced+Healthy and the scheduler catching up. A single instantaneous snapshot would fail-fast on this transient.

Issue #1090 has the full evidence trail (jobs 78343519974, 78341110233 for the TOCTOU race; jobs 78353433292, 78353436030 for the post-sync latency).

Fixes: #1090
Related: #1061 (closed, sibling sync-gate race fixed by #1062 — distinct from this one)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Other: KWOK CI harness (kwok/scripts/validate-scheduling.sh)

Implementation Notes

Six changes inside verify_pods():

  1. Single snapshot per attempt. One kubectl get pods --all-namespaces -o json per loop iteration; all five counts (total / running / pending / failed / unscheduled) plus pod distribution derived from it via jq. Within one iteration the math is guaranteed coherent — impossible-math signatures cannot reappear.

  2. Bounded scheduler-settle poll loop. When unscheduled_pods > 0, the loop re-snapshots up to KWOK_VERIFY_PODS_DEADLINE seconds (default 60) at KWOK_VERIFY_PODS_INTERVAL spacing (default 5). Early-exits as soon as a snapshot shows unscheduled == 0 (or any Failed > 0, which is terminal). Declares failure only if the deadline elapses with unscheduled still positive — uses the FINAL snapshot for diagnostics so the named pod is the one that was still unscheduled at the deadline.

  3. Diagnostics from the same snapshot. "Unscheduled pods:" listing and pod distribution come from the final snapshot — never prints "No resources found" when the count is positive.

  4. Filtered FailedScheduling event dump. Piped through grep -vE "^(${exclude_ns})\s" so kube-system / local-path-storage noise from the intentional nfd-excluded taint stops surfacing as if it were the cause.

  5. Fail-closed on kubectl error. Pre-existing 2>/dev/null | wc -l silently produced total_pods=0 on apiserver outage / auth failure / bad context and flowed through the "bundle may be empty" PASS branch. Now captures kubectl's stderr into a mktemp file and surfaces it in log_error before return 1.

  6. (Drive-by, same shape) Rendering the failed-pod diagnostic from the same snapshot for consistency.

Tunables

Env var Default Purpose
KWOK_VERIFY_PODS_DEADLINE 60 Total seconds to wait for the scheduler to bind remaining Pending pods before declaring failure.
KWOK_VERIFY_PODS_INTERVAL 5 Seconds between re-snapshots within the deadline window.

Per-iteration cost: 1 kubectl call + 5 jq calls (~0.5s total). With default deadline, max ~13 iterations. Loop exits early on success, so the additional cost is paid only when scheduler latency is real — for hung pods that would have failed anyway, this adds the deadline as a fixed cost (worth it for stability).

What this PR does NOT change

  • The intentional nfd-excluded=true:NoSchedule taint on the kind control-plane (still applied; only the diagnostic stopped surfacing its noise).
  • The Argo CD sync-gate behavior from fix(kwok): require Succeeded operationState in argocd sync gate #1062 (separate race, separately fixed).
  • Recipe overlays or any Go code.
  • Genuine scheduling failures (missing toleration, no matching nodeSelector, insufficient resources): those persist through the full deadline and are reported with the same diagnostic.

Testing

# Static checks
shellcheck -x kwok/scripts/validate-scheduling.sh   # 0 errors

# Local smoke test with synthetic snapshot
# Covered: excluded namespaces (kube-system, local-path-storage, argocd),
#          Job-owned helm hooks, Failed pods, ContainerCreating pods.

# Project gate
unset GITLAB_TOKEN
make qualify    # 0 issues lint, 22/22 chainsaw, no vulnerabilities, license OK

End-to-end verification: the prior CI run on this PR (commit d702b9e9) hit the post-sync-gate scheduler latency on two argocd-oci lanes — exactly the case the new poll loop addresses. The current revision (e4827765) is the response to that finding. The next KWOK run on this PR is the natural integration test; if any lane still fails, the failure will name a specific pod that remained unscheduled for the full 60s deadline, which is a real bug not a flake.

Coverage: script-only change (no Go), so the per-package coverage gate does not apply per CLAUDE.md.

Risk Assessment

  • Low — Single-function rewrite in a CI-only harness script. Success-path log unchanged on the happy path; failure-path output strictly more useful. The poll loop adds bounded latency (default 60s max) only when pods are not yet scheduled — already-scheduled bundles see zero added latency. Easy to revert (one file, one commit).

Rollout notes: None. Takes effect on first run after merge. No flag, no migration. The two new env vars default to the values used during development; operators only need to set them if they want a shorter/longer settle window.

Checklist

  • Tests pass locally (make test with -race) — covered by make qualify
  • Linter passes (make lint) — 0 issues
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — N/A, no Go path; smoke-tested locally and end-to-end on the prior PR commit
  • I updated docs if user-facing behavior changed — N/A, CI internals only
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown

Actionable comments posted: 0

@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: cc8e5bf7-d511-45c4-abea-cf45f4a47cd8

📥 Commits

Reviewing files that changed from the base of the PR and between c1a1b56 and bae72e2.

📒 Files selected for processing (1)
  • kwok/scripts/validate-scheduling.sh

📝 Walkthrough

Walkthrough

The PR refactors kwok/scripts/validate-scheduling.sh::verify_pods() to snapshot all pods once per retry via kubectl get pods --all-namespaces -o json, validate KWOK_VERIFY_PODS_DEADLINE/INTERVAL, and compute total/running/pending/failed/truly-unscheduled counts plus per-node distribution from that snapshot using jq. Unscheduled detection is tightened to Pending pods with no spec.nodeName excluding Job-owned pods. On failure, unscheduled pod lists, a capped FailedScheduling event dump, and Failed-pod reason/messages are all produced from the same snapshot and use a shared namespace exclusion filter.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related issues

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix(kwok): single-snapshot verify_pods + settle retry (#1090)' directly describes the main changes: a fix to use single snapshot per iteration and adding a settle/retry loop.
Description check ✅ Passed The description comprehensively details the problem (TOCTOU race and scheduler latency), the solution (single snapshot per iteration, bounded retry loop), implementation notes, testing, and risk assessment—directly related to the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@yuanchen8911 yuanchen8911 force-pushed the fix/kwok-verify-pods-race-1090 branch from 223f847 to d702b9e Compare May 28, 2026 17:49
@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown

Actionable comments posted: 0

@yuanchen8911 yuanchen8911 force-pushed the fix/kwok-verify-pods-race-1090 branch from d702b9e to e482776 Compare May 28, 2026 18:09
@yuanchen8911 yuanchen8911 changed the title fix(kwok): single-snapshot verify_pods to stop TOCTOU race (#1090) fix(kwok): single-snapshot verify_pods + settle retry (#1090) May 28, 2026
@yuanchen8911 yuanchen8911 requested review from lockwobr and mchmarny May 28, 2026 18:20
@yuanchen8911 yuanchen8911 marked this pull request as draft May 28, 2026 18:46
mchmarny
mchmarny previously approved these changes May 28, 2026

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid fix. The single-snapshot pattern eliminates the impossible-math signature from #1090 by construction, and the bounded settle loop addresses the post-sync scheduler latency observed on eks-inference (argocd-oci) / oke-training (argocd-oci) without adding latency on the happy path. CI is green across all 22 KWOK Tier-1 lanes on this revision.

Two nits on the loop machinery (neither blocking), plus props on the comment density — the inline rationale is going to save the next person a lot of root-cause time.

Comment thread kwok/scripts/validate-scheduling.sh
Comment thread kwok/scripts/validate-scheduling.sh Outdated
@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown

Actionable comments posted: 0

NVIDIA#1090)

verify_pods() in kwok/scripts/validate-scheduling.sh had two flake
classes that produced false failures on the KWOK Tier-1 lanes:

1. TOCTOU race across multiple `kubectl get pods` calls. The previous
   implementation issued five separate queries — total / pending /
   failed / running / unscheduled — over several seconds (the
   unscheduled query alone takes hundreds of ms due to `-o json | jq`).
   Between calls, operator controllers (gpu-operator, dynamo, dra-
   driver, cert-manager, prometheus, …) kept reconciling and creating
   new pods, so the snapshots disagreed with each other. Symptom:

     Pod status: 20 total, 20 running, 0 pending (1 unscheduled), 0 failed
     ERROR: Unscheduled pods:
            (kubectl returned "No resources found")
     ERROR: Events for unscheduled pods:
       kube-system  coredns-…  FailedScheduling  untolerated taint(s)

   `unscheduled` is a subset of `pending`, so pending=0 + unscheduled=1
   is impossible in a single state sample. The transient pod had been
   bound by the time the diagnostic re-queried, and the unfiltered
   event dump then surfaced kube-system noise from the intentional
   `nfd-excluded=true:NoSchedule` taint on the kind control-plane node
   — misleading at least two prior root-cause analyses.

2. Post-sync-gate scheduler latency. After Argo CD / Flux reports
   Synced+Healthy, controllers continue creating pods that have not
   yet been bound. A single coherent snapshot taken in that window
   correctly reports the pre-bind transient as a failure (verified on
   the eks-inference/argocd-oci and oke-training/argocd-oci lanes —
   cert-manager-786cc54854-kngvm and skyhook-operator-controller-
   manager-596b99b857-8shfm respectively, both with zero
   FailedScheduling events, meaning the scheduler simply had not
   gotten to them yet).

This change makes verify_pods deterministic by:

  1. Taking one `kubectl get pods --all-namespaces -o json` snapshot
     per attempt and deriving all five counts plus the pod
     distribution from it via jq. Within an attempt, the math is
     guaranteed coherent — the impossible-pending/unscheduled
     signature cannot reappear.
  2. Re-snapshotting on a bounded poll loop (default 60s deadline at
     5s intervals, both tunable via KWOK_VERIFY_PODS_DEADLINE /
     KWOK_VERIFY_PODS_INTERVAL) when unscheduled_pods > 0. The loop
     exits early as soon as a snapshot shows unscheduled == 0 (or
     any Failed pod, which is terminal). Only declares failure when
     the deadline is reached AND unscheduled is still positive.
  3. Rendering the "Unscheduled pods:" diagnostic and pod distribution
     from the same (final) snapshot — never prints "No resources
     found" when the count is positive.
  4. Filtering the FailedScheduling event dump with the same
     namespace exclusions used by the count. Persistent harness noise
     from the intentional nfd-excluded taint no longer surfaces as if
     it were the cause.
  5. Failing closed when `kubectl get pods` itself errors. The pre-
     existing `2>/dev/null | wc -l` chain silently produced
     total_pods=0 on apiserver outage / auth failure / bad context
     and flowed through the "bundle may be empty" PASS branch. The
     rewrite captures kubectl's stderr so the failure mode appears
     in the job log, then returns 1.
  6. (Drive-by, same shape) Rendering the failed-pod diagnostic from
     the same snapshot for consistency.

Race classes NVIDIA#1090 explicitly does NOT solve:
  - The Argo CD sync-gate race in NVIDIA#1061 (fixed separately by NVIDIA#1062).
  - A genuinely unschedulable pod (missing toleration, no matching
    nodeSelector, insufficient resources). Those persist through the
    full deadline and are reported with the same diagnostic as before
    — now correctly filtered for namespace noise.

Locally smoke-tested with a synthetic snapshot covering excluded
namespaces, Job-owned helm hooks, and Failed pods. `make qualify`
clean (22/22 chainsaw, no vulns, lint 0 issues).

Fixes: NVIDIA#1090
@yuanchen8911 yuanchen8911 force-pushed the fix/kwok-verify-pods-race-1090 branch from c1a1b56 to bae72e2 Compare May 28, 2026 18:55
@yuanchen8911 yuanchen8911 requested a review from mchmarny May 28, 2026 18:56
@yuanchen8911 yuanchen8911 marked this pull request as ready for review May 28, 2026 18:58
@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown

Actionable comments posted: 0

@mchmarny mchmarny merged commit 216c76c into NVIDIA:main May 28, 2026
114 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(kwok): verify_pods samples pod state across 5 kubectl calls — TOCTOU race produces impossible math + misleading diagnostic

2 participants