fix(kwok): single-snapshot verify_pods + settle retry (#1090)#1092
Conversation
|
Actionable comments posted: 0 |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThe PR refactors kwok/scripts/validate-scheduling.sh::verify_pods() to snapshot all pods once per retry via kubectl get pods --all-namespaces -o json, validate KWOK_VERIFY_PODS_DEADLINE/INTERVAL, and compute total/running/pending/failed/truly-unscheduled counts plus per-node distribution from that snapshot using jq. Unscheduled detection is tightened to Pending pods with no spec.nodeName excluding Job-owned pods. On failure, unscheduled pod lists, a capped FailedScheduling event dump, and Failed-pod reason/messages are all produced from the same snapshot and use a shared namespace exclusion filter. Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related issues
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
223f847 to
d702b9e
Compare
|
Actionable comments posted: 0 |
d702b9e to
e482776
Compare
mchmarny
left a comment
There was a problem hiding this comment.
Solid fix. The single-snapshot pattern eliminates the impossible-math signature from #1090 by construction, and the bounded settle loop addresses the post-sync scheduler latency observed on eks-inference (argocd-oci) / oke-training (argocd-oci) without adding latency on the happy path. CI is green across all 22 KWOK Tier-1 lanes on this revision.
Two nits on the loop machinery (neither blocking), plus props on the comment density — the inline rationale is going to save the next person a lot of root-cause time.
e482776 to
c1a1b56
Compare
|
Actionable comments posted: 0 |
NVIDIA#1090) verify_pods() in kwok/scripts/validate-scheduling.sh had two flake classes that produced false failures on the KWOK Tier-1 lanes: 1. TOCTOU race across multiple `kubectl get pods` calls. The previous implementation issued five separate queries — total / pending / failed / running / unscheduled — over several seconds (the unscheduled query alone takes hundreds of ms due to `-o json | jq`). Between calls, operator controllers (gpu-operator, dynamo, dra- driver, cert-manager, prometheus, …) kept reconciling and creating new pods, so the snapshots disagreed with each other. Symptom: Pod status: 20 total, 20 running, 0 pending (1 unscheduled), 0 failed ERROR: Unscheduled pods: (kubectl returned "No resources found") ERROR: Events for unscheduled pods: kube-system coredns-… FailedScheduling untolerated taint(s) `unscheduled` is a subset of `pending`, so pending=0 + unscheduled=1 is impossible in a single state sample. The transient pod had been bound by the time the diagnostic re-queried, and the unfiltered event dump then surfaced kube-system noise from the intentional `nfd-excluded=true:NoSchedule` taint on the kind control-plane node — misleading at least two prior root-cause analyses. 2. Post-sync-gate scheduler latency. After Argo CD / Flux reports Synced+Healthy, controllers continue creating pods that have not yet been bound. A single coherent snapshot taken in that window correctly reports the pre-bind transient as a failure (verified on the eks-inference/argocd-oci and oke-training/argocd-oci lanes — cert-manager-786cc54854-kngvm and skyhook-operator-controller- manager-596b99b857-8shfm respectively, both with zero FailedScheduling events, meaning the scheduler simply had not gotten to them yet). This change makes verify_pods deterministic by: 1. Taking one `kubectl get pods --all-namespaces -o json` snapshot per attempt and deriving all five counts plus the pod distribution from it via jq. Within an attempt, the math is guaranteed coherent — the impossible-pending/unscheduled signature cannot reappear. 2. Re-snapshotting on a bounded poll loop (default 60s deadline at 5s intervals, both tunable via KWOK_VERIFY_PODS_DEADLINE / KWOK_VERIFY_PODS_INTERVAL) when unscheduled_pods > 0. The loop exits early as soon as a snapshot shows unscheduled == 0 (or any Failed pod, which is terminal). Only declares failure when the deadline is reached AND unscheduled is still positive. 3. Rendering the "Unscheduled pods:" diagnostic and pod distribution from the same (final) snapshot — never prints "No resources found" when the count is positive. 4. Filtering the FailedScheduling event dump with the same namespace exclusions used by the count. Persistent harness noise from the intentional nfd-excluded taint no longer surfaces as if it were the cause. 5. Failing closed when `kubectl get pods` itself errors. The pre- existing `2>/dev/null | wc -l` chain silently produced total_pods=0 on apiserver outage / auth failure / bad context and flowed through the "bundle may be empty" PASS branch. The rewrite captures kubectl's stderr so the failure mode appears in the job log, then returns 1. 6. (Drive-by, same shape) Rendering the failed-pod diagnostic from the same snapshot for consistency. Race classes NVIDIA#1090 explicitly does NOT solve: - The Argo CD sync-gate race in NVIDIA#1061 (fixed separately by NVIDIA#1062). - A genuinely unschedulable pod (missing toleration, no matching nodeSelector, insufficient resources). Those persist through the full deadline and are reported with the same diagnostic as before — now correctly filtered for namespace noise. Locally smoke-tested with a synthetic snapshot covering excluded namespaces, Job-owned helm hooks, and Failed pods. `make qualify` clean (22/22 chainsaw, no vulns, lint 0 issues). Fixes: NVIDIA#1090
c1a1b56 to
bae72e2
Compare
|
Actionable comments posted: 0 |
Summary
Rewrites
verify_pods()inkwok/scripts/validate-scheduling.shto take one coherentkubectl get pods -o jsonsnapshot per attempt, polls on a bounded scheduler-settle window when pods are still unscheduled, and derives every count, the pod distribution, and the failure diagnostic from the same snapshot — eliminating both the TOCTOU race (impossible math) and the post-sync-gate scheduler-latency flake.Motivation / Context
verify_pods()had two flake classes producing false failures on KWOK Tier-1 lanes:1. TOCTOU race across five separate
kubectl get podscalls. total / pending / failed / running / unscheduled were sampled sequentially over several seconds (the unscheduled query alone takes hundreds of ms due to-o json | jq). Between calls, operator controllers (gpu-operator, dynamo, dra-driver, …) kept creating pods, so the snapshots disagreed:unscheduledis a subset ofpending, sopending=0 + unscheduled=1cannot come from one state. The unfiltered event dump then surfaced kube-system noise from the intentionalnfd-excluded=true:NoScheduletaint, misleading multiple prior reviews.2. Post-sync-gate scheduler latency. Even after switching to a single coherent snapshot, two lanes failed with internally-consistent math but a real Pending pod that simply hadn't been bound yet:
eks-inference (argocd-oci):cert-manager/cert-manager-786cc54854-kngvm—Pod status: 2 total, 0 running, 2 pending (1 unscheduled), 0 failed, zero FailedScheduling eventsoke-training (argocd-oci):skyhook/skyhook-operator-controller-manager-596b99b857-8shfm—Pod status: 12 total, 11 running, 1 pending (1 unscheduled), 0 failed, zero FailedScheduling eventsZero FailedScheduling events means the scheduler hadn't tried to schedule yet — pure latency between Argo CD reporting Synced+Healthy and the scheduler catching up. A single instantaneous snapshot would fail-fast on this transient.
Issue #1090 has the full evidence trail (jobs
78343519974,78341110233for the TOCTOU race; jobs78353433292,78353436030for the post-sync latency).Fixes: #1090
Related: #1061 (closed, sibling sync-gate race fixed by #1062 — distinct from this one)
Type of Change
Component(s) Affected
kwok/scripts/validate-scheduling.sh)Implementation Notes
Six changes inside
verify_pods():Single snapshot per attempt. One
kubectl get pods --all-namespaces -o jsonper loop iteration; all five counts (total / running / pending / failed / unscheduled) plus pod distribution derived from it via jq. Within one iteration the math is guaranteed coherent — impossible-math signatures cannot reappear.Bounded scheduler-settle poll loop. When
unscheduled_pods > 0, the loop re-snapshots up toKWOK_VERIFY_PODS_DEADLINEseconds (default60) atKWOK_VERIFY_PODS_INTERVALspacing (default5). Early-exits as soon as a snapshot showsunscheduled == 0(or anyFailed > 0, which is terminal). Declares failure only if the deadline elapses withunscheduledstill positive — uses the FINAL snapshot for diagnostics so the named pod is the one that was still unscheduled at the deadline.Diagnostics from the same snapshot. "Unscheduled pods:" listing and pod distribution come from the final snapshot — never prints "No resources found" when the count is positive.
Filtered FailedScheduling event dump. Piped through
grep -vE "^(${exclude_ns})\s"so kube-system / local-path-storage noise from the intentionalnfd-excludedtaint stops surfacing as if it were the cause.Fail-closed on kubectl error. Pre-existing
2>/dev/null | wc -lsilently producedtotal_pods=0on apiserver outage / auth failure / bad context and flowed through the "bundle may be empty" PASS branch. Now captures kubectl's stderr into amktempfile and surfaces it inlog_errorbeforereturn 1.(Drive-by, same shape) Rendering the failed-pod diagnostic from the same snapshot for consistency.
Tunables
KWOK_VERIFY_PODS_DEADLINE60KWOK_VERIFY_PODS_INTERVAL5Per-iteration cost: 1 kubectl call + 5 jq calls (~0.5s total). With default deadline, max ~13 iterations. Loop exits early on success, so the additional cost is paid only when scheduler latency is real — for hung pods that would have failed anyway, this adds the deadline as a fixed cost (worth it for stability).
What this PR does NOT change
nfd-excluded=true:NoScheduletaint on the kind control-plane (still applied; only the diagnostic stopped surfacing its noise).nodeSelector, insufficient resources): those persist through the full deadline and are reported with the same diagnostic.Testing
End-to-end verification: the prior CI run on this PR (commit
d702b9e9) hit the post-sync-gate scheduler latency on two argocd-oci lanes — exactly the case the new poll loop addresses. The current revision (e4827765) is the response to that finding. The next KWOK run on this PR is the natural integration test; if any lane still fails, the failure will name a specific pod that remained unscheduled for the full 60s deadline, which is a real bug not a flake.Coverage: script-only change (no Go), so the per-package coverage gate does not apply per CLAUDE.md.
Risk Assessment
Rollout notes: None. Takes effect on first run after merge. No flag, no migration. The two new env vars default to the values used during development; operators only need to set them if they want a shorter/longer settle window.
Checklist
make testwith-race) — covered bymake qualifymake lint) — 0 issuesgit commit -S)