fix(validator): wait for KAI deployments ready in gang-scheduling check by mchmarny · Pull Request #1514 · NVIDIA/aicr

mchmarny · 2026-06-28T16:12:47Z

Summary

Makes the gang-scheduling conformance check wait (bounded) for the KAI scheduler deployments to become Available, instead of sampling them instantaneously — fixing a flake that failed the whole conformance phase on a transient single-replica blip.

Motivation / Context

AWS UAT run 28312496969 failed with:

[NOT_FOUND] KAI scheduler component admission check failed:
[INTERNAL] deployment kai-scheduler/admission not available: 0/1 replicas

gang-scheduling was the only failing check (13 others passed; evidence emission succeeded), but --fail-on-error turned it into exit 8.

Root cause: CheckGangScheduling step 1 verified the 7 required KAI deployments with getDeploymentIfAvailable — a single Get asserting AvailableReplicas >= 1, no wait/retry. The single-replica admission webhook can be transiently unavailable (rollout, restart, node pressure), so a point-in-time read flakes the whole phase. Every other long-running readiness path in the package uses a bounded wait.PollUntilContextCancel; this one didn't. (Ruled out API drift: KAI v0.14.1 still serves scheduling.run.ai/v2alpha2 PodGroups, matching the validator's GVR.)

Fixes: #1513

Type of Change

Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

Validator (pkg/validator) — conformance validators (validators/conformance)

Implementation Notes

New waitForDeploymentAvailable(ctx, namespace, name, timeout) helper: bounded poll (defaults.PodPollInterval) until AvailableReplicas >= 1 or timeout. A not-yet-created deployment is "keep waiting" within the bound; a genuine outage still fails closed after the bound (NotFound vs not-available distinguished).
CheckGangScheduling step 1 now uses it with defaults.K8sPodReadyTimeout (2m) instead of the instantaneous check. The functional PodGroup co-scheduling test is unchanged.

Testing

make qualify

Added TestWaitForDeploymentAvailable_{ImmediatelyReady,TransientBlip,NeverReady,NotFound}. The TransientBlip case is the regression: a deployment reporting 0/1 on the first poll then Available re-polls and passes. Full validators/conformance suite passes; go build ./... clean; golangci-lint 0 issues.

Risk Assessment

Low — Adds a bounded readiness wait on an existing check; happy path is unchanged (returns immediately when Available); only effect is tolerating a transient blip before failing.

Rollout notes: None. No flags, APIs, or recipe changes.

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs if user-facing behavior changed (no user-facing change)
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

…ng check The gang-scheduling conformance check verified the 7 KAI deployments with an instantaneous availability sample (getDeploymentIfAvailable), so a transient 0/1 on the single-replica admission webhook (rollout, restart, node pressure) failed the whole conformance phase. Observed in AWS UAT run 28312496969 where admission was momentarily 0/1 while every other check passed. Add waitForDeploymentAvailable — a bounded poll (K8sPodReadyTimeout) that tolerates a transient blip but still fails closed when a deployment is genuinely down — and use it in CheckGangScheduling step 1. Fixes: #1513 Signed-off-by: Mark Chmarny <[email protected]>

coderabbitai · 2026-06-28T16:18:18Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: bd0fac1f-1478-4dd4-8b19-c50f5fff7c24

📥 Commits

Reviewing files that changed from the base of the PR and between f4143e8 and 65c9a95.

📒 Files selected for processing (1)

validators/conformance/gang_scheduling_check.go

📝 Walkthrough

Walkthrough

CheckGangScheduling step 1 now polls each KAI scheduler deployment with waitForDeploymentAvailable instead of doing a single availability GET. The helper waits until Status.AvailableReplicas >= 1 or a timeout expires, and it distinguishes never-seen deployments, unavailable deployments, cancellation, and polling failures. New tests cover immediate readiness, transient recovery, timeout, cancellation, and missing deployment cases.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

area/tests

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: waiting for KAI deployments to become ready in the gang-scheduling check.
Description check	✅ Passed	The description is directly about the gang-scheduling readiness fix and matches the code changes.
Linked Issues check	✅ Passed	The PR implements the bounded readiness wait, preserves failure distinctions, and adds tests for the required transient and failure cases.
Out of Scope Changes check	✅ Passed	The changes stay focused on the validator helper, its call site, and related tests with no obvious unrelated additions.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/gang-scheduling-readiness-wait

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

validators/conformance/gang_scheduling_check.go (1)
119-123: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Preserve the helper’s error code here. waitForDeploymentAvailable already uses ErrCodeNotFound only for a missing deployment; a deployment that exists but stays unavailable, or any API failure, comes back as ErrCodeInternal. Re-wrapping every failure here as ErrCodeNotFound makes those cases look missing. Return err directly, or use PropagateOrWrap if you want to keep the component name in the message.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@validators/conformance/gang_scheduling_check.go` around lines 119 - 123, The
error handling in the gang scheduling check is overwriting the helper’s original
error code. In `waitForDeploymentAvailable` usage inside the scheduler check,
keep the returned `err` as-is instead of always wrapping it with
`ErrCodeNotFound`. If you want to add the component context from the scheduler
check, use `PropagateOrWrap` so `ErrCodeNotFound` stays only for missing
deployments and other failures preserve their original code.
Source: Learnings

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@validators/conformance/helpers.go`:
- Around line 205-213: The readiness polling branch in the deployment helper is
treating caller cancellation the same as a deployment failure. In the helper
that checks `pollCtx.Err()`, first detect when the parent context (`ctx.Ctx`)
has been canceled and return that cancellation/abort error directly instead of
wrapping it as “deployment not found/not available.” Keep the existing
not-found/unavailable errors only for genuine readiness timeouts, using the same
`namespace`, `name`, `timeout`, `avail`, and `expected` context.

---

Outside diff comments:
In `@validators/conformance/gang_scheduling_check.go`:
- Around line 119-123: The error handling in the gang scheduling check is
overwriting the helper’s original error code. In `waitForDeploymentAvailable`
usage inside the scheduler check, keep the returned `err` as-is instead of
always wrapping it with `ErrCodeNotFound`. If you want to add the component
context from the scheduler check, use `PropagateOrWrap` so `ErrCodeNotFound`
stays only for missing deployments and other failures preserve their original
code.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 05abb91e-68f7-44fa-b8b4-57e54a9f52cf

📥 Commits

Reviewing files that changed from the base of the PR and between c1d2468 and 3629cb7.

📒 Files selected for processing (3)

validators/conformance/gang_scheduling_check.go
validators/conformance/helpers.go
validators/conformance/helpers_readiness_test.go

github-actions · 2026-06-28T16:18:39Z

Coverage Report ✅

Metric	Value
Coverage	77.9%
Threshold	75%
Status	Pass

Coverage Badge

![Coverage](https://img.shields.io/badge/coverage-77.9%25-green)

Coverage unchanged by this PR.

… wait Address CodeRabbit review: pollCtx derives from ctx.Ctx, so pollCtx.Err() is also non-nil when the parent is canceled — the post-poll branch would then misreport an external abort as 'deployment not available/not found after'. Check the parent context first and propagate cancellation as ErrCodeTimeout; only our own elapsed bound classifies the deployment as not-available. Adds a parent-canceled regression test. Signed-off-by: Mark Chmarny <[email protected]>

… caller Address CodeRabbit review: the KAI deployment loop wrapped every waitForDeploymentAvailable failure as ErrCodeNotFound, masking the helper's Internal (not-available/API failure) and Timeout (cancellation) codes as 'missing'. Use PropagateOrWrap so NotFound stays only for genuinely-absent deployments while other failures keep their code, retaining the component name in the message. Signed-off-by: Mark Chmarny <[email protected]>

mchmarny requested a review from a team as a code owner June 28, 2026 16:12

mchmarny added the theme/validation Constraint evaluation, health checks, and conformance evidence label Jun 28, 2026

mchmarny self-assigned this Jun 28, 2026

Merge branch 'main' into fix/gang-scheduling-readiness-wait

ef5f308

github-actions Bot added the size/M label Jun 28, 2026

coderabbitai Bot reviewed Jun 28, 2026

View reviewed changes

Comment thread validators/conformance/helpers.go

github-actions Bot added size/L and removed size/M labels Jun 28, 2026

mchmarny enabled auto-merge (squash) June 28, 2026 16:29

dims approved these changes Jun 28, 2026

View reviewed changes

mchmarny merged commit 4b24b7d into main Jun 28, 2026
31 checks passed

mchmarny deleted the fix/gang-scheduling-readiness-wait branch June 28, 2026 16:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(validator): wait for KAI deployments ready in gang-scheduling check#1514

fix(validator): wait for KAI deployments ready in gang-scheduling check#1514
mchmarny merged 4 commits into
mainfrom
fix/gang-scheduling-readiness-wait

mchmarny commented Jun 28, 2026

Uh oh!

coderabbitai Bot commented Jun 28, 2026 •

edited

Loading

Walkthrough

Estimated code review effort

Suggested labels

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

github-actions Bot commented Jun 28, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

mchmarny commented Jun 28, 2026

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

coderabbitai Bot commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

Suggested labels

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions Bot commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Coverage Report ✅

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 28, 2026 •

edited

Loading

github-actions Bot commented Jun 28, 2026 •

edited

Loading