Skip to content

fix(validator): wait for KAI deployments ready in gang-scheduling check#1514

Merged
mchmarny merged 4 commits into
mainfrom
fix/gang-scheduling-readiness-wait
Jun 28, 2026
Merged

fix(validator): wait for KAI deployments ready in gang-scheduling check#1514
mchmarny merged 4 commits into
mainfrom
fix/gang-scheduling-readiness-wait

Conversation

@mchmarny

Copy link
Copy Markdown
Member

Summary

Makes the gang-scheduling conformance check wait (bounded) for the KAI scheduler deployments to become Available, instead of sampling them instantaneously — fixing a flake that failed the whole conformance phase on a transient single-replica blip.

Motivation / Context

AWS UAT run 28312496969 failed with:

[NOT_FOUND] KAI scheduler component admission check failed:
[INTERNAL] deployment kai-scheduler/admission not available: 0/1 replicas

gang-scheduling was the only failing check (13 others passed; evidence emission succeeded), but --fail-on-error turned it into exit 8.

Root cause: CheckGangScheduling step 1 verified the 7 required KAI deployments with getDeploymentIfAvailable — a single Get asserting AvailableReplicas >= 1, no wait/retry. The single-replica admission webhook can be transiently unavailable (rollout, restart, node pressure), so a point-in-time read flakes the whole phase. Every other long-running readiness path in the package uses a bounded wait.PollUntilContextCancel; this one didn't. (Ruled out API drift: KAI v0.14.1 still serves scheduling.run.ai/v2alpha2 PodGroups, matching the validator's GVR.)

Fixes: #1513

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Validator (pkg/validator) — conformance validators (validators/conformance)

Implementation Notes

  • New waitForDeploymentAvailable(ctx, namespace, name, timeout) helper: bounded poll (defaults.PodPollInterval) until AvailableReplicas >= 1 or timeout. A not-yet-created deployment is "keep waiting" within the bound; a genuine outage still fails closed after the bound (NotFound vs not-available distinguished).
  • CheckGangScheduling step 1 now uses it with defaults.K8sPodReadyTimeout (2m) instead of the instantaneous check. The functional PodGroup co-scheduling test is unchanged.

Testing

make qualify

Added TestWaitForDeploymentAvailable_{ImmediatelyReady,TransientBlip,NeverReady,NotFound}. The TransientBlip case is the regression: a deployment reporting 0/1 on the first poll then Available re-polls and passes. Full validators/conformance suite passes; go build ./... clean; golangci-lint 0 issues.

Risk Assessment

  • Low — Adds a bounded readiness wait on an existing check; happy path is unchanged (returns immediately when Available); only effect is tolerating a transient blip before failing.

Rollout notes: None. No flags, APIs, or recipe changes.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed (no user-facing change)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

…ng check

The gang-scheduling conformance check verified the 7 KAI deployments with an
instantaneous availability sample (getDeploymentIfAvailable), so a transient
0/1 on the single-replica admission webhook (rollout, restart, node pressure)
failed the whole conformance phase. Observed in AWS UAT run 28312496969 where
admission was momentarily 0/1 while every other check passed.

Add waitForDeploymentAvailable — a bounded poll (K8sPodReadyTimeout) that
tolerates a transient blip but still fails closed when a deployment is
genuinely down — and use it in CheckGangScheduling step 1.

Fixes: #1513
Signed-off-by: Mark Chmarny <[email protected]>
@mchmarny mchmarny requested a review from a team as a code owner June 28, 2026 16:12
@mchmarny mchmarny added the theme/validation Constraint evaluation, health checks, and conformance evidence label Jun 28, 2026
@mchmarny mchmarny self-assigned this Jun 28, 2026
@coderabbitai

coderabbitai Bot commented Jun 28, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: bd0fac1f-1478-4dd4-8b19-c50f5fff7c24

📥 Commits

Reviewing files that changed from the base of the PR and between f4143e8 and 65c9a95.

📒 Files selected for processing (1)
  • validators/conformance/gang_scheduling_check.go

📝 Walkthrough

Walkthrough

CheckGangScheduling step 1 now polls each KAI scheduler deployment with waitForDeploymentAvailable instead of doing a single availability GET. The helper waits until Status.AvailableReplicas >= 1 or a timeout expires, and it distinguishes never-seen deployments, unavailable deployments, cancellation, and polling failures. New tests cover immediate readiness, transient recovery, timeout, cancellation, and missing deployment cases.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

area/tests

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: waiting for KAI deployments to become ready in the gang-scheduling check.
Description check ✅ Passed The description is directly about the gang-scheduling readiness fix and matches the code changes.
Linked Issues check ✅ Passed The PR implements the bounded readiness wait, preserves failure distinctions, and adds tests for the required transient and failure cases.
Out of Scope Changes check ✅ Passed The changes stay focused on the validator helper, its call site, and related tests with no obvious unrelated additions.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/gang-scheduling-readiness-wait

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
validators/conformance/gang_scheduling_check.go (1)

119-123: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Preserve the helper’s error code here. waitForDeploymentAvailable already uses ErrCodeNotFound only for a missing deployment; a deployment that exists but stays unavailable, or any API failure, comes back as ErrCodeInternal. Re-wrapping every failure here as ErrCodeNotFound makes those cases look missing. Return err directly, or use PropagateOrWrap if you want to keep the component name in the message.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@validators/conformance/gang_scheduling_check.go` around lines 119 - 123, The
error handling in the gang scheduling check is overwriting the helper’s original
error code. In `waitForDeploymentAvailable` usage inside the scheduler check,
keep the returned `err` as-is instead of always wrapping it with
`ErrCodeNotFound`. If you want to add the component context from the scheduler
check, use `PropagateOrWrap` so `ErrCodeNotFound` stays only for missing
deployments and other failures preserve their original code.

Source: Learnings

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@validators/conformance/helpers.go`:
- Around line 205-213: The readiness polling branch in the deployment helper is
treating caller cancellation the same as a deployment failure. In the helper
that checks `pollCtx.Err()`, first detect when the parent context (`ctx.Ctx`)
has been canceled and return that cancellation/abort error directly instead of
wrapping it as “deployment not found/not available.” Keep the existing
not-found/unavailable errors only for genuine readiness timeouts, using the same
`namespace`, `name`, `timeout`, `avail`, and `expected` context.

---

Outside diff comments:
In `@validators/conformance/gang_scheduling_check.go`:
- Around line 119-123: The error handling in the gang scheduling check is
overwriting the helper’s original error code. In `waitForDeploymentAvailable`
usage inside the scheduler check, keep the returned `err` as-is instead of
always wrapping it with `ErrCodeNotFound`. If you want to add the component
context from the scheduler check, use `PropagateOrWrap` so `ErrCodeNotFound`
stays only for missing deployments and other failures preserve their original
code.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 05abb91e-68f7-44fa-b8b4-57e54a9f52cf

📥 Commits

Reviewing files that changed from the base of the PR and between c1d2468 and 3629cb7.

📒 Files selected for processing (3)
  • validators/conformance/gang_scheduling_check.go
  • validators/conformance/helpers.go
  • validators/conformance/helpers_readiness_test.go

Comment thread validators/conformance/helpers.go
@github-actions

github-actions Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Coverage Report ✅

Metric Value
Coverage 77.9%
Threshold 75%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-77.9%25-green)

Coverage unchanged by this PR.

… wait

Address CodeRabbit review: pollCtx derives from ctx.Ctx, so pollCtx.Err() is
also non-nil when the parent is canceled — the post-poll branch would then
misreport an external abort as 'deployment not available/not found after'.
Check the parent context first and propagate cancellation as ErrCodeTimeout;
only our own elapsed bound classifies the deployment as not-available. Adds a
parent-canceled regression test.

Signed-off-by: Mark Chmarny <[email protected]>
@github-actions github-actions Bot added size/L and removed size/M labels Jun 28, 2026
… caller

Address CodeRabbit review: the KAI deployment loop wrapped every
waitForDeploymentAvailable failure as ErrCodeNotFound, masking the helper's
Internal (not-available/API failure) and Timeout (cancellation) codes as
'missing'. Use PropagateOrWrap so NotFound stays only for genuinely-absent
deployments while other failures keep their code, retaining the component
name in the message.

Signed-off-by: Mark Chmarny <[email protected]>
@mchmarny mchmarny enabled auto-merge (squash) June 28, 2026 16:29
@mchmarny mchmarny merged commit 4b24b7d into main Jun 28, 2026
31 checks passed
@mchmarny mchmarny deleted the fix/gang-scheduling-readiness-wait branch June 28, 2026 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/L theme/validation Constraint evaluation, health checks, and conformance evidence

Projects

None yet

Development

Successfully merging this pull request may close these issues.

gang-scheduling conformance check flakes on transient KAI admission unavailability

2 participants