fix(validator): wait for KAI deployments ready in gang-scheduling check#1514
Conversation
…ng check The gang-scheduling conformance check verified the 7 KAI deployments with an instantaneous availability sample (getDeploymentIfAvailable), so a transient 0/1 on the single-replica admission webhook (rollout, restart, node pressure) failed the whole conformance phase. Observed in AWS UAT run 28312496969 where admission was momentarily 0/1 while every other check passed. Add waitForDeploymentAvailable — a bounded poll (K8sPodReadyTimeout) that tolerates a transient blip but still fails closed when a deployment is genuinely down — and use it in CheckGangScheduling step 1. Fixes: #1513 Signed-off-by: Mark Chmarny <[email protected]>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthrough
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
validators/conformance/gang_scheduling_check.go (1)
119-123: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winPreserve the helper’s error code here.
waitForDeploymentAvailablealready usesErrCodeNotFoundonly for a missing deployment; a deployment that exists but stays unavailable, or any API failure, comes back asErrCodeInternal. Re-wrapping every failure here asErrCodeNotFoundmakes those cases look missing. Returnerrdirectly, or usePropagateOrWrapif you want to keep the component name in the message.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@validators/conformance/gang_scheduling_check.go` around lines 119 - 123, The error handling in the gang scheduling check is overwriting the helper’s original error code. In `waitForDeploymentAvailable` usage inside the scheduler check, keep the returned `err` as-is instead of always wrapping it with `ErrCodeNotFound`. If you want to add the component context from the scheduler check, use `PropagateOrWrap` so `ErrCodeNotFound` stays only for missing deployments and other failures preserve their original code.Source: Learnings
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@validators/conformance/helpers.go`:
- Around line 205-213: The readiness polling branch in the deployment helper is
treating caller cancellation the same as a deployment failure. In the helper
that checks `pollCtx.Err()`, first detect when the parent context (`ctx.Ctx`)
has been canceled and return that cancellation/abort error directly instead of
wrapping it as “deployment not found/not available.” Keep the existing
not-found/unavailable errors only for genuine readiness timeouts, using the same
`namespace`, `name`, `timeout`, `avail`, and `expected` context.
---
Outside diff comments:
In `@validators/conformance/gang_scheduling_check.go`:
- Around line 119-123: The error handling in the gang scheduling check is
overwriting the helper’s original error code. In `waitForDeploymentAvailable`
usage inside the scheduler check, keep the returned `err` as-is instead of
always wrapping it with `ErrCodeNotFound`. If you want to add the component
context from the scheduler check, use `PropagateOrWrap` so `ErrCodeNotFound`
stays only for missing deployments and other failures preserve their original
code.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 05abb91e-68f7-44fa-b8b4-57e54a9f52cf
📒 Files selected for processing (3)
validators/conformance/gang_scheduling_check.govalidators/conformance/helpers.govalidators/conformance/helpers_readiness_test.go
Coverage Report ✅
Coverage BadgeCoverage unchanged by this PR. |
… wait Address CodeRabbit review: pollCtx derives from ctx.Ctx, so pollCtx.Err() is also non-nil when the parent is canceled — the post-poll branch would then misreport an external abort as 'deployment not available/not found after'. Check the parent context first and propagate cancellation as ErrCodeTimeout; only our own elapsed bound classifies the deployment as not-available. Adds a parent-canceled regression test. Signed-off-by: Mark Chmarny <[email protected]>
… caller Address CodeRabbit review: the KAI deployment loop wrapped every waitForDeploymentAvailable failure as ErrCodeNotFound, masking the helper's Internal (not-available/API failure) and Timeout (cancellation) codes as 'missing'. Use PropagateOrWrap so NotFound stays only for genuinely-absent deployments while other failures keep their code, retaining the component name in the message. Signed-off-by: Mark Chmarny <[email protected]>
Summary
Makes the
gang-schedulingconformance check wait (bounded) for the KAI scheduler deployments to become Available, instead of sampling them instantaneously — fixing a flake that failed the whole conformance phase on a transient single-replica blip.Motivation / Context
AWS UAT run 28312496969 failed with:
gang-schedulingwas the only failing check (13 others passed; evidence emission succeeded), but--fail-on-errorturned it into exit 8.Root cause:
CheckGangSchedulingstep 1 verified the 7 required KAI deployments withgetDeploymentIfAvailable— a singleGetassertingAvailableReplicas >= 1, no wait/retry. The single-replicaadmissionwebhook can be transiently unavailable (rollout, restart, node pressure), so a point-in-time read flakes the whole phase. Every other long-running readiness path in the package uses a boundedwait.PollUntilContextCancel; this one didn't. (Ruled out API drift: KAI v0.14.1 still servesscheduling.run.ai/v2alpha2PodGroups, matching the validator's GVR.)Fixes: #1513
Type of Change
Component(s) Affected
pkg/validator) — conformance validators (validators/conformance)Implementation Notes
waitForDeploymentAvailable(ctx, namespace, name, timeout)helper: bounded poll (defaults.PodPollInterval) untilAvailableReplicas >= 1ortimeout. A not-yet-created deployment is "keep waiting" within the bound; a genuine outage still fails closed after the bound (NotFound vs not-available distinguished).CheckGangSchedulingstep 1 now uses it withdefaults.K8sPodReadyTimeout(2m) instead of the instantaneous check. The functional PodGroup co-scheduling test is unchanged.Testing
Added
TestWaitForDeploymentAvailable_{ImmediatelyReady,TransientBlip,NeverReady,NotFound}. TheTransientBlipcase is the regression: a deployment reporting 0/1 on the first poll then Available re-polls and passes. Fullvalidators/conformancesuite passes;go build ./...clean; golangci-lint 0 issues.Risk Assessment
Rollout notes: None. No flags, APIs, or recipe changes.
Checklist
make testwith-race)make lint)git commit -S)