fix(validator): pod-autoscaling passes on external-metric HPA path (DRA clusters)#1408
Conversation
📝 WalkthroughWalkthrough
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
2154c2c to
a2f8562
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@validators/conformance/pod_autoscaling_check.go`:
- Around line 155-164: When the polling loop exits due to context cancellation
via ctx.Ctx.Done() in the select statement, the code currently falls through to
subsequent logic that reports ErrCodeNotFound instead of properly indicating a
cancellation or timeout. To fix this, introduce a flag or status variable that
tracks whether the pollLoop exited due to context cancellation. Set this flag
when ctx.Ctx.Done() is selected in the select statement, then after the pollLoop
completes, check this flag and return an appropriate cancellation or timeout
error instead of allowing execution to continue and report ErrCodeNotFound.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: d1518207-886e-4d30-92bf-f88163f2d1e6
📒 Files selected for processing (2)
docs/user/validation.mdvalidators/conformance/pod_autoscaling_check.go
a2f8562 to
ba602e6
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@validators/conformance/pod_autoscaling_check.go`:
- Around line 155-158: The inline comment at lines 156-158 describing
cancellation behavior is outdated and does not match the actual control flow.
The comment states that context cancellation should result in a
"warn-and-continue" behavior, but the actual code at lines 171-174 hard-fails
with ErrCodeTimeout when ctx.Ctx is canceled. Update the comment to accurately
reflect that cancellation now causes a hard failure with ErrCodeTimeout instead
of the warn-and-continue behavior previously described.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 90c6fc33-42df-4b07-8c09-4fc3300e9399
📒 Files selected for processing (2)
docs/user/validation.mdvalidators/conformance/pod_autoscaling_check.go
The pod-autoscaling conformance check (CNCF #8b) hard-failed when pod-scoped GPU custom metrics (custom.metrics.k8s.io/.../pods/*/...) returned no data, with 'no GPU custom metrics available (DCGM -> Prometheus -> adapter pipeline broken)'. Those per-pod series require dcgm-exporter to attribute each GPU to its consuming pod via the kubelet pod-resources API, which only covers device-plugin (nvidia.com/gpu) allocations. DRA-claimed GPUs (nvidia-dra-driver-gpu, K8s 1.34+) are not attributed, so the series carry no exported_pod label and the adapter returns empty — failing the check on otherwise-healthy DRA clusters. The autoscaling capability is proven authoritatively by the external-metrics API plus the HPA behavioral scale-up/down test, which drive off a cluster-wide GPU metric (dcgm_gpu_power_usage) and need no per-pod attribution. Demote the pod-scoped custom-metrics probe to best-effort (warn + record artifact, then continue); steps 3 (external metrics) and 4 (HPA behavioral) remain hard, fail-closed gates. Verified on aicr3 (EKS H100, K8s 1.35, nvidia-dra-driver-gpu): external metrics returned data and the HPA path was healthy while pod-scoped metrics were empty.
ba602e6 to
abd90b4
Compare
mchmarny
left a comment
There was a problem hiding this comment.
LGTM. The gate demotion is done right: step 2 (pod-scoped GPU custom metrics) becomes best-effort warn-and-continue, but the authoritative capability proof — step 3 (external-metrics API has data) and step 4 (HPA behavioral scale-up/down on dcgm_gpu_power_usage) — stays fail-closed and unchanged. So this doesn't open a false-PASS hole; if anything the behavioral test is stronger evidence than a pod-scoped metric presence check ever was.
Verified the fail-closed paths hold:
- Context cancel/deadline still returns a timeout via the
if !found && ctx.Ctx.Err() != nilguard; the labeledbreak pollLoopcleanly separates that from plain attempt-exhaustion. - A genuinely broken DCGM→Prometheus→adapter pipeline on a device-plugin cluster is still caught downstream by step 3 (the external metric flows through the same pipeline), so the relaxation is scoped to the DRA pod-attribution gap as intended.
- Evidence artifact recorded in both branches — conformance audit trail preserved.
maxAttempts12→6 (60s) still covers one ~30s adapter relist on device-plugin clusters where the metric will appear; reasonable given it's no longer a gate.
godoc, inline comments, and the docs/user/validation.md row are all updated coherently, and the e2e (11/11 on a DRA EKS H100 cluster) backs the claim. Nothing blocking.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@validators/conformance/pod_autoscaling_check.go`:
- Around line 171-174: The context cancellation check in the condition is
coupled with the !found check, which allows canceled contexts to be
misclassified as missing metrics when found is true. Separate the context
cancellation check from the found check by first checking if ctx.Ctx.Err() is
not nil regardless of the found state, and return the timeout error immediately
when the context is canceled, then handle the !found case separately afterwards.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 4b3b1a9d-2af3-466b-a433-dc34cc98b157
📒 Files selected for processing (2)
docs/user/validation.mdvalidators/conformance/pod_autoscaling_check.go
| if !found && ctx.Ctx.Err() != nil { | ||
| return errors.Wrap(errors.ErrCodeTimeout, | ||
| "timed out waiting for GPU custom metrics", ctx.Ctx.Err()) | ||
| } |
There was a problem hiding this comment.
Handle canceled context regardless of found state.
Cancellation is only converted to ErrCodeTimeout when !found. If found == true and ctx.Ctx is already canceled, Line 199+ can return ErrCodeNotFound, misclassifying cancellation as missing metrics.
Suggested fix
- if !found && ctx.Ctx.Err() != nil {
+ if ctx.Ctx.Err() != nil {
return errors.Wrap(errors.ErrCodeTimeout,
"timed out waiting for GPU custom metrics", ctx.Ctx.Err())
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@validators/conformance/pod_autoscaling_check.go` around lines 171 - 174, The
context cancellation check in the condition is coupled with the !found check,
which allows canceled contexts to be misclassified as missing metrics when found
is true. Separate the context cancellation check from the found check by first
checking if ctx.Ctx.Err() is not nil regardless of the found state, and return
the timeout error immediately when the context is canceled, then handle the
!found case separately afterwards.
Summary
Make the
pod-autoscalingconformance check (CNCF #8b) pass on DRA-based GPU clusters by treating pod-scoped GPU custom metrics as best-effort, and proving the autoscaling capability via the external-metric HPA behavioral test instead.Motivation / Context
On DRA clusters the check hard-failed with
no GPU custom metrics available (DCGM → Prometheus → adapter pipeline broken). Step 2 required pod-scoped metrics (custom.metrics.k8s.io/.../pods/*/...), whoseexported_podassociation only exists when dcgm-exporter attributes a GPU to its consuming pod via the kubelet pod-resources API — which surfaces device-plugin (nvidia.com/gpu) allocations, not DRA claims (nvidia-dra-driver-gpu, K8s 1.34+). External metrics work (cluster-wide aggregates), which is the observed asymmetry. Not an AICR regression — the adapter rules (#168) are unchanged; it's a dcgm-exporter DRA pod-attribution gap surfaced by a brittle precondition.Fixes: #1407
Related: #1197
Type of Change
Component(s) Affected
pkg/validator)docs/,examples/)Implementation Notes
validators/conformance/pod_autoscaling_check.go: step 2 (pod-scoped GPU custom metrics) demoted from a hard gate to best-effort — on empty results it logs a warning and records an evidence artifact, then continues. Steps 3 (external-metrics API) and 4 (HPA behavioral scale-up/down, driven bydcgm_gpu_power_usage) remain hard, fail-closed gates and are the authoritative capability proof; they need no per-pod attribution.docs/user/validation.mdfeature description.Testing
End-to-end on aicr3 (EKS H100, K8s 1.35, nvidia-dra-driver-gpu) with a rebuilt conformance image: conformance 11/11 passed (
pod-autoscalingnow PASS via the external-metric HPA path); previously 10/11 withpod-autoscalingfailing. The HPA behavioral test deployed/scaled/cleaned its own namespace successfully on the DRA cluster.Risk Assessment
Rollout notes: No API/recipe surface change. Conformance pass/fail semantics: clusters that previously failed only on empty pod-scoped GPU metrics (DRA) now pass when the external-metric HPA path validates.
Checklist
make testwith-race) — N/A; check is live-cluster orchestration, no unit test exists for this file and no exported-function signature changedgolangci-lint, 0 issues)git commit -S)