fix(validator): pod-autoscaling passes on external-metric HPA path (DRA clusters) by yuanchen8911 · Pull Request #1408 · NVIDIA/aicr

yuanchen8911 · 2026-06-22T18:05:51Z

Summary

Make the pod-autoscaling conformance check (CNCF #8b) pass on DRA-based GPU clusters by treating pod-scoped GPU custom metrics as best-effort, and proving the autoscaling capability via the external-metric HPA behavioral test instead.

Motivation / Context

On DRA clusters the check hard-failed with no GPU custom metrics available (DCGM → Prometheus → adapter pipeline broken). Step 2 required pod-scoped metrics (custom.metrics.k8s.io/.../pods/*/...), whose exported_pod association only exists when dcgm-exporter attributes a GPU to its consuming pod via the kubelet pod-resources API — which surfaces device-plugin (nvidia.com/gpu) allocations, not DRA claims (nvidia-dra-driver-gpu, K8s 1.34+). External metrics work (cluster-wide aggregates), which is the observed asymmetry. Not an AICR regression — the adapter rules (#168) are unchanged; it's a dcgm-exporter DRA pod-attribution gap surfaced by a brittle precondition.

Fixes: #1407
Related: #1197

Type of Change

Bug fix (non-breaking change that fixes an issue)
Documentation update

Component(s) Affected

Validator (pkg/validator)
Docs/examples (docs/, examples/)

Implementation Notes

validators/conformance/pod_autoscaling_check.go: step 2 (pod-scoped GPU custom metrics) demoted from a hard gate to best-effort — on empty results it logs a warning and records an evidence artifact, then continues. Steps 3 (external-metrics API) and 4 (HPA behavioral scale-up/down, driven by dcgm_gpu_power_usage) remain hard, fail-closed gates and are the authoritative capability proof; they need no per-pod attribution.
Updated the function godoc and the docs/user/validation.md feature description.

Testing

golangci-lint run -c .golangci.yaml ./validators/conformance/...   # 0 issues
go build ./validators/conformance/...                              # ok

End-to-end on aicr3 (EKS H100, K8s 1.35, nvidia-dra-driver-gpu) with a rebuilt conformance image: conformance 11/11 passed (pod-autoscaling now PASS via the external-metric HPA path); previously 10/11 with pod-autoscaling failing. The HPA behavioral test deployed/scaled/cleaned its own namespace successfully on the DRA cluster.

Risk Assessment

Low — Relaxes one brittle precondition in a single conformance check; the authoritative HPA behavioral gate is unchanged and still fail-closed. Easy to revert.

Rollout notes: No API/recipe surface change. Conformance pass/fail semantics: clusters that previously failed only on empty pod-scoped GPU metrics (DRA) now pass when the external-metric HPA path validates.

Checklist

Tests pass locally (make test with -race) — N/A; check is live-cluster orchestration, no unit test exists for this file and no exported-function signature changed
Linter passes (golangci-lint, 0 issues)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality — N/A (no unit-testable surface added)
I updated docs if user-facing behavior changed
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

github-actions · 2026-06-22T18:06:58Z

🌿 Preview your docs: https://nvidia-preview-fix-pod-autoscaling-dra-aware.docs.buildwithfern.com/aicr

coderabbitai · 2026-06-22T18:08:43Z

📝 Walkthrough

Walkthrough

CheckPodAutoscaling in validators/conformance/pod_autoscaling_check.go is updated so that absence of pod-scoped GPU custom metrics no longer returns ErrCodeNotFound. Instead, the function emits a warning log and records a "not available" artifact, then continues to the external-metrics API and HPA behavioral scale-up/down test, which remain hard gates. Expanded code comments document the DRA pod-attribution gap as the reason, and the polling budget is adjusted to a shorter one-relist-cycle window. The pod-autoscaling feature row in docs/user/validation.md is updated to reflect the same behavior: external GPU metric HPA, behavioral test, and best-effort pod-scoped evidence.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: fixing the pod-autoscaling validator to pass on DRA clusters via the external-metric HPA path, matching the key modification in the code.
Description check	✅ Passed	The description is well-detailed, clearly explaining the motivation (DRA clusters failing on pod-scoped metrics), the fix (demoting pod-scoped metrics to best-effort), and the validation (11/11 conformance tests passing).
Linked Issues check	✅ Passed	The PR directly addresses issue `#1407` by demoting pod-scoped custom metrics to best-effort and relying on external-metrics API and HPA behavioral tests [`#1407`], matching all stated objectives.
Out of Scope Changes check	✅ Passed	All changes are scoped to the pod-autoscaling conformance check logic and documentation; no unrelated modifications or scope creep detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@validators/conformance/pod_autoscaling_check.go`:
- Around line 155-164: When the polling loop exits due to context cancellation
via ctx.Ctx.Done() in the select statement, the code currently falls through to
subsequent logic that reports ErrCodeNotFound instead of properly indicating a
cancellation or timeout. To fix this, introduce a flag or status variable that
tracks whether the pollLoop exited due to context cancellation. Set this flag
when ctx.Ctx.Done() is selected in the select statement, then after the pollLoop
completes, check this flag and return an appropriate cancellation or timeout
error instead of allowing execution to continue and report ErrCodeNotFound.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: d1518207-886e-4d30-92bf-f88163f2d1e6

📥 Commits

Reviewing files that changed from the base of the PR and between 2154c2c and a2f8562.

📒 Files selected for processing (2)

docs/user/validation.md
validators/conformance/pod_autoscaling_check.go

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@validators/conformance/pod_autoscaling_check.go`:
- Around line 155-158: The inline comment at lines 156-158 describing
cancellation behavior is outdated and does not match the actual control flow.
The comment states that context cancellation should result in a
"warn-and-continue" behavior, but the actual code at lines 171-174 hard-fails
with ErrCodeTimeout when ctx.Ctx is canceled. Update the comment to accurately
reflect that cancellation now causes a hard failure with ErrCodeTimeout instead
of the warn-and-continue behavior previously described.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 90c6fc33-42df-4b07-8c09-4fc3300e9399

📥 Commits

Reviewing files that changed from the base of the PR and between a2f8562 and ba602e6.

📒 Files selected for processing (2)

docs/user/validation.md
validators/conformance/pod_autoscaling_check.go

The pod-autoscaling conformance check (CNCF #8b) hard-failed when pod-scoped GPU custom metrics (custom.metrics.k8s.io/.../pods/*/...) returned no data, with 'no GPU custom metrics available (DCGM -> Prometheus -> adapter pipeline broken)'. Those per-pod series require dcgm-exporter to attribute each GPU to its consuming pod via the kubelet pod-resources API, which only covers device-plugin (nvidia.com/gpu) allocations. DRA-claimed GPUs (nvidia-dra-driver-gpu, K8s 1.34+) are not attributed, so the series carry no exported_pod label and the adapter returns empty — failing the check on otherwise-healthy DRA clusters. The autoscaling capability is proven authoritatively by the external-metrics API plus the HPA behavioral scale-up/down test, which drive off a cluster-wide GPU metric (dcgm_gpu_power_usage) and need no per-pod attribution. Demote the pod-scoped custom-metrics probe to best-effort (warn + record artifact, then continue); steps 3 (external metrics) and 4 (HPA behavioral) remain hard, fail-closed gates. Verified on aicr3 (EKS H100, K8s 1.35, nvidia-dra-driver-gpu): external metrics returned data and the HPA path was healthy while pod-scoped metrics were empty.

mchmarny

LGTM. The gate demotion is done right: step 2 (pod-scoped GPU custom metrics) becomes best-effort warn-and-continue, but the authoritative capability proof — step 3 (external-metrics API has data) and step 4 (HPA behavioral scale-up/down on dcgm_gpu_power_usage) — stays fail-closed and unchanged. So this doesn't open a false-PASS hole; if anything the behavioral test is stronger evidence than a pod-scoped metric presence check ever was.

Verified the fail-closed paths hold:

Context cancel/deadline still returns a timeout via the if !found && ctx.Ctx.Err() != nil guard; the labeled break pollLoop cleanly separates that from plain attempt-exhaustion.
A genuinely broken DCGM→Prometheus→adapter pipeline on a device-plugin cluster is still caught downstream by step 3 (the external metric flows through the same pipeline), so the relaxation is scoped to the DRA pod-attribution gap as intended.
Evidence artifact recorded in both branches — conformance audit trail preserved.
maxAttempts 12→6 (60s) still covers one ~30s adapter relist on device-plugin clusters where the metric will appear; reasonable given it's no longer a gate.

godoc, inline comments, and the docs/user/validation.md row are all updated coherently, and the e2e (11/11 on a DRA EKS H100 cluster) backs the claim. Nothing blocking.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@validators/conformance/pod_autoscaling_check.go`:
- Around line 171-174: The context cancellation check in the condition is
coupled with the !found check, which allows canceled contexts to be
misclassified as missing metrics when found is true. Separate the context
cancellation check from the found check by first checking if ctx.Ctx.Err() is
not nil regardless of the found state, and return the timeout error immediately
when the context is canceled, then handle the !found case separately afterwards.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 4b3b1a9d-2af3-466b-a433-dc34cc98b157

📥 Commits

Reviewing files that changed from the base of the PR and between ba602e6 and abd90b4.

📒 Files selected for processing (2)

docs/user/validation.md
validators/conformance/pod_autoscaling_check.go

coderabbitai · 2026-06-22T18:37:44Z

+	if !found && ctx.Ctx.Err() != nil {
+		return errors.Wrap(errors.ErrCodeTimeout,
+			"timed out waiting for GPU custom metrics", ctx.Ctx.Err())
+	}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle canceled context regardless of found state.

Cancellation is only converted to ErrCodeTimeout when !found. If found == true and ctx.Ctx is already canceled, Line 199+ can return ErrCodeNotFound, misclassifying cancellation as missing metrics.

Suggested fix

- if !found && ctx.Ctx.Err() != nil { + if ctx.Ctx.Err() != nil { return errors.Wrap(errors.ErrCodeTimeout, "timed out waiting for GPU custom metrics", ctx.Ctx.Err()) }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@validators/conformance/pod_autoscaling_check.go` around lines 171 - 174, The context cancellation check in the condition is coupled with the !found check, which allows canceled contexts to be misclassified as missing metrics when found is true. Separate the context cancellation check from the found check by first checking if ctx.Ctx.Err() is not nil regardless of the found state, and return the timeout error immediately when the context is canceled, then handle the !found case separately afterwards.

yuanchen8911 requested a review from a team as a code owner June 22, 2026 18:05

yuanchen8911 added area/validator area/docs labels Jun 22, 2026

github-actions Bot added size/S and removed area/validator labels Jun 22, 2026

yuanchen8911 requested a review from mchmarny June 22, 2026 18:10

yuanchen8911 force-pushed the fix/pod-autoscaling-dra-aware branch from 2154c2c to a2f8562 Compare June 22, 2026 18:16

github-actions Bot added size/M and removed size/S labels Jun 22, 2026

coderabbitai Bot reviewed Jun 22, 2026

View reviewed changes

Comment thread validators/conformance/pod_autoscaling_check.go

yuanchen8911 force-pushed the fix/pod-autoscaling-dra-aware branch from a2f8562 to ba602e6 Compare June 22, 2026 18:21

coderabbitai Bot reviewed Jun 22, 2026

View reviewed changes

Comment thread validators/conformance/pod_autoscaling_check.go Outdated

yuanchen8911 force-pushed the fix/pod-autoscaling-dra-aware branch from ba602e6 to abd90b4 Compare June 22, 2026 18:34

mchmarny approved these changes Jun 22, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 22, 2026

View reviewed changes

yuanchen8911 enabled auto-merge (squash) June 22, 2026 18:37

yuanchen8911 merged commit 410e31a into NVIDIA:main Jun 22, 2026
32 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(validator): pod-autoscaling passes on external-metric HPA path (DRA clusters)#1408

fix(validator): pod-autoscaling passes on external-metric HPA path (DRA clusters)#1408
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/pod-autoscaling-dra-aware

yuanchen8911 commented Jun 22, 2026

Uh oh!

github-actions Bot commented Jun 22, 2026

Uh oh!

coderabbitai Bot commented Jun 22, 2026 •

edited

Loading

Walkthrough

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

mchmarny left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

yuanchen8911 commented Jun 22, 2026

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

github-actions Bot commented Jun 22, 2026

Uh oh!

coderabbitai Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mchmarny left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 22, 2026 •

edited

Loading