Skip to content

fix(validator): pod-autoscaling passes on external-metric HPA path (DRA clusters)#1408

Merged
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/pod-autoscaling-dra-aware
Jun 22, 2026
Merged

fix(validator): pod-autoscaling passes on external-metric HPA path (DRA clusters)#1408
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/pod-autoscaling-dra-aware

Conversation

@yuanchen8911

Copy link
Copy Markdown
Contributor

Summary

Make the pod-autoscaling conformance check (CNCF #8b) pass on DRA-based GPU clusters by treating pod-scoped GPU custom metrics as best-effort, and proving the autoscaling capability via the external-metric HPA behavioral test instead.

Motivation / Context

On DRA clusters the check hard-failed with no GPU custom metrics available (DCGM → Prometheus → adapter pipeline broken). Step 2 required pod-scoped metrics (custom.metrics.k8s.io/.../pods/*/...), whose exported_pod association only exists when dcgm-exporter attributes a GPU to its consuming pod via the kubelet pod-resources API — which surfaces device-plugin (nvidia.com/gpu) allocations, not DRA claims (nvidia-dra-driver-gpu, K8s 1.34+). External metrics work (cluster-wide aggregates), which is the observed asymmetry. Not an AICR regression — the adapter rules (#168) are unchanged; it's a dcgm-exporter DRA pod-attribution gap surfaced by a brittle precondition.

Fixes: #1407
Related: #1197

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • Documentation update

Component(s) Affected

  • Validator (pkg/validator)
  • Docs/examples (docs/, examples/)

Implementation Notes

  • validators/conformance/pod_autoscaling_check.go: step 2 (pod-scoped GPU custom metrics) demoted from a hard gate to best-effort — on empty results it logs a warning and records an evidence artifact, then continues. Steps 3 (external-metrics API) and 4 (HPA behavioral scale-up/down, driven by dcgm_gpu_power_usage) remain hard, fail-closed gates and are the authoritative capability proof; they need no per-pod attribution.
  • Updated the function godoc and the docs/user/validation.md feature description.

Testing

golangci-lint run -c .golangci.yaml ./validators/conformance/...   # 0 issues
go build ./validators/conformance/...                              # ok

End-to-end on aicr3 (EKS H100, K8s 1.35, nvidia-dra-driver-gpu) with a rebuilt conformance image: conformance 11/11 passed (pod-autoscaling now PASS via the external-metric HPA path); previously 10/11 with pod-autoscaling failing. The HPA behavioral test deployed/scaled/cleaned its own namespace successfully on the DRA cluster.

Risk Assessment

  • Low — Relaxes one brittle precondition in a single conformance check; the authoritative HPA behavioral gate is unchanged and still fail-closed. Easy to revert.

Rollout notes: No API/recipe surface change. Conformance pass/fail semantics: clusters that previously failed only on empty pod-scoped GPU metrics (DRA) now pass when the external-metric HPA path validates.

Checklist

  • Tests pass locally (make test with -race) — N/A; check is live-cluster orchestration, no unit test exists for this file and no exported-function signature changed
  • Linter passes (golangci-lint, 0 issues)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — N/A (no unit-testable surface added)
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@github-actions

Copy link
Copy Markdown
Contributor

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

CheckPodAutoscaling in validators/conformance/pod_autoscaling_check.go is updated so that absence of pod-scoped GPU custom metrics no longer returns ErrCodeNotFound. Instead, the function emits a warning log and records a "not available" artifact, then continues to the external-metrics API and HPA behavioral scale-up/down test, which remain hard gates. Expanded code comments document the DRA pod-attribution gap as the reason, and the polling budget is adjusted to a shorter one-relist-cycle window. The pod-autoscaling feature row in docs/user/validation.md is updated to reflect the same behavior: external GPU metric HPA, behavioral test, and best-effort pod-scoped evidence.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: fixing the pod-autoscaling validator to pass on DRA clusters via the external-metric HPA path, matching the key modification in the code.
Description check ✅ Passed The description is well-detailed, clearly explaining the motivation (DRA clusters failing on pod-scoped metrics), the fix (demoting pod-scoped metrics to best-effort), and the validation (11/11 conformance tests passing).
Linked Issues check ✅ Passed The PR directly addresses issue #1407 by demoting pod-scoped custom metrics to best-effort and relying on external-metrics API and HPA behavioral tests [#1407], matching all stated objectives.
Out of Scope Changes check ✅ Passed All changes are scoped to the pod-autoscaling conformance check logic and documentation; no unrelated modifications or scope creep detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@yuanchen8911 yuanchen8911 requested a review from mchmarny June 22, 2026 18:10
@yuanchen8911 yuanchen8911 force-pushed the fix/pod-autoscaling-dra-aware branch from 2154c2c to a2f8562 Compare June 22, 2026 18:16
@github-actions github-actions Bot added size/M and removed size/S labels Jun 22, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@validators/conformance/pod_autoscaling_check.go`:
- Around line 155-164: When the polling loop exits due to context cancellation
via ctx.Ctx.Done() in the select statement, the code currently falls through to
subsequent logic that reports ErrCodeNotFound instead of properly indicating a
cancellation or timeout. To fix this, introduce a flag or status variable that
tracks whether the pollLoop exited due to context cancellation. Set this flag
when ctx.Ctx.Done() is selected in the select statement, then after the pollLoop
completes, check this flag and return an appropriate cancellation or timeout
error instead of allowing execution to continue and report ErrCodeNotFound.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: d1518207-886e-4d30-92bf-f88163f2d1e6

📥 Commits

Reviewing files that changed from the base of the PR and between 2154c2c and a2f8562.

📒 Files selected for processing (2)
  • docs/user/validation.md
  • validators/conformance/pod_autoscaling_check.go

Comment thread validators/conformance/pod_autoscaling_check.go
@yuanchen8911 yuanchen8911 force-pushed the fix/pod-autoscaling-dra-aware branch from a2f8562 to ba602e6 Compare June 22, 2026 18:21

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@validators/conformance/pod_autoscaling_check.go`:
- Around line 155-158: The inline comment at lines 156-158 describing
cancellation behavior is outdated and does not match the actual control flow.
The comment states that context cancellation should result in a
"warn-and-continue" behavior, but the actual code at lines 171-174 hard-fails
with ErrCodeTimeout when ctx.Ctx is canceled. Update the comment to accurately
reflect that cancellation now causes a hard failure with ErrCodeTimeout instead
of the warn-and-continue behavior previously described.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 90c6fc33-42df-4b07-8c09-4fc3300e9399

📥 Commits

Reviewing files that changed from the base of the PR and between a2f8562 and ba602e6.

📒 Files selected for processing (2)
  • docs/user/validation.md
  • validators/conformance/pod_autoscaling_check.go

Comment thread validators/conformance/pod_autoscaling_check.go Outdated
The pod-autoscaling conformance check (CNCF #8b) hard-failed when pod-scoped
GPU custom metrics (custom.metrics.k8s.io/.../pods/*/...) returned no data,
with 'no GPU custom metrics available (DCGM -> Prometheus -> adapter pipeline
broken)'. Those per-pod series require dcgm-exporter to attribute each GPU to
its consuming pod via the kubelet pod-resources API, which only covers
device-plugin (nvidia.com/gpu) allocations. DRA-claimed GPUs
(nvidia-dra-driver-gpu, K8s 1.34+) are not attributed, so the series carry no
exported_pod label and the adapter returns empty — failing the check on
otherwise-healthy DRA clusters.

The autoscaling capability is proven authoritatively by the external-metrics
API plus the HPA behavioral scale-up/down test, which drive off a cluster-wide
GPU metric (dcgm_gpu_power_usage) and need no per-pod attribution. Demote the
pod-scoped custom-metrics probe to best-effort (warn + record artifact, then
continue); steps 3 (external metrics) and 4 (HPA behavioral) remain hard,
fail-closed gates.

Verified on aicr3 (EKS H100, K8s 1.35, nvidia-dra-driver-gpu): external
metrics returned data and the HPA path was healthy while pod-scoped metrics
were empty.
@yuanchen8911 yuanchen8911 force-pushed the fix/pod-autoscaling-dra-aware branch from ba602e6 to abd90b4 Compare June 22, 2026 18:34

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The gate demotion is done right: step 2 (pod-scoped GPU custom metrics) becomes best-effort warn-and-continue, but the authoritative capability proof — step 3 (external-metrics API has data) and step 4 (HPA behavioral scale-up/down on dcgm_gpu_power_usage) — stays fail-closed and unchanged. So this doesn't open a false-PASS hole; if anything the behavioral test is stronger evidence than a pod-scoped metric presence check ever was.

Verified the fail-closed paths hold:

  • Context cancel/deadline still returns a timeout via the if !found && ctx.Ctx.Err() != nil guard; the labeled break pollLoop cleanly separates that from plain attempt-exhaustion.
  • A genuinely broken DCGM→Prometheus→adapter pipeline on a device-plugin cluster is still caught downstream by step 3 (the external metric flows through the same pipeline), so the relaxation is scoped to the DRA pod-attribution gap as intended.
  • Evidence artifact recorded in both branches — conformance audit trail preserved.
  • maxAttempts 12→6 (60s) still covers one ~30s adapter relist on device-plugin clusters where the metric will appear; reasonable given it's no longer a gate.

godoc, inline comments, and the docs/user/validation.md row are all updated coherently, and the e2e (11/11 on a DRA EKS H100 cluster) backs the claim. Nothing blocking.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@validators/conformance/pod_autoscaling_check.go`:
- Around line 171-174: The context cancellation check in the condition is
coupled with the !found check, which allows canceled contexts to be
misclassified as missing metrics when found is true. Separate the context
cancellation check from the found check by first checking if ctx.Ctx.Err() is
not nil regardless of the found state, and return the timeout error immediately
when the context is canceled, then handle the !found case separately afterwards.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 4b3b1a9d-2af3-466b-a433-dc34cc98b157

📥 Commits

Reviewing files that changed from the base of the PR and between ba602e6 and abd90b4.

📒 Files selected for processing (2)
  • docs/user/validation.md
  • validators/conformance/pod_autoscaling_check.go

Comment on lines +171 to +174
if !found && ctx.Ctx.Err() != nil {
return errors.Wrap(errors.ErrCodeTimeout,
"timed out waiting for GPU custom metrics", ctx.Ctx.Err())
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle canceled context regardless of found state.

Cancellation is only converted to ErrCodeTimeout when !found. If found == true and ctx.Ctx is already canceled, Line 199+ can return ErrCodeNotFound, misclassifying cancellation as missing metrics.

Suggested fix
-	if !found && ctx.Ctx.Err() != nil {
+	if ctx.Ctx.Err() != nil {
 		return errors.Wrap(errors.ErrCodeTimeout,
 			"timed out waiting for GPU custom metrics", ctx.Ctx.Err())
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@validators/conformance/pod_autoscaling_check.go` around lines 171 - 174, The
context cancellation check in the condition is coupled with the !found check,
which allows canceled contexts to be misclassified as missing metrics when found
is true. Separate the context cancellation check from the found check by first
checking if ctx.Ctx.Err() is not nil regardless of the found state, and return
the timeout error immediately when the context is canceled, then handle the
!found case separately afterwards.

@yuanchen8911 yuanchen8911 enabled auto-merge (squash) June 22, 2026 18:37
@yuanchen8911 yuanchen8911 merged commit 410e31a into NVIDIA:main Jun 22, 2026
32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

pod-autoscaling conformance fails on DRA clusters (empty pod-scoped GPU custom metrics)

2 participants