feat(ci): add metrics-driven cluster autoscaling validation with Karpenter + KWOK by dims · Pull Request #168 · NVIDIA/aicr

dims · 2026-02-20T16:07:50Z

Summary

Add CNCF AI Conformance #8a (cluster_autoscaling) validation to both H100 GPU CI workflows
(inference and training). Validates the full metrics-driven autoscaling chain end-to-end:

DCGM metrics → Prometheus → prometheus-adapter (external metric) → HPA scales Deployment
  → pending GPU pods → Karpenter provisions KWOK nodes → pods schedule → consolidation

Also adds external metrics rules and workload-attributed custom metrics to prometheus-adapter,
enabling HPA-based GPU autoscaling from any namespace.
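
The external metrics path can be exercised by hand before wiring up an HPA. A minimal sketch, assuming the adapter is served under external.metrics.k8s.io/v1beta1 and that querying from the default namespace is acceptable; the function names are illustrative and are not taken from the PR's scripts:

```shell
# Sketch only: function names and default arguments are hypothetical.

# Build the aggregated-API path for one external metric in one namespace.
external_metric_path() {
  local namespace="$1" metric="$2"
  printf '/apis/external.metrics.k8s.io/v1beta1/namespaces/%s/%s\n' "$namespace" "$metric"
}

# Succeeds only if the API returns at least one value for the metric.
external_metric_available() {
  local namespace="${1:-default}" metric="${2:-dcgm_gpu_power_usage}"
  local body
  # kubectl get --raw issues a plain GET against the given API server path.
  body="$(kubectl get --raw "$(external_metric_path "$namespace" "$metric")" 2>/dev/null)" || return 1
  # Each item in an ExternalMetricValueList carries a "metricName" field,
  # so an empty items list fails this check.
  printf '%s' "$body" | grep -q '"metricName"'
}
```

Called as `external_metric_available default dcgm_gpu_power_usage`, it returns non-zero both when the API call fails and when the metric list is empty, which keeps the two failure modes from being mistaken for success.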

Fixes custom metrics validation to handle DCGM exporter pod-mapping (which relabels metrics
with the GPU workload's namespace when a GPU is in use) and adds retry logic for
prometheus-adapter metric discovery timing.
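
The retry logic is not shown in this description. A minimal sketch of the general pattern, assuming a fixed delay between attempts; the `retry` helper and its argument order are hypothetical:

```shell
# Sketch only: retry is a hypothetical helper, not the PR's implementation.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  local attempts="$1" delay="$2" n=1
  shift 2
  while :; do
    # Run the command in the current shell; stop at the first success.
    if "$@"; then
      return 0
    fi
    if [ "$n" -ge "$attempts" ]; then
      echo "retry: '$*' failed after $n attempts" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}
```

For example, `retry 10 15 kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1` would poll the custom metrics API until the adapter has relisted, failing loudly once the attempts are exhausted. The attempt count and delay here are placeholders.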

Adds kai-scheduler dynamo Queue CR creation before DynamoGraphDeployment in the inference
workflow (grove-operator sets kai.scheduler/queue=dynamo on pods, but the kai-scheduler
chart only creates default-parent-queue and default-queue).

Changes

New files:

kwok/scripts/install-karpenter-kwok.sh — Builds Karpenter KWOK provider from source via ko, side-loads into kind, deploys via Helm with GPU instance types
kwok/scripts/validate-cluster-autoscaling.sh — End-to-end cluster autoscaling validation script (external metrics → HPA → Karpenter → KWOK nodes → consolidation)
kwok/manifests/karpenter/instance-types.json — GPU instance types for Karpenter KWOK provider (p5.48xlarge 8×GPU, g5.xlarge 1×GPU, g5.2xlarge 1×GPU)
kwok/manifests/karpenter/nodepool.yaml — NodePool with GPU taint + KWOKNodeClass
kwok/manifests/karpenter/hpa-gpu-scale-test.yaml — Deployment + HPA using external dcgm_gpu_power_usage metric

Modified files:

recipes/components/prometheus-adapter/values.yaml — Add external metrics rules (dcgm_gpu_utilization, dcgm_gpu_memory_used, dcgm_gpu_power_usage) and workload-attributed custom metrics (workload_gpu_utilization, workload_gpu_memory_used)
.github/workflows/gpu-h100-inference-test.yaml — Add cluster autoscaling step, Dynamo kai-scheduler queue, custom metrics namespace/retry fix, trigger paths
.github/workflows/gpu-h100-training-test.yaml — Add cluster autoscaling step, trigger paths
.settings.yaml — Add karpenter: v1.8.0 to testing_tools
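
The contents of hpa-gpu-scale-test.yaml are not reproduced here. As a rough illustration of what an HPA driven by an external metric looks like, the sketch below renders an autoscaling/v2 manifest from a shell function. The object name, replica bounds, and target value are assumptions chosen for the example, not values from the PR:

```shell
# Sketch only: name, replica bounds, and target value are hypothetical.
# Usage: render_gpu_hpa [NAME] [METRIC] [AVERAGE_VALUE]
render_gpu_hpa() {
  local name="${1:-gpu-scale-test}" metric="${2:-dcgm_gpu_power_usage}" target="${3:-50}"
  # Unquoted heredoc delimiter so the shell expands the variables below.
  cat <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ${name}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ${name}
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: External
      external:
        metric:
          name: ${metric}
        target:
          type: AverageValue
          averageValue: "${target}"
EOF
}
```

The output can be piped to `kubectl apply -f -`. With an AverageValue target, the HPA divides the metric value by the current replica count, so a cluster-wide power reading above the target pushes the replica count up toward maxReplicas.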

How it works

1. Install: Karpenter KWOK provider built from source (ko build sigs.k8s.io/karpenter/kwok) and deployed into kind cluster
2. Configure: NodePool + KWOKNodeClass created with GPU taints; GPU instance types loaded via ConfigMap
3. Verify metrics: External metric dcgm_gpu_power_usage confirmed available via external.metrics.k8s.io API
4. Scale up: HPA reads real GPU power metric, scales Deployment beyond node capacity → overflow pods trigger Karpenter
5. Provision: Karpenter provisions simulated KWOK nodes with nvidia.com/gpu capacity
6. Schedule: All GPU pods schedule onto KWOK nodes
7. Scale down: After cleanup, Karpenter consolidates KWOK nodes back to zero
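
The scale-down step can be checked with an explicit timeout so that a stalled consolidation fails the run. A minimal sketch, assuming Karpenter-provisioned nodes can be selected by the karpenter.sh/nodepool label; the function names and timings are hypothetical:

```shell
# Sketch only: helper names, selector, and timings are assumptions.

# Print the number of nodes matching a label selector.
count_nodes() {
  local selector="$1"
  # With no matches kubectl writes nothing to stdout, so wc -l yields 0.
  kubectl get nodes -l "$selector" --no-headers 2>/dev/null | wc -l | tr -d ' '
}

# Usage: wait_for_node_count SELECTOR EXPECTED TIMEOUT_SECONDS [INTERVAL_SECONDS]
# Returns 0 once the count matches, 1 if the timeout elapses first.
wait_for_node_count() {
  local selector="$1" expected="$2" timeout="$3" interval="${4:-5}"
  local elapsed=0 current
  while :; do
    current="$(count_nodes "$selector")"
    if [ "$current" -eq "$expected" ]; then
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out: $current nodes match '$selector', expected $expected" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}
```

A call such as `wait_for_node_count karpenter.sh/nodepool 0 300 || exit 1` turns a consolidation that never completes into a failed step.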

CI Results

Both H100 workflows pass all steps on commit 00cc5512:

Inference (24 steps): Dynamo deploy + inference, accelerator metrics, custom metrics (pod autoscaling), cluster autoscaling (Karpenter+KWOK), DRA GPU test, secure accelerator access, conformance evidence (54/54 resources)

Training (19 steps): Gang scheduling (2× H100 NVL), cluster autoscaling (Karpenter+KWOK), conformance evidence (39/39 resources)

Test plan

mchmarny

The approach is sound — validating the full metrics-driven autoscaling chain (DCGM → Prometheus → prometheus-adapter → HPA → Karpenter → KWOK nodes) is a thorough conformance test. The install script is well-structured with good error handling and diagnostics. The prometheus-adapter external metrics rules and workload-attributed pod metrics are correctly configured.

Main concerns:

1. ~135 lines of shell duplicated verbatim between inference and training workflows; extract to a reusable script in kwok/scripts/ following the existing pattern
2. Consolidation test silently passes on failure; it is the only validation step without a failure assertion
3. gpu-scale-test.yaml is unused: not referenced by any workflow or script

Minor:
4. ko build 2>&1 | tail -1 is fragile if ko emits warnings to stderr
5. cd without pushd/popd in build_karpenter()
6. Spot offerings in instance-types.json are dead data given the on-demand-only NodePool
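
One possible shape for addressing items 4 and 5 together: capture only stdout (where ko prints the image reference), let stderr pass through, and confine the directory change to a subshell. This is a sketch, not the change that was merged; `build_image` is a hypothetical helper:

```shell
# Sketch only: build_image is a hypothetical helper.
# Usage: build_image SRC_DIR BUILD_COMMAND [ARGS...]
# Prints the last line of the build command's stdout.
build_image() {
  local src_dir="$1"
  shift
  local out ref
  # The command substitution runs in a subshell, so the cd does not leak
  # into the caller. Only stdout is captured; warnings on stderr cannot
  # end up in the result.
  out="$(cd "$src_dir" && "$@")" || return 1
  ref="$(printf '%s\n' "$out" | tail -n 1)"
  if [ -z "$ref" ]; then
    echo "build_image: build produced no image reference" >&2
    return 1
  fi
  printf '%s\n' "$ref"
}
```

Usage would look like `image="$(build_image "$KARPENTER_SRC" ko build sigs.k8s.io/karpenter/kwok)"`, where KARPENTER_SRC is a placeholder for the checkout directory. Because the build output is captured before `tail` runs, a failing build propagates its exit status instead of being masked by the pipeline.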

…enter + KWOK

Add cluster autoscaling validation to both H100 GPU workflows (inference and training). The test validates the full metrics-driven autoscaling chain:

DCGM metrics → Prometheus → prometheus-adapter (external metric) → HPA scales Deployment → pending pods → Karpenter → KWOK nodes

New files:
- kwok/scripts/install-karpenter-kwok.sh: builds Karpenter KWOK provider via ko and deploys with Helm into kind clusters
- kwok/scripts/validate-cluster-autoscaling.sh: reusable E2E script that verifies external metrics, HPA scaling, node provisioning, pod scheduling, and scale-down consolidation
- kwok/manifests/karpenter/: NodePool, KWOKNodeClass, HPA test workload, and GPU instance type definitions

Changed files:
- recipes/components/prometheus-adapter/values.yaml: add workload-attributed custom metrics, external metrics rules for cluster-wide GPU metrics (power_usage, memory_used, utilization) with namespaced: false, and 30s metrics relist interval
- .github/workflows/gpu-h100-{inference,training}-test.yaml: add cluster autoscaling step and trigger paths for karpenter manifests
- .settings.yaml: add karpenter v1.8.0 to testing_tools

github-actions · 2026-05-23T06:53:10Z

This pull request has been automatically locked since it has been closed for 90 days with no further activity. Please open a new pull request for related changes.

dims requested review from a team as code owners February 20, 2026 16:07

github-actions Bot added area/recipes area/ci size/XL labels Feb 20, 2026

github-advanced-security AI found potential problems Feb 20, 2026

mchmarny reviewed Feb 20, 2026

dims changed the title from "feat(ci): add metrics-driven cluster autoscaling validation with Karpenter + KWOK" to "[DO-NOT-MERGE] feat(ci): add metrics-driven cluster autoscaling validation with Karpenter + KWOK" Feb 20, 2026

github-advanced-security AI found potential problems Feb 20, 2026

Comment thread kwok/manifests/karpenter/hpa-gpu-scale-test.yaml Fixed

Comment thread kwok/manifests/karpenter/hpa-gpu-scale-test.yaml Fixed

Comment thread kwok/manifests/karpenter/hpa-gpu-scale-test.yaml Fixed

dims force-pushed the feat/cluster-autoscaling-karpenter-kwok branch from e9d9fb2 to 6397834 Compare February 20, 2026 22:06

github-actions Bot added area/tests area/bundler labels Feb 20, 2026

dims force-pushed the feat/cluster-autoscaling-karpenter-kwok branch from 6397834 to d82eb32 Compare February 20, 2026 22:08

github-actions Bot removed area/tests area/bundler labels Feb 20, 2026

dims force-pushed the feat/cluster-autoscaling-karpenter-kwok branch 4 times, most recently from 5dc5f66 to 4707110 Compare February 20, 2026 23:22

github-actions Bot added area/tests area/validator area/docs area/bundler area/cli area/api area/collector area/infra labels Feb 21, 2026

dims force-pushed the feat/cluster-autoscaling-karpenter-kwok branch from cb3e596 to 67cd47b Compare February 21, 2026 00:52

github-actions Bot removed area/tests area/validator labels Feb 21, 2026

github-actions Bot removed area/docs area/bundler area/cli area/api area/collector area/infra labels Feb 21, 2026

dims force-pushed the feat/cluster-autoscaling-karpenter-kwok branch 7 times, most recently from 503ad00 to 00cc551 Compare February 21, 2026 17:47

dims changed the title from "[DO-NOT-MERGE] feat(ci): add metrics-driven cluster autoscaling validation with Karpenter + KWOK" to "feat(ci): add metrics-driven cluster autoscaling validation with Karpenter + KWOK" Feb 21, 2026

dims force-pushed the feat/cluster-autoscaling-karpenter-kwok branch from 00cc551 to 664ee37 Compare February 21, 2026 18:33

dims merged commit d474757 into NVIDIA:main Feb 21, 2026
33 checks passed

github-actions Bot locked as resolved and limited conversation to collaborators May 23, 2026

dims merged 1 commit into NVIDIA:main from dims:feat/cluster-autoscaling-karpenter-kwok

3 participants
