feat(ci): add metrics-driven cluster autoscaling validation with Karpenter + KWOK #168

Merged: dims merged 1 commit into NVIDIA:main from dims:feat/cluster-autoscaling-karpenter-kwok on Feb 21, 2026
@dims (Collaborator) commented Feb 20, 2026

Summary

Add CNCF AI Conformance #8a (cluster_autoscaling) validation to both H100 GPU CI workflows
(inference and training). Validates the full metrics-driven autoscaling chain end-to-end:

DCGM metrics → Prometheus → prometheus-adapter (external metric) → HPA scales Deployment
  → pending GPU pods → Karpenter provisions KWOK nodes → pods schedule → consolidation

Also adds external metrics rules and workload-attributed custom metrics to prometheus-adapter,
enabling HPA-based GPU autoscaling from any namespace.
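
As an illustration of what "external metric" means here, the sketch below builds the raw Kubernetes API path under which an external metric is served by the `external.metrics.k8s.io/v1beta1` API. The helper name `external_metric_path` and the `default` namespace are hypothetical and not taken from this PR's scripts; only the metric name `dcgm_gpu_power_usage` comes from the description above.

```shell
# Build the raw API path for an external metric.
# Hypothetical helper for illustration; not part of this PR's scripts.
external_metric_path() {
  local namespace="$1" metric="$2"
  printf '/apis/external.metrics.k8s.io/v1beta1/namespaces/%s/%s\n' \
    "$namespace" "$metric"
}

# Usage against a live cluster (needs kubectl and a running prometheus-adapter):
#   kubectl get --raw "$(external_metric_path default dcgm_gpu_power_usage)"
```

Keeping the path construction separate from the `kubectl` call makes it possible to check the string without a cluster.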

Fixes custom metrics validation to handle DCGM exporter pod-mapping (which relabels metrics
with the GPU workload's namespace when a GPU is in use) and adds retry logic for
prometheus-adapter metric discovery timing.
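
The retry logic itself is not shown in this description. A minimal sketch of the general pattern, assuming a fixed delay between attempts, might look like the following; the function name `retry` and its argument order are assumptions, not the workflow's actual code.

```shell
# Run a command up to <attempts> times, sleeping <delay> seconds between tries.
# Hypothetical helper; the workflow's real retry code may differ.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      echo "attempt ${i}/${attempts} failed; retrying in ${delay}s" >&2
      sleep "$delay"
    fi
    i=$(( i + 1 ))
  done
  echo "giving up after ${attempts} attempts" >&2
  return 1
}

# Usage (illustrative only; the namespace and metric are placeholders):
#   retry 12 10 kubectl get --raw \
#     "/apis/custom.metrics.k8s.io/v1beta1/namespaces/<ns>/pods/*/<metric>"
```

A bounded retry fits this case because prometheus-adapter only learns about new series on its relist interval, so the first query after deployment can legitimately fail.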

Adds kai-scheduler dynamo Queue CR creation before DynamoGraphDeployment in the inference
workflow (grove-operator sets kai.scheduler/queue=dynamo on pods, but the kai-scheduler
chart only creates default-parent-queue and default-queue).
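
For readers unfamiliar with kai-scheduler queues, a sketch of rendering such a Queue manifest follows. The queue name `dynamo` and parent `default-parent-queue` come from the description above; the `scheduling.run.ai/v2` API version and the `resources` fields are written from memory of the KAI scheduler documentation and should be treated as assumptions to verify against the installed CRD. The helper `render_queue` is hypothetical.

```shell
# Print a kai-scheduler Queue manifest to stdout.
# ASSUMPTION: apiVersion and resource fields follow the KAI scheduler docs;
# check them against the CRD installed in the cluster before relying on this.
render_queue() {
  local name="$1" parent="$2"
  cat <<EOF
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: ${name}
spec:
  parentQueue: ${parent}
  resources:
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
EOF
}

# Usage (illustrative): create the queue before the DynamoGraphDeployment.
#   render_queue dynamo default-parent-queue | kubectl apply -f -
```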

Changes

New files:

  • kwok/scripts/install-karpenter-kwok.sh — Builds Karpenter KWOK provider from source via ko, side-loads into kind, deploys via Helm with GPU instance types
  • kwok/scripts/validate-cluster-autoscaling.sh — End-to-end cluster autoscaling validation script (external metrics → HPA → Karpenter → KWOK nodes → consolidation)
  • kwok/manifests/karpenter/instance-types.json — GPU instance types for Karpenter KWOK provider (p5.48xlarge 8×GPU, g5.xlarge 1×GPU, g5.2xlarge 1×GPU)
  • kwok/manifests/karpenter/nodepool.yaml — NodePool with GPU taint + KWOKNodeClass
  • kwok/manifests/karpenter/hpa-gpu-scale-test.yaml — Deployment + HPA using external dcgm_gpu_power_usage metric

Modified files:

  • recipes/components/prometheus-adapter/values.yaml — Add external metrics rules (dcgm_gpu_utilization, dcgm_gpu_memory_used, dcgm_gpu_power_usage) and workload-attributed custom metrics (workload_gpu_utilization, workload_gpu_memory_used)
  • .github/workflows/gpu-h100-inference-test.yaml — Add cluster autoscaling step, Dynamo kai-scheduler queue, custom metrics namespace/retry fix, trigger paths
  • .github/workflows/gpu-h100-training-test.yaml — Add cluster autoscaling step, trigger paths
  • .settings.yaml — Add karpenter: v1.8.0 to testing_tools

How it works

  1. Install: Karpenter KWOK provider built from source (ko build sigs.k8s.io/karpenter/kwok) and deployed into kind cluster
  2. Configure: NodePool + KWOKNodeClass created with GPU taints; GPU instance types loaded via ConfigMap
  3. Verify metrics: External metric dcgm_gpu_power_usage confirmed available via external.metrics.k8s.io API
  4. Scale up: HPA reads real GPU power metric, scales Deployment beyond node capacity → overflow pods trigger Karpenter
  5. Provision: Karpenter provisions simulated KWOK nodes with nvidia.com/gpu capacity
  6. Schedule: All GPU pods schedule onto KWOK nodes
  7. Scale down: After cleanup, Karpenter consolidates KWOK nodes back to zero
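
To make step 4 concrete, the sketch below works through the standard HPA replica formula, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), and the resulting number of overflow pods. All numbers in the usage comments are hypothetical, the helper names are invented for illustration, and the sketch ignores the HPA tolerance band and min/max replica clamping.

```shell
# Integer ceiling division for positive integers.
ceil_div() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)
desired_replicas() {
  local current="$1" metric="$2" target="$3"
  ceil_div $(( current * metric )) "$target"
}

# Pods that cannot fit on existing GPUs and therefore stay Pending,
# assuming one GPU per pod.
pending_pods() {
  local desired="$1" capacity="$2"
  if [ "$desired" -gt "$capacity" ]; then
    echo $(( desired - capacity ))
  else
    echo 0
  fi
}

# Hypothetical example: 1 replica, metric reads 300 against a target of 100.
#   desired_replicas 1 300 100   # prints 3
# With 2 real GPUs available, one pod is left Pending for Karpenter to handle:
#   pending_pods 3 2             # prints 1
```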

CI Results

Both H100 workflows pass all steps on commit 00cc5512:

Inference (24 steps): Dynamo deploy + inference, accelerator metrics, custom metrics (pod autoscaling), cluster autoscaling (Karpenter+KWOK), DRA GPU test, secure accelerator access, conformance evidence (54/54 resources)

Training (19 steps): Gang scheduling (2× H100 NVL), cluster autoscaling (Karpenter+KWOK), conformance evidence (39/39 resources)

Test plan

  • H100 inference workflow: cluster autoscaling step passes
  • H100 training workflow: cluster autoscaling step passes
  • prometheus-adapter serves external metrics (dcgm_gpu_power_usage)
  • HPA reads external metric and computes desiredReplicas > 1
  • Karpenter provisions KWOK nodes with nvidia.com/gpu capacity
  • All GPU pods schedule onto KWOK nodes
  • Scale-down consolidation removes KWOK nodes
  • Custom metrics validation handles DCGM pod-mapping namespaces
  • Dynamo deployment works with kai-scheduler queue
  • All other existing steps continue to pass (no regressions)

@mchmarny (Member) left a comment

The approach is sound — validating the full metrics-driven autoscaling chain (DCGM → Prometheus → prometheus-adapter → HPA → Karpenter → KWOK nodes) is a thorough conformance test. The install script is well-structured with good error handling and diagnostics. The prometheus-adapter external metrics rules and workload-attributed pod metrics are correctly configured.

Main concerns:

  1. ~135 lines of shell duplicated verbatim between inference and training workflows — extract to a reusable script in kwok/scripts/ following the existing pattern
  2. Consolidation test silently passes on failure — the only validation step without a failure assertion
  3. gpu-scale-test.yaml is unused — not referenced by any workflow or script

Minor:

  4. ko build 2>&1 | tail -1 is fragile if ko emits warnings to stderr
  5. cd without pushd/popd in build_karpenter()
  6. Spot offerings in instance-types.json are dead data given the on-demand-only NodePool
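
For reference, concerns 2, 4, and 5 could be addressed along the following lines. This is a sketch under stated assumptions, not the code that was eventually merged: all three helper names are hypothetical, the sketch assumes `ko build` prints the image reference on stdout and its logs on stderr, and it uses a subshell as a portable alternative to the pushd/popd the review suggests.

```shell
# (2) Fail loudly if consolidation does not finish within the timeout.
# Remaining arguments are any command that prints the current node count.
assert_scaled_to_zero() {
  local timeout="$1" interval="$2" waited=0 count
  shift 2
  while :; do
    count="$("$@")" || return 1
    if [ "$count" -eq 0 ]; then
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "ERROR: ${count} node(s) still present after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
}

# (4) Take the last line of stdout only; stderr is left out of the pipe,
# so a trailing warning cannot be mistaken for the image reference.
last_stdout_line() {
  "$@" | tail -n 1
}

# (5) Run a command in another directory without changing the caller's cwd.
run_in_dir() {
  local dir="$1"
  shift
  ( cd "$dir" && "$@" )
}

# Usage (illustrative; count_kwok_nodes is a placeholder for a command that
# prints the node count, and the repository path and ko arguments are placeholders):
#   assert_scaled_to_zero 300 10 count_kwok_nodes || exit 1
#   image="$(run_in_dir "$karpenter_src" last_stdout_line ko build <args>)"
```

Returning a non-zero status from `assert_scaled_to_zero` lets the calling step fail the job, which is what the consolidation check was missing.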

@dims dims changed the title feat(ci): add metrics-driven cluster autoscaling validation with Karpenter + KWOK [DO-NOT-MERGE] feat(ci): add metrics-driven cluster autoscaling validation with Karpenter + KWOK Feb 20, 2026
@dims dims force-pushed the feat/cluster-autoscaling-karpenter-kwok branch 7 times, most recently from 503ad00 to 00cc551 Compare February 21, 2026 17:47
@dims dims changed the title [DO-NOT-MERGE] feat(ci): add metrics-driven cluster autoscaling validation with Karpenter + KWOK feat(ci): add metrics-driven cluster autoscaling validation with Karpenter + KWOK Feb 21, 2026
…enter + KWOK

Add cluster autoscaling validation to both H100 GPU workflows (inference
and training). The test validates the full metrics-driven autoscaling chain:

  DCGM metrics → Prometheus → prometheus-adapter (external metric)
  → HPA scales Deployment → pending pods → Karpenter → KWOK nodes

New files:
- kwok/scripts/install-karpenter-kwok.sh: builds Karpenter KWOK
  provider via ko and deploys with Helm into kind clusters
- kwok/scripts/validate-cluster-autoscaling.sh: reusable E2E script
  that verifies external metrics, HPA scaling, node provisioning,
  pod scheduling, and scale-down consolidation
- kwok/manifests/karpenter/: NodePool, KWOKNodeClass, HPA test
  workload, and GPU instance type definitions

Changed files:
- recipes/components/prometheus-adapter/values.yaml: add workload-
  attributed custom metrics, external metrics rules for cluster-wide
  GPU metrics (power_usage, memory_used, utilization) with
  namespaced: false, and 30s metrics relist interval
- .github/workflows/gpu-h100-{inference,training}-test.yaml: add
  cluster autoscaling step and trigger paths for karpenter manifests
- .settings.yaml: add karpenter v1.8.0 to testing_tools
@dims dims force-pushed the feat/cluster-autoscaling-karpenter-kwok branch from 00cc551 to 664ee37 Compare February 21, 2026 18:33
@dims dims merged commit d474757 into NVIDIA:main Feb 21, 2026
33 checks passed
@github-actions (Contributor)

This pull request has been automatically locked since it has been closed for 90 days with no further activity. Please open a new pull request for related changes.

@github-actions github-actions Bot locked as resolved and limited conversation to collaborators May 23, 2026