feat(ci): add metrics-driven cluster autoscaling validation with Karpenter + KWOK #168
mchmarny
left a comment
The approach is sound — validating the full metrics-driven autoscaling chain (DCGM → Prometheus → prometheus-adapter → HPA → Karpenter → KWOK nodes) is a thorough conformance test. The install script is well-structured with good error handling and diagnostics. The prometheus-adapter external metrics rules and workload-attributed pod metrics are correctly configured.
Main concerns:
1. ~135 lines of shell duplicated verbatim between inference and training workflows: extract to a reusable script in kwok/scripts/ following the existing pattern
2. Consolidation test silently passes on failure: the only validation step without a failure assertion
3. gpu-scale-test.yaml is unused: not referenced by any workflow or script
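A failure assertion for the consolidation check could look roughly like the sketch below. It is an illustration only, not the script from this PR: `wait_for_value` and `count_karpenter_nodes` are hypothetical names, and the label selector and timings in the commented usage are assumptions.

```shell
# wait_for_value <expected> <timeout_seconds> <interval_seconds> <command...>
# Polls the command until it prints <expected>; returns non-zero on timeout
# so the calling CI step fails instead of passing silently.
wait_for_value() {
  local expected="$1" timeout="$2" interval="$3"
  shift 3
  local deadline=$(( $(date +%s) + timeout ))
  local actual=""
  while :; do
    actual="$("$@" 2>/dev/null || true)"
    if [ "$actual" = "$expected" ]; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "ERROR: expected '$expected', last saw '$actual' after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Hypothetical usage in the consolidation step (selector and timings assumed):
# count_karpenter_nodes() {
#   kubectl get nodes -l karpenter.sh/nodepool --no-headers 2>/dev/null | wc -l | tr -d ' '
# }
# wait_for_value 0 300 10 count_karpenter_nodes || exit 1
```

The key point is the explicit non-zero return on timeout, which gives the consolidation step the same fail-fast behavior as the other validation steps.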
Minor:
4. ko build 2>&1 | tail -1 is fragile if ko emits warnings to stderr
5. cd without pushd/popd in build_karpenter()
6. Spot offerings in instance-types.json are dead data given the on-demand-only NodePool
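Items 4 and 5 might be addressed along the lines of this sketch. It assumes the build tool prints the image reference as the last line of stdout and sends log noise to stderr. `fake_ko_build`, `capture_image_ref`, and `run_in_dir` are hypothetical names, and a subshell is used here as an alternative to pushd/popd.

```shell
# Stand-in for `ko build` (hypothetical output): log noise on stderr,
# image reference as the last line of stdout.
fake_ko_build() {
  echo "warning: noisy log line" >&2
  echo "kind.local/karpenter:abc123"
}

# Capture stdout only; stderr flows to the CI log untouched.
# Fails if the build command fails or prints nothing.
capture_image_ref() {
  local out ref
  out="$("$@")" || return 1
  ref="$(printf '%s\n' "$out" | tail -n 1)"
  [ -n "$ref" ] || return 1
  printf '%s\n' "$ref"
}

# Run a command in another directory without changing the caller's cwd.
# The subshell confines the cd, so no popd is needed on any exit path.
run_in_dir() {
  local dir="$1"
  shift
  ( cd "$dir" && "$@" )
}
```

Capturing the full stdout before taking the last line also preserves the build command's exit status, which a plain `cmd | tail -1` pipeline would hide without `pipefail`.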
…enter + KWOK
Add cluster autoscaling validation to both H100 GPU workflows (inference
and training). The test validates the full metrics-driven autoscaling chain:
DCGM metrics → Prometheus → prometheus-adapter (external metric)
→ HPA scales Deployment → pending pods → Karpenter → KWOK nodes
New files:
- kwok/scripts/install-karpenter-kwok.sh: builds Karpenter KWOK
provider via ko and deploys with Helm into kind clusters
- kwok/scripts/validate-cluster-autoscaling.sh: reusable E2E script
that verifies external metrics, HPA scaling, node provisioning,
pod scheduling, and scale-down consolidation
- kwok/manifests/karpenter/: NodePool, KWOKNodeClass, HPA test
workload, and GPU instance type definitions
Changed files:
- recipes/components/prometheus-adapter/values.yaml: add workload-
attributed custom metrics, external metrics rules for cluster-wide
GPU metrics (power_usage, memory_used, utilization) with
namespaced: false, and 30s metrics relist interval
- .github/workflows/gpu-h100-{inference,training}-test.yaml: add
cluster autoscaling step and trigger paths for karpenter manifests
- .settings.yaml: add karpenter v1.8.0 to testing_tools
Summary
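Add CNCF AI Conformance #8a (cluster_autoscaling) validation to both H100 GPU CI workflows (inference and training). Validates the full metrics-driven autoscaling chain end-to-end:
DCGM metrics → Prometheus → prometheus-adapter (external metric) → HPA scales Deployment → pending pods → Karpenter → KWOK nodes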
Also adds external metrics rules and workload-attributed custom metrics to prometheus-adapter,
enabling HPA-based GPU autoscaling from any namespace.
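For intuition on the HPA step, the Kubernetes documentation gives the core scaling rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below computes that with integer arithmetic. The function name and the sample numbers are hypothetical, the target type used by the test HPA is not shown here, and the controller's tolerance band and min/max replica clamps are left out.

```shell
# hpa_desired_replicas <current_replicas> <current_metric> <target_metric>
# Integer-only ceil(current_replicas * current_metric / target_metric).
# Simplification: ignores the HPA tolerance and minReplicas/maxReplicas.
hpa_desired_replicas() {
  local current="$1" metric="$2" target="$3"
  echo $(( (current * metric + target - 1) / target ))
}

# Example with made-up numbers: 1 replica, metric at 250, target 100.
# hpa_desired_replicas 1 250 100   # prints 3
```

Any result above the current replica count leaves pods pending once real GPU capacity is exhausted, which is what hands control to Karpenter in the chain.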
Fixes custom metrics validation to handle DCGM exporter pod-mapping (which relabels metrics
with the GPU workload's namespace when a GPU is in use) and adds retry logic for
prometheus-adapter metric discovery timing.
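The retry logic for metric discovery could be sketched as a small generic helper like the one below. This is not the workflow's actual code: `retry` is a hypothetical name, and the attempt count and delay in the commented usage are assumptions.

```shell
# retry <attempts> <delay_seconds> <command...>
# Re-runs the command until it succeeds or the attempts are used up.
retry() {
  local attempts="$1" delay="$2" n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "ERROR: '$*' still failing after $attempts attempts" >&2
      return 1
    fi
    n=$(( n + 1 ))
    sleep "$delay"
  done
}

# Hypothetical usage while prometheus-adapter relists metrics:
# retry 12 10 kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"
```

A bounded retry fits here because the adapter only discovers new series on its relist interval, so the first query after a deploy can legitimately come back empty.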
Adds kai-scheduler dynamo Queue CR creation before DynamoGraphDeployment in the inference workflow (grove-operator sets kai.scheduler/queue=dynamo on pods, but the kai-scheduler chart only creates default-parent-queue and default-queue).
Changes
New files:
- kwok/scripts/install-karpenter-kwok.sh: Builds Karpenter KWOK provider from source via ko, side-loads into kind, deploys via Helm with GPU instance types
- kwok/scripts/validate-cluster-autoscaling.sh: End-to-end cluster autoscaling validation script (external metrics → HPA → Karpenter → KWOK nodes → consolidation)
- kwok/manifests/karpenter/instance-types.json: GPU instance types for Karpenter KWOK provider (p5.48xlarge 8×GPU, g5.xlarge 1×GPU, g5.2xlarge 1×GPU)
- kwok/manifests/karpenter/nodepool.yaml: NodePool with GPU taint + KWOKNodeClass
- kwok/manifests/karpenter/hpa-gpu-scale-test.yaml: Deployment + HPA using external dcgm_gpu_power_usage metric
Modified files:
- recipes/components/prometheus-adapter/values.yaml: Add external metrics rules (dcgm_gpu_utilization, dcgm_gpu_memory_used, dcgm_gpu_power_usage) and workload-attributed custom metrics (workload_gpu_utilization, workload_gpu_memory_used)
- .github/workflows/gpu-h100-inference-test.yaml: Add cluster autoscaling step, Dynamo kai-scheduler queue, custom metrics namespace/retry fix, trigger paths
- .github/workflows/gpu-h100-training-test.yaml: Add cluster autoscaling step, trigger paths
- .settings.yaml: Add karpenter: v1.8.0 to testing_tools
How it works
- Karpenter KWOK provider built from source (ko build sigs.k8s.io/karpenter/kwok) and deployed into kind cluster
- dcgm_gpu_power_usage confirmed available via external.metrics.k8s.io API
- nvidia.com/gpu capacity
CI Results
Both H100 workflows pass all steps on commit 00cc5512:
- Inference (24 steps): Dynamo deploy + inference, accelerator metrics, custom metrics (pod autoscaling), cluster autoscaling (Karpenter+KWOK), DRA GPU test, secure accelerator access, conformance evidence (54/54 resources)
- Training (19 steps): Gang scheduling (2× H100 NVL), cluster autoscaling (Karpenter+KWOK), conformance evidence (39/39 resources)
Test plan
- External metric (dcgm_gpu_power_usage)
- desiredReplicas > 1
- nvidia.com/gpu capacity
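The test-plan checks can be probed by hand with standard kubectl queries, sketched below. `external_metric_path` is a hypothetical helper, and the namespace and HPA name in the commented commands are placeholders, not values taken from this PR.

```shell
# external_metric_path <namespace> <metric>
# Builds the external metrics API path for a metric in a namespace.
external_metric_path() {
  printf '/apis/external.metrics.k8s.io/v1beta1/namespaces/%s/%s\n' "$1" "$2"
}

# Against a live cluster (namespace and HPA name are placeholders):
#   kubectl get --raw "$(external_metric_path default dcgm_gpu_power_usage)"
#   kubectl get hpa <hpa-name> -o jsonpath='{.status.desiredReplicas}'
#   kubectl get nodes -o jsonpath='{.items[*].status.capacity.nvidia\.com/gpu}'
```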