feat(recipes): add deployment validation to GB200/EKS recipes #895
H100/EKS overlays already declare an explicit `validation.deployment` block at the intent layer: four checks (operator-health, expected-resources, gpu-operator-version, check-nvidia-smi) plus a gpu-operator version constraint. GB200/EKS relied only on auto-discovered component health-checks for the deployment phase, with no version floor.

Add the same four-check block to gb200-eks-training.yaml and gb200-eks-inference.yaml at the intent layer so every OS+platform variant inherits it (ubuntu-training, ubuntu-training-kubeflow, ubuntu-inference, ubuntu-inference-dynamo).

Version floor is `Deployment.gpu-operator.version >= v25.10.0`. GB200 support stabilized in gpu-operator v25.10, and the chart pin in the resolved GB200/EKS recipes is currently v25.10.1, so the constraint floors at the same minor. This matches the H100 convention of pinning one patch below the validated chart version.
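
The inheritance described above (a block declared once at the intent layer and picked up by every OS+platform leaf) can be sketched roughly as follows. This is a minimal illustration, not the project's resolver: the `Overlay` and `DeploymentValidation` types, the `resolveDeployment` function, the leaf names, and the "nearest declared block wins" rule are all assumptions made for the example, and the real recipe schema and merge semantics may differ.

```go
package main

import "fmt"

// DeploymentValidation is a hypothetical, simplified stand-in for a
// validation.deployment block. The real recipe schema may differ.
type DeploymentValidation struct {
	Checks      []string
	Constraints []string
}

// Overlay is a hypothetical overlay node with an optional parent layer
// and an optional deployment validation block.
type Overlay struct {
	Name       string
	Parent     *Overlay
	Deployment *DeploymentValidation
}

// resolveDeployment walks up the parent chain and returns the nearest
// declared deployment block, or nil if no layer declares one.
// "Nearest wins" is an assumption, not the documented merge rule.
func resolveDeployment(o *Overlay) *DeploymentValidation {
	for cur := o; cur != nil; cur = cur.Parent {
		if cur.Deployment != nil {
			return cur.Deployment
		}
	}
	return nil
}

func main() {
	// Intent-layer overlay carrying the four checks and the version floor.
	training := &Overlay{
		Name: "gb200-eks-training",
		Deployment: &DeploymentValidation{
			Checks: []string{
				"operator-health",
				"expected-resources",
				"gpu-operator-version",
				"check-nvidia-smi",
			},
			Constraints: []string{"Deployment.gpu-operator.version >= v25.10.0"},
		},
	}

	// Leaf names and their direct parent are assumed for illustration.
	leaves := []*Overlay{
		{Name: "ubuntu-training", Parent: training},
		{Name: "ubuntu-training-kubeflow", Parent: training},
	}

	for _, leaf := range leaves {
		d := resolveDeployment(leaf)
		fmt.Printf("%s: %d checks, constraints %v\n", leaf.Name, len(d.Checks), d.Constraints)
	}
}
```

Under these assumptions, neither leaf declares its own block, so both resolve to the one defined on the intent-layer overlay. An overlay with no declaring ancestor resolves to nil, which corresponds to the pre-change state of `gb200-eks-inference.yaml`.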
Summary
Add an explicit `validation.deployment` block to the GB200/EKS intent-layer overlays so every OS+platform leaf (ubuntu-training, ubuntu-training-kubeflow, ubuntu-inference, ubuntu-inference-dynamo) inherits the four deployment-phase checks H100/EKS already runs, plus a GB200-specific gpu-operator version floor.

Motivation / Context
H100/EKS overlays (`h100-eks-training.yaml`, the H100/EKS inference variants, and the AKS siblings) declare an explicit `validation.deployment` block: four checks plus a gpu-operator version constraint.

`gb200-eks-training.yaml` had `performance` + `conformance` blocks but no `deployment`, and `gb200-eks-inference.yaml` had no `validation` block at all. Resolved recipes were running only auto-discovered component health-checks, with no explicit gpu-operator version floor, so a regression to gpu-operator v25.9 or earlier (which is missing GB200 stability fixes) would silently validate clean.

Mirroring the H100 pattern at the intent layer keeps the rule defined once and inherited by every OS/platform leaf.
Fixes: N/A
Related: matches the convention in `h100-eks-training.yaml`, `h100-eks-ubuntu-inference-{nim,dynamo}.yaml`

Type of Change
Component(s) Affected
- `cmd/aicr`, `pkg/cli`
- `cmd/aicrd`, `pkg/api`, `pkg/server`
- `pkg/recipe`
- `pkg/bundler`, `pkg/component/*`
- `pkg/collector`, `pkg/snapshotter`
- `pkg/validator`
- `pkg/errors`, `pkg/k8s`
- `docs/`, `examples/`
- `recipes/overlays/gb200-eks-*.yaml`

Implementation Notes
Version floor. The constraint is `Deployment.gpu-operator.version >= v25.10.0`, not the H100 number `>= v24.6.0`. GB200 support stabilized in gpu-operator v25.10; the chart pin in the resolved GB200/EKS recipes today is `v25.10.1`, so the constraint floors at the same minor (one patch below the validated chart version). This matches the H100 convention of pinning the floor to the lowest gpu-operator known to host the target accelerator cleanly.

Layering. The block goes at the intent layer (`gb200-eks-training.yaml`, `gb200-eks-inference.yaml`) so all downstream OS+platform leaves inherit it. This matches the comment in the existing H100 EKS training overlay ("Defined at the intent layer (not OS-specific) so all OS variants inherit them").

No new check definitions. The four check names (`operator-health`, `expected-resources`, `gpu-operator-version`, `check-nvidia-smi`) all already exist in `pkg/validator/catalog`. This PR only adds them to the GB200/EKS validation lists.

Testing

    make qualify   # clean

Live cluster validation: ran the new deployment-phase block against a real GB200/EKS cluster (6 nodes, K8s 1.34.7-eks, gpu-operator v25.10.1):

    aicr recipe --service eks --accelerator gb200 --os ubuntu \
      --intent training --platform kubeflow -o recipe.yaml

    KUBECONFIG=<gb200-eks-kubeconfig> \
    AICR_VALIDATOR_IMAGE_TAG=edge \
      aicr validate --recipe recipe.yaml --phase deployment

Result: phase passed, 4/4 checks, 26.4s.

- `operator-health`
- `expected-resources`
- `gpu-operator-version`
- `check-nvidia-smi`

The `gpu-operator-version` check exercises the new constraint and confirms the v25.10.x chart pin clears the v25.10.0 floor.

Risk Assessment
Two YAML overlays. Constraint values are recipe-data, no code path changes. Tightens validation strictness — if the gpu-operator chart pin ever drops below v25.10.0 the validate step will catch it instead of silently shipping a stale operator. Revert is the inverse diff.
Rollout notes: None. No migration, no flag, no compatibility shim. Bundles produced against the current resolved recipe (gpu-operator v25.10.1) already clear the new floor.
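
To make the floor semantics concrete, here is a minimal sketch of how a `>= v25.10.0` comparison behaves for the versions mentioned in this PR. It is not the validator's implementation: `parseVersion` and `meetsFloor` are hypothetical helpers written for this example, they handle only plain `vMAJOR.MINOR.PATCH` strings (no pre-release or build suffixes), and `v25.9.2` is a made-up version used to stand in for "v25.9 or earlier".

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits a string such as "v25.10.1" into [25, 10, 1].
// Hypothetical helper: it rejects anything that is not MAJOR.MINOR.PATCH.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("expected MAJOR.MINOR.PATCH, got %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("bad component %q in %q: %w", p, v, err)
		}
		out[i] = n
	}
	return out, nil
}

// meetsFloor reports whether version >= floor, comparing each component
// numerically from major to patch.
func meetsFloor(version, floor string) (bool, error) {
	v, err := parseVersion(version)
	if err != nil {
		return false, err
	}
	f, err := parseVersion(floor)
	if err != nil {
		return false, err
	}
	for i := range v {
		if v[i] != f[i] {
			return v[i] > f[i], nil
		}
	}
	return true, nil // all components equal
}

func main() {
	const floor = "v25.10.0"
	// v25.10.1 is the current chart pin; v25.9.2 is an illustrative older release.
	for _, v := range []string{"v25.10.1", "v25.10.0", "v25.9.2", "v24.6.0"} {
		ok, err := meetsFloor(v, floor)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s >= %s: %t\n", v, floor, ok)
	}
}
```

The numeric comparison matters here: a plain string comparison would rank `v25.9.2` above `v25.10.0`, which is exactly the regression the floor is meant to catch. With component-wise comparison, the current pin `v25.10.1` clears the floor and any v25.9.x release does not.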
Checklist
- `make test` with `-race`
- `make lint`
- `git commit -S`