feat(recipes): add deployment validation to GB200/EKS recipes #895

Merged
mchmarny merged 1 commit into NVIDIA:main from njhensley:recipes/gb200-eks-deployment-validation
May 14, 2026
@njhensley

Member

Summary

Add an explicit validation.deployment block to the GB200/EKS intent-layer overlays so every OS+platform leaf (ubuntu-training, ubuntu-training-kubeflow, ubuntu-inference, ubuntu-inference-dynamo) inherits the four deployment-phase checks H100/EKS already runs, plus a GB200-specific gpu-operator version floor.

Motivation / Context

H100/EKS overlays (h100-eks-training.yaml, the H100/EKS inference variants, and the AKS siblings) declare:

validation:
  deployment:
    checks: [operator-health, expected-resources, gpu-operator-version, check-nvidia-smi]
    constraints:
      - name: Deployment.gpu-operator.version
        value: ">= v24.6.0"

gb200-eks-training.yaml had performance and conformance blocks but no deployment block, and gb200-eks-inference.yaml had no validation block at all. Resolved recipes were running only auto-discovered component health-checks, with no explicit gpu-operator version floor, so a regression to gpu-operator v25.9 or earlier (which is missing GB200 stability fixes) would silently validate clean.

Mirroring the H100 pattern at the intent layer keeps the rule defined once and inherited by every OS/platform leaf.

Fixes: N/A
Related: matches the convention in h100-eks-training.yaml, h100-eks-ubuntu-inference-{nim,dynamo}.yaml

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: recipes (recipes/overlays/gb200-eks-*.yaml)

Implementation Notes

Version floor. The constraint is Deployment.gpu-operator.version >= v25.10.0, not the H100 number >= v24.6.0. GB200 support stabilized in gpu-operator v25.10; the chart pin in the resolved GB200/EKS recipes today is v25.10.1, so the constraint floors at the same minor (one patch below the validated chart version). This matches the H100 convention of pinning the floor to the lowest gpu-operator known to host the target accelerator cleanly.

Layering. Block goes at the intent layer (gb200-eks-training.yaml, gb200-eks-inference.yaml) so all downstream OS+platform leaves inherit. Matches the comment in the existing H100 EKS training overlay ("Defined at the intent layer (not OS-specific) so all OS variants inherit them").

No new check definitions. The four check names (operator-health, expected-resources, gpu-operator-version, check-nvidia-smi) all already exist in pkg/validator/catalog — this PR only adds them to the GB200/EKS validation lists.
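
For reference, the block described in the notes above would presumably look like the sketch below. It assumes the GB200/EKS overlays reuse the exact shape of the H100/EKS block quoted under Motivation, with only the version floor changed. The placement within each overlay file and any surrounding comments are not shown in this description, so treat them as unknown.

```yaml
# Sketch only: assumed to mirror the H100/EKS validation.deployment block,
# added to both gb200-eks-training.yaml and gb200-eks-inference.yaml.
validation:
  deployment:
    # The four existing deployment-phase checks; no new check definitions.
    checks: [operator-health, expected-resources, gpu-operator-version, check-nvidia-smi]
    constraints:
      # GB200-specific floor (the H100 overlays use ">= v24.6.0").
      - name: Deployment.gpu-operator.version
        value: ">= v25.10.0"
```

Because the block sits at the intent layer, the four OS+platform leaves listed in the Summary should pick it up without any per-leaf changes.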

Testing

make qualify  # clean

Live cluster validation — ran the new deployment-phase block against a real GB200/EKS cluster (6 nodes, K8s 1.34.7-eks, gpu-operator v25.10.1):

aicr recipe --service eks --accelerator gb200 --os ubuntu \
            --intent training --platform kubeflow -o recipe.yaml
KUBECONFIG=<gb200-eks-kubeconfig> \
  AICR_VALIDATOR_IMAGE_TAG=edge \
  aicr validate --recipe recipe.yaml --phase deployment

Result: phase passed, 4/4 checks, 26.4s.

| Check | Status |
|---|---|
| operator-health | passed |
| expected-resources | passed |
| gpu-operator-version | passed (v25.10.1 ≥ v25.10.0) |
| check-nvidia-smi | passed |

The gpu-operator-version check exercises the new constraint and confirms the v25.10.x chart pin clears the v25.10.0 floor.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Two YAML overlays. Constraint values are recipe data; no code paths change. This tightens validation strictness: if the gpu-operator chart pin ever drops below v25.10.0, the validate step will catch it instead of silently shipping a stale operator. Revert is the inverse diff.

Rollout notes: None. No migration, no flag, no compatibility shim. Bundles produced against the current resolved recipe (gpu-operator v25.10.1) already clear the new floor.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

H100/EKS overlays already declare an explicit `validation.deployment`
block at the intent layer (operator-health, expected-resources,
gpu-operator-version, check-nvidia-smi + a gpu-operator version
constraint). GB200/EKS was relying only on auto-discovered component
health-checks for the deployment phase, with no version floor.

Add the same four-check block to gb200-eks-training.yaml and
gb200-eks-inference.yaml at the intent layer so every OS+platform
variant inherits it (ubuntu-training, ubuntu-training-kubeflow,
ubuntu-inference, ubuntu-inference-dynamo).

Version floor is `Deployment.gpu-operator.version >= v25.10.0`.
GB200 support stabilized in gpu-operator v25.10; the chart pin in
the resolved GB200/EKS recipes is currently v25.10.1, so the
constraint floors at the same minor — matches the H100 convention
of pinning one patch below the validated chart version.
@njhensley njhensley requested a review from a team as a code owner May 14, 2026 20:25
@coderabbitai

coderabbitai Bot commented May 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: fca4ee04-f3e7-411f-8076-311f1ce96a3a

📥 Commits

Reviewing files that changed from the base of the PR and between 1f3e1c2 and 874b4ea.

📒 Files selected for processing (2)
  • recipes/overlays/gb200-eks-inference.yaml
  • recipes/overlays/gb200-eks-training.yaml

📝 Walkthrough

This PR adds deployment validation sections to two GB200 EKS recipe overlays. The gb200-eks-inference.yaml and gb200-eks-training.yaml files now each include a validation layer with four deployment checks—operator health, expected resources, GPU operator version, and nvidia-smi—plus a constraint requiring GPU operator version >= v25.10.0. These changes establish consistent validation requirements and a minimum GPU operator version baseline for both training and inference workloads.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Suggested labels

area/recipes, size/M

Suggested reviewers

  • lalitadithya
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately and specifically summarizes the main change: adding deployment validation to GB200/EKS recipes. |
| Description check | ✅ Passed | The description is comprehensive and clearly related to the changeset, explaining the motivation, implementation details, testing approach, and risk assessment. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@njhensley njhensley self-assigned this May 14, 2026