feat(recipes): add deployment validation to GB200/EKS recipes by njhensley · Pull Request #895 · NVIDIA/aicr

njhensley · 2026-05-14T20:25:49Z

Summary

Add an explicit validation.deployment block to the GB200/EKS intent-layer overlays so every OS+platform leaf (ubuntu-training, ubuntu-training-kubeflow, ubuntu-inference, ubuntu-inference-dynamo) inherits the four deployment-phase checks H100/EKS already runs, plus a GB200-specific gpu-operator version floor.

Motivation / Context

H100/EKS overlays (h100-eks-training.yaml, the H100/EKS inference variants, and the AKS siblings) declare:

validation:
  deployment:
    checks: [operator-health, expected-resources, gpu-operator-version, check-nvidia-smi]
    constraints:
      - name: Deployment.gpu-operator.version
        value: ">= v24.6.0"

gb200-eks-training.yaml had performance + conformance blocks but no deployment, and gb200-eks-inference.yaml had no validation block at all. Resolved recipes were running only auto-discovered component health-checks, with no explicit gpu-operator version floor — so a regression to gpu-operator v25.9 or earlier (which is missing GB200 stability fixes) would silently validate clean.

Mirroring the H100 pattern at the intent layer keeps the rule defined once and inherited by every OS/platform leaf.

Fixes: N/A
Related: matches the convention in h100-eks-training.yaml, h100-eks-ubuntu-inference-{nim,dynamo}.yaml

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Refactoring (no functional changes)
Build/CI/tooling

Component(s) Affected

CLI (cmd/aicr, pkg/cli)
API server (cmd/aicrd, pkg/api, pkg/server)
Recipe engine / data (pkg/recipe)
Bundlers (pkg/bundler, pkg/component/*)
Collectors / snapshotter (pkg/collector, pkg/snapshotter)
Validator (pkg/validator)
Core libraries (pkg/errors, pkg/k8s)
Docs/examples (docs/, examples/)
Other: recipes (recipes/overlays/gb200-eks-*.yaml)

Implementation Notes

Version floor. The constraint is Deployment.gpu-operator.version >= v25.10.0, not the H100 floor of >= v24.6.0. GB200 support stabilized in gpu-operator v25.10, and the chart pin in the resolved GB200/EKS recipes today is v25.10.1, so the floor sits at the start of that minor (v25.10.0), one patch below the validated chart version. This matches the H100 convention of pinning the floor to the lowest gpu-operator release known to host the target accelerator cleanly.
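
To make the floor concrete, here is a minimal Python sketch of how a ">= v25.10.0" constraint can be evaluated. It is an illustration only, not the validator's actual implementation: the dict layout, `parse_version`, and `meets_floor` are hypothetical names, and v25.9.0 stands in for "v25.9 or earlier". It also shows why the comparison has to be numeric per component rather than a plain string comparison.

```python
import re

# Mirrors the block described in this PR. The dict layout is an illustration,
# not the recipe engine's internal representation.
GB200_DEPLOYMENT = {
    "checks": [
        "operator-health",
        "expected-resources",
        "gpu-operator-version",
        "check-nvidia-smi",
    ],
    "constraints": [
        {"name": "Deployment.gpu-operator.version", "value": ">= v25.10.0"},
    ],
}


def parse_version(text):
    """Turn 'v25.10.1' into (25, 10, 1) so components compare numerically."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unsupported version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_floor(deployed, constraint):
    """Evaluate a '>= vX.Y.Z' constraint (the only operator handled here)."""
    operator, _, floor = constraint.partition(" ")
    if operator != ">=":
        raise ValueError(f"unsupported operator: {operator!r}")
    return parse_version(deployed) >= parse_version(floor)


floor = GB200_DEPLOYMENT["constraints"][0]["value"]
print(meets_floor("v25.10.1", floor))  # current chart pin -> True
print(meets_floor("v25.9.0", floor))   # hypothetical regression -> False
print("v25.9.0" >= "v25.10.0")         # naive string comparison -> True (wrong)
```

The last line is the failure mode the numeric parse avoids: compared as strings, "9" sorts after "1", so v25.9.0 would appear to clear a v25.10.0 floor.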

Layering. Block goes at the intent layer (gb200-eks-training.yaml, gb200-eks-inference.yaml) so all downstream OS+platform leaves inherit. Matches the comment in the existing H100 EKS training overlay ("Defined at the intent layer (not OS-specific) so all OS variants inherit them").
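
The inheritance idea can be sketched in Python as a recursive merge of a leaf overlay onto its intent-layer parent. This is a simplified model under stated assumptions: the real merge semantics live in the recipe engine and may differ, and `deep_merge` and the leaf contents (`os`, `platform` keys) are hypothetical placeholders. Only the `validation.deployment` values come from this PR.

```python
import copy


def deep_merge(base, overlay):
    """Return base with overlay applied: nested dicts merge, other values replace."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Intent-layer overlay: the deployment block is declared once, here.
INTENT_LAYER = {
    "validation": {
        "deployment": {
            "checks": [
                "operator-health",
                "expected-resources",
                "gpu-operator-version",
                "check-nvidia-smi",
            ],
            "constraints": [
                {"name": "Deployment.gpu-operator.version", "value": ">= v25.10.0"},
            ],
        }
    }
}

# Hypothetical leaf overlays. Their contents are placeholders; the point is
# that neither one mentions validation at all.
LEAVES = {
    "ubuntu-training": {"os": "ubuntu"},
    "ubuntu-training-kubeflow": {"os": "ubuntu", "platform": "kubeflow"},
}

for name, leaf in LEAVES.items():
    resolved = deep_merge(INTENT_LAYER, leaf)
    deployment = resolved["validation"]["deployment"]
    print(name, len(deployment["checks"]), deployment["constraints"][0]["value"])
```

In this model every leaf resolves with the same four checks and the same floor without repeating them, which is the property the intent-layer placement is meant to provide.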

No new check definitions. The four check names (operator-health, expected-resources, gpu-operator-version, check-nvidia-smi) all already exist in pkg/validator/catalog — this PR only adds them to the GB200/EKS validation lists.

Testing

make qualify  # clean

Live cluster validation — ran the new deployment-phase block against a real GB200/EKS cluster (6 nodes, K8s 1.34.7-eks, gpu-operator v25.10.1):

aicr recipe --service eks --accelerator gb200 --os ubuntu \
            --intent training --platform kubeflow -o recipe.yaml
KUBECONFIG=<gb200-eks-kubeconfig> \
  AICR_VALIDATOR_IMAGE_TAG=edge \
  aicr validate --recipe recipe.yaml --phase deployment

Result: phase passed, 4/4 checks, 26.4s.

| Check | Status |
|---|---|
| `operator-health` | passed |
| `expected-resources` | passed |
| `gpu-operator-version` | passed (v25.10.1 ≥ v25.10.0) |
| `check-nvidia-smi` | passed |

The gpu-operator-version check exercises the new constraint and confirms the v25.10.x chart pin clears the v25.10.0 floor.

Risk Assessment

Low — Isolated change, well-tested, easy to revert
Medium — Touches multiple components or has broader impact
High — Breaking change, affects critical paths, or complex rollout

Two YAML overlays. Constraint values are recipe-data, no code path changes. Tightens validation strictness — if the gpu-operator chart pin ever drops below v25.10.0 the validate step will catch it instead of silently shipping a stale operator. Revert is the inverse diff.

Rollout notes: None. No migration, no flag, no compatibility shim. Bundles produced against the current resolved recipe (gpu-operator v25.10.1) already clear the new floor.

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs if user-facing behavior changed
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

H100/EKS overlays already declare an explicit `validation.deployment` block at the intent layer (operator-health, expected-resources, gpu-operator-version, check-nvidia-smi + a gpu-operator version constraint). GB200/EKS was relying only on auto-discovered component health-checks for the deployment phase, with no version floor. Add the same four-check block to gb200-eks-training.yaml and gb200-eks-inference.yaml at the intent layer so every OS+platform variant inherits it (ubuntu-training, ubuntu-training-kubeflow, ubuntu-inference, ubuntu-inference-dynamo). Version floor is `Deployment.gpu-operator.version >= v25.10.0`. GB200 support stabilized in gpu-operator v25.10; the chart pin in the resolved GB200/EKS recipes is currently v25.10.1, so the constraint floors at the same minor — matches the H100 convention of pinning one patch below the validated chart version.

coderabbitai · 2026-05-14T20:27:03Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: fca4ee04-f3e7-411f-8076-311f1ce96a3a

📥 Commits

Reviewing files that changed from the base of the PR and between 1f3e1c2 and 874b4ea.

📒 Files selected for processing (2)

recipes/overlays/gb200-eks-inference.yaml
recipes/overlays/gb200-eks-training.yaml

📝 Walkthrough

This PR adds deployment validation sections to two GB200 EKS recipe overlays. The gb200-eks-inference.yaml and gb200-eks-training.yaml files now each include a validation layer with four deployment checks—operator health, expected resources, GPU operator version, and nvidia-smi—plus a constraint requiring GPU operator version >= v25.10.0. These changes establish consistent validation requirements and a minimum GPU operator version baseline for both training and inference workloads.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Suggested labels

area/recipes, size/M

Suggested reviewers

lalitadithya

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately and specifically summarizes the main change: adding deployment validation to GB200/EKS recipes. |
| Description check | ✅ Passed | The description is comprehensive and clearly related to the changeset, explaining the motivation, implementation details, testing approach, and risk assessment. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

njhensley requested a review from a team as a code owner May 14, 2026 20:25

github-actions Bot added area/recipes size/S labels May 14, 2026

njhensley self-assigned this May 14, 2026

mchmarny approved these changes May 14, 2026

mchmarny merged commit a2dab1b into NVIDIA:main May 14, 2026
58 checks passed

njhensley deleted the recipes/gb200-eks-deployment-validation branch June 23, 2026 16:22
