feat(recipes): add RTX PRO 6000 Blackwell (B40) overlays for EKS #1046
📝 Walkthrough
Adds four RecipeMetadata overlays enabling RTX PRO 6000 inference on EKS: a base EKS inference overlay wiring GPU Operator and K8s version floor, an Ubuntu specialization, and two platform-specific leaves (dynamo and nim) that pin Helm charts/overrides, wire component dependencies, and declare deployment, performance, and conformance validations.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
ecd0494 to 2f8a5e8
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-nim.yaml`:
- Around line 43-54: The PR added a new Helm chart pin for k8s-nim-operator
(version "3.1.0") but the BOM docs were not regenerated; run make bom-docs
locally, verify the generated docs reflect the k8s-nim-operator 3.1.0 entry, and
commit the updated docs/user/container-images.md alongside your changes so the
BOM docs include the new chart pin.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 8ae5042e-e6c2-470a-9124-beacaf7c3f2c
📒 Files selected for processing (4)
- recipes/overlays/rtx-pro-6000-eks-inference.yaml
- recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml
- recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-nim.yaml
- recipes/overlays/rtx-pro-6000-eks-ubuntu-inference.yaml
2f8a5e8 to 8d28007
Actionable comments posted: 2
Inline comments:
In `@recipes/overlays/rtx-pro-6000-eks-inference.yaml`:
- Around line 39-53: The comment "Single-card inference SKU — no NVLink, no NCCL
multi-node tuning needed" is ambiguous given the instance description "8x RTX
PRO 6000 Blackwell Server Edition"; update the comment near componentRefs (where
gpu-operator is declared) to clarify that "single-card" refers to the inference
workload pattern (per-GPU inference without NVLink/NCCL optimizations) rather
than the hardware count—e.g., mention that the node may have multiple GPUs (8x)
but workloads are configured for independent single-GPU inference so NVLink/NCCL
tuning is not required; keep the change local to the comment around
gpu-operator/componentRefs to avoid code changes.
In `@recipes/overlays/rtx-pro-6000-eks-ubuntu-inference.yaml`:
- Around line 31-38: Remove the redundant constraint block for
"K8s.server.version" from this child overlay: the constraint named
K8s.server.version with value ">= 1.32.4" is already declared in the parent
overlay rtx-pro-6000-eks-inference, so delete the entire constraints entry that
references K8s.server.version in rtx-pro-6000-eks-ubuntu-inference.yaml to avoid
duplication and maintenance overhead.
8d28007 to 8bcee04
Actionable comments posted: 1
Inline comments:
In `@recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml`:
- Around line 46-53: The overlay pins new chart versions for grove and
dynamo-platform; regenerate and commit the BOM docs by running make bom-docs,
then add the updated docs/user/container-images.md to this PR so the container
image BOM reflects the new chart version pins (ensure the regenerated file is
staged and included with the same changes that updated grove and
dynamo-platform).
8bcee04 to a69d690
a69d690 to f052ff5
Adds the missing EKS recipe path for the RTX PRO 6000 enum, matching the H100 inference overlay layout:
- rtx-pro-6000-eks-inference (service-bound parent; conformance)
- rtx-pro-6000-eks-ubuntu-inference (Ubuntu OS leaf)
- rtx-pro-6000-eks-ubuntu-inference-dynamo (Dynamo platform, DRA, perf)
- rtx-pro-6000-eks-ubuntu-inference-nim (NIM Operator platform, DRA)

The deployment-phase floor (gpu-operator >= v25.10.0) is inherited from the rtx-pro-6000-any.yaml wildcard for the plain inference leaves; the Dynamo and NIM leaves self-declare the same floor since their own deployment block replaces the wildcard contribution.

Extend two existing invariant tables to cover the new leaves so future regressions are caught at PR time without a cluster:
- pkg/recipe/metadata_test.go: TestNFDTopologyUpdater_OverlayCoverage gains 4 entries asserting topologyUpdater.enable=true on all 4 EKS leaves.
- pkg/recipe/conformance_test.go: TestConformanceRecipeInvariants gains 4 entries asserting required components and conformance checks resolve on all 4 EKS leaves (15/15/17/16 components; 8/8/11/11 checks).

knownGaps for the missing inference-NIM performance phase is intentionally NOT added here: AICR_VALIDATION_FLOOR_STRICT is not set in any CI workflow today, so the missing phase logs WARN, not FAIL, and the floor test's stale-entry hygiene check would fail a knownGaps entry that doesn't downgrade an actual failure. Same trade-off the floor test documents for the GB200 OKE training trio. Entries for both H100 NIM and the new RTX PRO 6000 NIM should land in the same PR that flips strict mode (tracked under NVIDIA#1003 / NVIDIA#969 / NVIDIA#1007).

Reference cluster: AWS g7e.48xlarge (8x RTX PRO 6000 Blackwell Server Edition, internal codename B40) per the av-teststudio-40-prod-1 cluster spec on internal GitLab.

Closes NVIDIA#1045
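The reason the Dynamo and NIM leaves restate the gpu-operator floor can be illustrated with a minimal Go sketch. It assumes only what the commit message states, namely that a leaf's own deployment block replaces the wildcard's contribution for that phase. The `Phases` type, the `mergePhases` function, and the `some-other-check` constraint are hypothetical names for illustration, not the project's actual API.

```go
package main

import "fmt"

// Phases maps a validation phase name (e.g. "deployment") to its constraints.
// This is a hypothetical model, not the project's real schema.
type Phases map[string][]string

// mergePhases layers child over parent with per-phase replace semantics:
// a phase declared by the child replaces the parent's phase wholesale,
// while phases the child omits are inherited unchanged.
func mergePhases(parent, child Phases) Phases {
	out := make(Phases, len(parent)+len(child))
	for name, cs := range parent {
		out[name] = append([]string(nil), cs...)
	}
	for name, cs := range child {
		out[name] = append([]string(nil), cs...) // replace, never append
	}
	return out
}

func main() {
	floor := "gpu-operator.version >= v25.10.0"
	wildcard := Phases{"deployment": {floor}}

	// Plain inference leaf: no deployment block, so the floor is inherited.
	plain := mergePhases(wildcard, Phases{})
	fmt.Println("plain:", plain["deployment"])

	// Leaf with its own deployment block that omits the floor: the floor is lost.
	lost := mergePhases(wildcard, Phases{"deployment": {"some-other-check"}})
	fmt.Println("lost: ", lost["deployment"])

	// Leaf that self-declares the floor alongside its own checks keeps it.
	kept := mergePhases(wildcard, Phases{"deployment": {floor, "some-other-check"}})
	fmt.Println("kept: ", kept["deployment"])
}
```

Under this model, the middle case is the regression the self-declared floor is meant to prevent.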
f052ff5 to bc103d1
mchmarny left a comment
Clean overlay addition. The 4 new leaves mirror h100-eks-*-inference{,-dynamo,-nim}.yaml structurally, with the one correctness-load-bearing adjustment — Deployment.gpu-operator.version >= v25.10.0 — aligned to the rtx-pro-6000-any.yaml wildcard so the per-phase replace semantics keep the Blackwell floor intact in the self-declared deployment: blocks. DRA constraint (K8s.server.version >= 1.34) correctly gated to the Dynamo/NIM leaves that actually need it, and TestNFDTopologyUpdater_OverlayCoverage + TestConformanceRecipeInvariants were extended explicitly rather than relying on auto-enumeration.
The deferred knownGaps reasoning is well-justified — landing the NIM performance entry now would be stale-flagged by the hygiene check until strict-mode CI exists. Tracking that across H100 NIM / RTX PRO 6000 NIM / GB200 OKE training in one PR (under #1003 / #969 / #1007) is the right batching.
CI: all Tier 1 deploy variants passing; gke-cos-training (helm) still in progress at review time but unrelated to the RTX overlay surface. LGTM.
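The deferred `knownGaps` reasoning in this review can be made concrete with a short Go sketch. It models only the behaviour described in the PR: a missing warn-only phase fails only in strict mode, and a `knownGaps` entry is stale when it does not correspond to an actual failure. The `gapKey` type and the `missingPhaseFails` and `staleGaps` functions are hypothetical and are not copied from `validation_phase_floor_test.go`.

```go
package main

import "fmt"

// gapKey identifies one (overlay, phase) pair. Hypothetical type.
type gapKey struct {
	Overlay string
	Phase   string
}

// missingPhaseFails reports whether a missing phase is a hard failure.
// Warn-only phases fail only when strict mode is enabled.
func missingPhaseFails(warnOnly, strict bool) bool {
	return !warnOnly || strict
}

// staleGaps returns the knownGaps entries that do not downgrade an
// actual failure; a hygiene check would reject these.
func staleGaps(knownGaps map[gapKey]string, failures map[gapKey]bool) []gapKey {
	var stale []gapKey
	for k := range knownGaps {
		if !failures[k] {
			stale = append(stale, k)
		}
	}
	return stale
}

func main() {
	key := gapKey{Overlay: "rtx-pro-6000-eks-ubuntu-inference-nim", Phase: "performance"}
	knownGaps := map[gapKey]string{key: "performance phase not declared yet"}

	for _, strict := range []bool{false, true} {
		failures := map[gapKey]bool{}
		if missingPhaseFails(true, strict) {
			failures[key] = true
		}
		fmt.Printf("strict=%v stale entries=%d\n", strict, len(staleGaps(knownGaps, failures)))
	}
}
```

In this model the entry is stale while strict mode is off and becomes valid once strict mode is on, which matches the argument for landing the entries together with the strict-mode change.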
Summary
Adds the four missing EKS overlays for the `rtx-pro-6000` accelerator enum, mirroring the H100 EKS inference layout: a service-bound parent, an Ubuntu OS leaf, and Dynamo + NIM platform leaves.
Motivation / Context
The `rtx-pro-6000` enum is declared in `pkg/recipe/criteria.go` but its only overlays today target LKE + Workstation Edition. AWS now exposes RTX PRO 6000 Blackwell as the `g7e.48xlarge` instance family (8× RTX PRO 6000 Blackwell Server Edition, internal codename B40), so `aicr recipe --service eks --accelerator rtx-pro-6000` had no resolvable leaf. Reference cluster: `dgxcloud/mk8s/manifests:clusters/av-teststudio-prod/av-teststudio-40-prod-1/cluster-spec.yaml` (internal GitLab); it runs gpu-operator driver `580.105.08`, CDI on, DRA off, EFA on.
Fixes: #1045
Related: #1003 (sibling L40/L40S effort), #969 (validation phase floor)
Type of Change
Component(s) Affected
`pkg/recipe`
(YAML data + two existing Go test tables extended to cover the new overlays. No production Go code changes.)
Implementation Notes
Layout mirrors `h100-eks-ubuntu-inference{,-dynamo,-nim}.yaml`:

| Overlay | Base | Deployment-phase floor |
| --- | --- | --- |
| `rtx-pro-6000-eks-inference` | `eks-inference` | `rtx-pro-6000-any.yaml` wildcard |
| `rtx-pro-6000-eks-ubuntu-inference` | `rtx-pro-6000-eks-inference` | inherited |
| `rtx-pro-6000-eks-ubuntu-inference-dynamo` | `rtx-pro-6000-eks-ubuntu-inference` | self-declared (`deployment:`) |
| `rtx-pro-6000-eks-ubuntu-inference-nim` | `rtx-pro-6000-eks-ubuntu-inference` | self-declared (`deployment:`) |

Key decisions:
- `K8s.server.version`: `>= 1.32.4` for the base/Ubuntu leaves (matches H100 EKS), `>= 1.34` for Dynamo/NIM leaves (DRA GA requirement, matches H100 EKS Dynamo/NIM).
- `Deployment.gpu-operator.version`: `>= v25.10.0` (Blackwell-era), matching the `rtx-pro-6000-any.yaml` wildcard. Dynamo and NIM self-declare the same value so the per-phase replace semantics keep the version pin intact.
- `nodewright-customizations` (omitted): matches the existing `rtx-pro-6000-lke-*` pattern. The H100 EKS overlays add this for H100-specific kernel tuning; that's not the established convention for RTX PRO 6000 single-card inference.
- Performance: the Dynamo leaf carries an `inference-perf` block with the same loose smoke-test thresholds as the H100 Dynamo overlay, to be tightened once reference runs are published. The NIM leaf omits performance to match the H100 NIM overlay (warn-only under the phase floor; not required).
- The `rtx-pro-6000` enum still covers both Workstation (LKE) and Server (EKS) Edition. Per #1045, splitting them into separate enum values is deferred unless an empirical floor difference forces it.
- `knownGaps` deferred (not added here): `pkg/recipe/validation_phase_floor_test.go` marks performance as warn-only for inference-NIM in default mode (`AICR_VALIDATION_FLOOR_STRICT` is not set in any CI workflow today, verified via `grep`). The hygiene check at lines 337-352 fails the test if a `knownGaps` entry doesn't downgrade an actual failure, so adding an entry now for the missing NIM performance phase would be stale-flagged and break the test. This is the same trade-off the floor test itself documents (lines 64-75) for the GB200 OKE training trio. When strict mode is toggled in CI (tracked under #1003 / #969 / #1007), entries for the H100 NIM, the new RTX PRO 6000 NIM, and the OKE GB200 training leaves should land together in that PR, not piecemeal here.
Testing
Results:
- `TestOverlayValidationPhaseFloor` passes for all 4 new overlays (deployment phase inherited from wildcard for plain inference; self-declared for Dynamo/NIM with both halves of the gpu-operator-version gate).
- `TestNFDTopologyUpdater_OverlayCoverage` extended with 4 new entries (`rtx-pro-6000-eks-{inference,ubuntu-inference,ubuntu-inference-dynamo,ubuntu-inference-nim}`); all assert `topologyUpdater.enable=true` and pass.
- `TestConformanceRecipeInvariants` extended with the same 4 entries; resolved component counts and conformance check coverage match the H100 EKS sibling pattern (15/15/17/16 components, 8/8/11/11 conformance checks).
- `golangci-lint`: 0 issues.
- `make qualify`: all 22 chainsaw tests pass, no vulnerabilities, license headers OK, codebase qualification completed.
- `aicr recipe --service eks --accelerator rtx-pro-6000 --os ubuntu --intent inference --platform dynamo` resolves cleanly (17 components, 8 overlays composed).
Coverage: YAML data + test additions only; per CLAUDE.md the per-package coverage gate does not apply (no new production Go code). Project-wide `make test-coverage` floor (70%) still passes under `make qualify`.
Risk Assessment
Rollout notes: Pure addition of four new overlay files plus two test-table extensions. No existing recipe behavior changes; the new overlays only resolve when a user explicitly queries `--service eks --accelerator rtx-pro-6000`. Easy revert is `rm` of the four files plus reverting the two test edits.
Checklist
- (`make test` with `-race`)
- (`make lint`)
- (`TestOverlayValidationPhaseFloor` auto-enumerates; `TestNFDTopologyUpdater_OverlayCoverage` and `TestConformanceRecipeInvariants` extended explicitly to cover the 4 new EKS leaves)
- (`aicr criteria list`)
- (`git commit -S`)