feat(recipes): add RTX PRO 6000 Blackwell (B40) overlays for EKS #1046

Merged
yuanchen8911 merged 1 commit into NVIDIA:main from yuanchen8911:feat/rtx-pro-6000-eks-inference-overlays-1045 on May 28, 2026

Conversation

@yuanchen8911 yuanchen8911 commented May 26, 2026

Contributor

Summary

Adds the four missing EKS overlays for the rtx-pro-6000 accelerator enum, mirroring the H100 EKS inference layout: a service-bound parent, an Ubuntu OS leaf, and Dynamo + NIM platform leaves.

Motivation / Context

The rtx-pro-6000 enum is declared in pkg/recipe/criteria.go but its only overlays today target LKE + Workstation Edition. AWS now exposes RTX PRO 6000 Blackwell as the g7e.48xlarge instance family (8× RTX PRO 6000 Blackwell Server Edition, internal codename B40), so aicr recipe --service eks --accelerator rtx-pro-6000 had no resolvable leaf. Reference cluster: dgxcloud/mk8s/manifests:clusters/av-teststudio-prod/av-teststudio-40-prod-1/cluster-spec.yaml (internal GitLab) — runs gpu-operator driver 580.105.08, CDI on, DRA off, EFA on.

Fixes: #1045
Related: #1003 (sibling L40/L40S effort), #969 (validation phase floor)

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • Recipe engine / data (pkg/recipe)

(YAML data + two existing Go test tables extended to cover the new overlays. No production Go code changes.)

Implementation Notes

Layout mirrors h100-eks-ubuntu-inference{,-dynamo,-nim}.yaml:

| New overlay | Inherits from | Validation contract |
|---|---|---|
| rtx-pro-6000-eks-inference | eks-inference | conformance declared; deployment phase inherited from rtx-pro-6000-any.yaml wildcard |
| rtx-pro-6000-eks-ubuntu-inference | rtx-pro-6000-eks-inference | inherits from parent + wildcard |
| rtx-pro-6000-eks-ubuntu-inference-dynamo | rtx-pro-6000-eks-ubuntu-inference | self-declares deployment + performance + conformance (Dynamo platform replaces wildcard deployment:) |
| rtx-pro-6000-eks-ubuntu-inference-nim | rtx-pro-6000-eks-ubuntu-inference | self-declares deployment + conformance (NIM platform replaces wildcard deployment:) |
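
The "replaces wildcard deployment:" behavior in the table can be illustrated with a minimal Go sketch. It is not the pkg/recipe implementation: the `Phase`, `Overlay`, and `Compose` names are hypothetical, and the `Deployment.example.check` key is made up. It only assumes that a phase declared by a leaf replaces the wildcard's phase wholesale, which is why a self-declaring leaf has to restate the gpu-operator floor.

```go
package main

import "fmt"

// Phase is a hypothetical stand-in for one validation phase block
// (deployment, performance, or conformance) declared by an overlay.
type Phase struct {
	Constraints map[string]string
}

// Overlay is a hypothetical, simplified overlay: a name plus the
// validation phases it declares itself.
type Overlay struct {
	Name   string
	Phases map[string]Phase
}

// Compose applies overlays in order, wildcard first and leaf last.
// A phase declared by a later overlay replaces the earlier one wholesale;
// phases a later overlay does not declare are inherited unchanged.
func Compose(chain ...Overlay) map[string]Phase {
	out := map[string]Phase{}
	for _, o := range chain {
		for name, p := range o.Phases {
			out[name] = p // replace the whole phase, no deep merge
		}
	}
	return out
}

func main() {
	const floorKey = "Deployment.gpu-operator.version"
	const floor = ">= v25.10.0"

	wildcard := Overlay{Name: "rtx-pro-6000-any", Phases: map[string]Phase{
		"deployment": {Constraints: map[string]string{floorKey: floor}},
	}}
	// Plain inference leaf: declares no deployment phase, so it inherits the floor.
	plain := Overlay{Name: "plain-leaf"}
	// Platform leaf that declares deployment but omits the floor (made-up key).
	forgetful := Overlay{Name: "platform-leaf-without-floor", Phases: map[string]Phase{
		"deployment": {Constraints: map[string]string{"Deployment.example.check": "true"}},
	}}
	// Platform leaf that restates the floor alongside its own constraint.
	restated := Overlay{Name: "platform-leaf-with-floor", Phases: map[string]Phase{
		"deployment": {Constraints: map[string]string{
			"Deployment.example.check": "true",
			floorKey:                   floor,
		}},
	}}

	for _, leaf := range []Overlay{plain, forgetful, restated} {
		got, ok := Compose(wildcard, leaf)["deployment"].Constraints[floorKey]
		fmt.Printf("%-28s floor kept: %v %q\n", leaf.Name, ok, got)
	}
}
```

Under that assumption, only the leaf that omits the floor loses it, which is the failure mode the Dynamo and NIM leaves avoid by self-declaring the same value.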

Key decisions:

  • K8s version pin: >= 1.32.4 for the base/Ubuntu leaves (matches H100 EKS), >= 1.34 for Dynamo/NIM leaves (DRA GA requirement, matches H100 EKS Dynamo/NIM).
  • gpu-operator floor: >= v25.10.0 (Blackwell-era), matching the rtx-pro-6000-any.yaml wildcard. Dynamo and NIM self-declare the same value so the per-phase replace semantics keep the version pin intact.
  • No nodewright-customizations: matches the existing rtx-pro-6000-lke-* pattern. The H100 EKS overlays add this for H100-specific kernel tuning; that's not the established convention for RTX Pro 6000 single-card inference.
  • No NCCL bandwidth threshold: single-node inference is the dominant use case, and there is no NVLink between cards on PCIe Blackwell Server Edition.
  • Conformance check sets: plain inference leaves use the 8-check LKE set (no gang-scheduling); Dynamo/NIM use the 11-check H100 set (adds gang-scheduling, robust-controller, secure-accelerator-access).
  • Performance phase: Dynamo leaf has a placeholder inference-perf block with the same loose smoke-test thresholds as the H100 Dynamo overlay, to be tightened once reference runs are published. NIM leaf omits performance to match the H100 NIM overlay (warn-only under the phase floor; not required).
  • SKU edition: a single rtx-pro-6000 enum still covers both Workstation (LKE) and Server (EKS) Edition. Per #1045 (feat(recipes): add concrete RTX PRO 6000 Blackwell service-bound overlays for EKS), splitting them into separate enum values is deferred unless an empirical floor difference forces it.
  • knownGaps deferred (not added here): pkg/recipe/validation_phase_floor_test.go marks performance as warn-only for inference-NIM in default mode (AICR_VALIDATION_FLOOR_STRICT is not set in any CI workflow today, verified via grep). The hygiene check at lines 337-352 fails the test if a knownGaps entry doesn't downgrade an actual failure, so adding an entry now for the missing NIM performance phase would be flagged as stale and break the test. This is the same trade-off the floor test itself documents (lines 64-75) for the GB200 OKE training trio. When strict mode is enabled in CI (tracked under #1003, add concrete L40 / L40S service-bound overlays; #969, close deployment-phase validation coverage gaps for accelerator-bound GPU recipes; and #1007, add performance-phase constraints to OKE overlays once OCI testbed lands), entries for the H100 NIM, the new RTX PRO 6000 NIM, and the OKE GB200 training leaves should land together in that PR, not piecemeal here.
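
The stale-entry reasoning in the last bullet can be sketched in Go. This is a simplified model, not the code in validation_phase_floor_test.go: `Finding`, `StaleGaps`, and the `overlay/phase` key format are hypothetical, and strict mode is passed as a plain bool rather than read from AICR_VALIDATION_FLOOR_STRICT.

```go
package main

import (
	"fmt"
	"sort"
)

// Severity is how the floor treats a missing phase on one overlay.
type Severity int

const (
	Warn Severity = iota // logged only
	Fail                 // fails the test unless downgraded by a known gap
)

// Finding is a hypothetical record: an overlay is missing a phase.
type Finding struct {
	Overlay, Phase string
	Severity       Severity
}

// key builds the hypothetical identifier shared by findings and gap entries.
func key(overlay, phase string) string { return overlay + "/" + phase }

// StaleGaps returns the known-gap entries that do not downgrade an actual
// failure. A non-empty result is what a hygiene check would reject.
func StaleGaps(findings []Finding, knownGaps []string) []string {
	failing := map[string]bool{}
	for _, f := range findings {
		if f.Severity == Fail {
			failing[key(f.Overlay, f.Phase)] = true
		}
	}
	var stale []string
	for _, g := range knownGaps {
		if !failing[g] {
			stale = append(stale, g)
		}
	}
	sort.Strings(stale)
	return stale
}

// nimPerformanceFinding models the missing NIM performance phase:
// warn-only by default, a failure once strict mode is on.
func nimPerformanceFinding(strict bool) Finding {
	sev := Warn
	if strict {
		sev = Fail
	}
	return Finding{
		Overlay:  "rtx-pro-6000-eks-ubuntu-inference-nim",
		Phase:    "performance",
		Severity: sev,
	}
}

func main() {
	gaps := []string{key("rtx-pro-6000-eks-ubuntu-inference-nim", "performance")}
	for _, strict := range []bool{false, true} {
		stale := StaleGaps([]Finding{nimPerformanceFinding(strict)}, gaps)
		fmt.Printf("strict=%v stale entries: %v\n", strict, stale)
	}
}
```

In this model the entry is stale while the finding is only a warning and becomes legitimate once strict mode turns it into a failure, which matches the argument for landing the entries together with the strict-mode change.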

Testing

# overlay phase-floor gate (the gate referenced by #1045 and #1003)
go test -v ./pkg/recipe/... -run TestOverlayValidationPhaseFloor

# extended invariant tests (new entries for the 4 EKS RTX PRO 6000 overlays)
go test -v ./pkg/recipe/... -run "TestNFDTopologyUpdater_OverlayCoverage|TestConformanceRecipeInvariants"

# pkg/recipe race + lint
GOFLAGS="-mod=vendor" go test -race ./pkg/recipe/...
golangci-lint run -c .golangci.yaml ./pkg/recipe/...

# full gate
make qualify

Results:

  • TestOverlayValidationPhaseFloor passes for all 4 new overlays (deployment phase inherited from wildcard for plain inference; self-declared for Dynamo/NIM with both halves of the gpu-operator-version gate).
  • TestNFDTopologyUpdater_OverlayCoverage extended with 4 new entries (rtx-pro-6000-eks-{inference,ubuntu-inference,ubuntu-inference-dynamo,ubuntu-inference-nim}); all assert topologyUpdater.enable=true and pass.
  • TestConformanceRecipeInvariants extended with the same 4 entries; resolved component counts and conformance check coverage match the H100 EKS sibling pattern (15/15/17/16 components, 8/8/11/11 conformance checks).
  • golangci-lint: 0 issues.
  • make qualify: all 22 chainsaw tests pass, no vulnerabilities, license headers OK, codebase qualification completed.
  • Smoke test: aicr recipe --service eks --accelerator rtx-pro-6000 --os ubuntu --intent inference --platform dynamo resolves cleanly (17 components, 8 overlays composed).
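
For readers unfamiliar with table-driven invariant checks, the counts reported above can be expressed as a small Go sketch. The `invariant` type, `resolveFunc`, and `checkInvariants` are hypothetical and do not mirror the real TestConformanceRecipeInvariants table. The sketch also assumes the 15/15/17/16 and 8/8/11/11 figures are listed in the same order as the overlay names (the smoke test confirms 17 components for the Dynamo leaf).

```go
package main

import "fmt"

// invariant is one row of a hypothetical test table: the counts an
// overlay is expected to resolve to.
type invariant struct {
	overlay    string
	components int
	checks     int
}

// Counts reported in this PR for the four EKS leaves (ordering assumed).
var eksInvariants = []invariant{
	{"rtx-pro-6000-eks-inference", 15, 8},
	{"rtx-pro-6000-eks-ubuntu-inference", 15, 8},
	{"rtx-pro-6000-eks-ubuntu-inference-dynamo", 17, 11},
	{"rtx-pro-6000-eks-ubuntu-inference-nim", 16, 11},
}

// resolveFunc is a placeholder for whatever resolves an overlay into
// (component count, conformance check count); ok is false if unresolvable.
type resolveFunc func(overlay string) (components, checks int, ok bool)

// checkInvariants returns one message per row whose resolved counts
// are missing or differ from the expected values.
func checkInvariants(table []invariant, resolve resolveFunc) []string {
	var problems []string
	for _, row := range table {
		c, k, ok := resolve(row.overlay)
		switch {
		case !ok:
			problems = append(problems, fmt.Sprintf("%s: does not resolve", row.overlay))
		case c != row.components || k != row.checks:
			problems = append(problems, fmt.Sprintf(
				"%s: got %d components / %d checks, want %d / %d",
				row.overlay, c, k, row.components, row.checks))
		}
	}
	return problems
}

func main() {
	// Fake resolver for illustration only: it knows just the Dynamo leaf.
	fake := func(overlay string) (int, int, bool) {
		if overlay == "rtx-pro-6000-eks-ubuntu-inference-dynamo" {
			return 17, 11, true
		}
		return 0, 0, false
	}
	for _, p := range checkInvariants(eksInvariants, fake) {
		fmt.Println(p)
	}
}
```

Listing each leaf explicitly means a missing or renamed overlay shows up as a named problem instead of silently dropping out of an auto-enumerated set.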

Coverage: YAML data + test additions only — per CLAUDE.md the per-package coverage gate does not apply (no new production Go code). Project-wide make test-coverage floor (70%) still passes under make qualify.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: Pure addition of four new overlay files plus two test-table extensions. No existing recipe behavior changes; the new overlays only resolve when a user explicitly queries --service eks --accelerator rtx-pro-6000. To revert, delete the four files and revert the two test edits.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (TestOverlayValidationPhaseFloor auto-enumerates; TestNFDTopologyUpdater_OverlayCoverage and TestConformanceRecipeInvariants extended explicitly to cover the 4 new EKS leaves)
  • I updated docs if user-facing behavior changed (no user-facing CLI/API surface changed — overlays are data, discoverable via aicr criteria list)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@yuanchen8911

Contributor Author

Tracking issue: #1045 (feat(recipes): add concrete RTX PRO 6000 Blackwell service-bound overlays for EKS).

@coderabbitai

coderabbitai Bot commented May 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds four RecipeMetadata overlays enabling RTX PRO 6000 inference on EKS: a base EKS inference overlay wiring GPU Operator and K8s version floor, an Ubuntu specialization, and two platform-specific leaves (dynamo and nim) that pin Helm charts/overrides, wire component dependencies, and declare deployment, performance, and conformance validations.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/aicr#1001: Related deployment-phase validation and gpu-operator version floor logic referenced by these overlays.

Suggested reviewers

  • xdu31
  • lockwobr
  • mchmarny
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | The PR fully implements the minimum scope requirements from issue #1045: adds all four required EKS overlays (parent service-bound, OS-bound, and two platform leaves), inherits/sets proper K8s version pins (>=1.32.4 base, >=1.34 Dynamo/NIM), maintains gpu-operator floor >=v25.10.0, and passes TestOverlayValidationPhaseFloor validation. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to the four new overlay YAML files (rtx-pro-6000-eks-inference.yaml, rtx-pro-6000-eks-ubuntu-inference.yaml, rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml, and rtx-pro-6000-eks-ubuntu-inference-nim.yaml) with no out-of-scope additions like Go code changes, training overlays, or enum splits. |
| Description check | ✅ Passed | The PR description clearly documents the addition of four EKS overlays for rtx-pro-6000, directly matching the changeset of four new YAML files with detailed motivation, implementation strategy, and test results. |
| Title check | ✅ Passed | The title accurately reflects the main change: adding four RTX PRO 6000 overlay recipes for EKS inference workloads with clear specification of the accelerator type and cloud platform. |


mchmarny previously approved these changes May 26, 2026
@mchmarny mchmarny self-requested a review May 26, 2026 22:42
@yuanchen8911 yuanchen8911 force-pushed the feat/rtx-pro-6000-eks-inference-overlays-1045 branch 2 times, most recently from ecd0494 to 2f8a5e8 Compare May 27, 2026 00:17
@yuanchen8911 yuanchen8911 removed the request for review from mchmarny May 27, 2026 00:19

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-nim.yaml`:
- Around line 43-54: The PR added a new Helm chart pin for k8s-nim-operator
(version "3.1.0") but the BOM docs were not regenerated; run make bom-docs
locally, verify the generated docs reflect the k8s-nim-operator 3.1.0 entry, and
commit the updated docs/user/container-images.md alongside your changes so the
BOM docs include the new chart pin.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 8ae5042e-e6c2-470a-9124-beacaf7c3f2c

📥 Commits

Reviewing files that changed from the base of the PR and between ecd0494 and 2f8a5e8.

📒 Files selected for processing (4)
  • recipes/overlays/rtx-pro-6000-eks-inference.yaml
  • recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-nim.yaml
  • recipes/overlays/rtx-pro-6000-eks-ubuntu-inference.yaml

Comment thread recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-nim.yaml
@yuanchen8911 yuanchen8911 force-pushed the feat/rtx-pro-6000-eks-inference-overlays-1045 branch from 2f8a5e8 to 8d28007 Compare May 27, 2026 19:55

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/rtx-pro-6000-eks-inference.yaml`:
- Around line 39-53: The comment "Single-card inference SKU — no NVLink, no NCCL
multi-node tuning needed" is ambiguous given the instance description "8x RTX
PRO 6000 Blackwell Server Edition"; update the comment near componentRefs (where
gpu-operator is declared) to clarify that "single-card" refers to the inference
workload pattern (per-GPU inference without NVLink/NCCL optimizations) rather
than the hardware count—e.g., mention that the node may have multiple GPUs (8x)
but workloads are configured for independent single-GPU inference so NVLink/NCCL
tuning is not required; keep the change local to the comment around
gpu-operator/componentRefs to avoid code changes.

In `@recipes/overlays/rtx-pro-6000-eks-ubuntu-inference.yaml`:
- Around line 31-38: Remove the redundant constraint block for
"K8s.server.version" from this child overlay: the constraint named
K8s.server.version with value ">= 1.32.4" is already declared in the parent
overlay rtx-pro-6000-eks-inference, so delete the entire constraints entry that
references K8s.server.version in rtx-pro-6000-eks-ubuntu-inference.yaml to avoid
duplication and maintenance overhead.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 4560a871-5582-42c9-a456-63b148ab5ebe

📥 Commits

Reviewing files that changed from the base of the PR and between 2f8a5e8 and 8d28007.

📒 Files selected for processing (4)
  • recipes/overlays/rtx-pro-6000-eks-inference.yaml
  • recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-nim.yaml
  • recipes/overlays/rtx-pro-6000-eks-ubuntu-inference.yaml

Comment thread recipes/overlays/rtx-pro-6000-eks-inference.yaml
Comment thread recipes/overlays/rtx-pro-6000-eks-ubuntu-inference.yaml
@yuanchen8911 yuanchen8911 force-pushed the feat/rtx-pro-6000-eks-inference-overlays-1045 branch from 8d28007 to 8bcee04 Compare May 28, 2026 13:57

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml`:
- Around line 46-53: The overlay pins new chart versions for grove and
dynamo-platform; regenerate and commit the BOM docs by running make bom-docs,
then add the updated docs/user/container-images.md to this PR so the container
image BOM reflects the new chart version pins (ensure the regenerated file is
staged and included with the same changes that updated grove and
dynamo-platform).
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 9eafd636-6509-4fc7-86f7-38e321ccd621

📥 Commits

Reviewing files that changed from the base of the PR and between 8d28007 and 8bcee04.

📒 Files selected for processing (4)
  • recipes/overlays/rtx-pro-6000-eks-inference.yaml
  • recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml
  • recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-nim.yaml
  • recipes/overlays/rtx-pro-6000-eks-ubuntu-inference.yaml

Comment thread recipes/overlays/rtx-pro-6000-eks-ubuntu-inference-dynamo.yaml
@yuanchen8911 yuanchen8911 force-pushed the feat/rtx-pro-6000-eks-inference-overlays-1045 branch from 8bcee04 to a69d690 Compare May 28, 2026 14:54
@yuanchen8911 yuanchen8911 changed the title from "WIP: feat(recipes): add RTX PRO 6000 Blackwell overlays for EKS" to "feat(recipes): add RTX PRO 6000 Blackwell overlays for EKS" May 28, 2026
@yuanchen8911 yuanchen8911 marked this pull request as ready for review May 28, 2026 15:15
@yuanchen8911 yuanchen8911 requested review from a team as code owners May 28, 2026 15:15
@yuanchen8911 yuanchen8911 force-pushed the feat/rtx-pro-6000-eks-inference-overlays-1045 branch from a69d690 to f052ff5 Compare May 28, 2026 19:17
@yuanchen8911 yuanchen8911 marked this pull request as draft May 28, 2026 19:51
@yuanchen8911 yuanchen8911 changed the title from "feat(recipes): add RTX PRO 6000 Blackwell overlays for EKS" to "feat(recipes): add RTX PRO 6000 Blackwell (B40) overlays for EKS" May 28, 2026
Adds the missing EKS recipe path for the RTX PRO 6000 enum, matching the
H100 inference overlay layout:

- rtx-pro-6000-eks-inference            (service-bound parent; conformance)
- rtx-pro-6000-eks-ubuntu-inference     (Ubuntu OS leaf)
- rtx-pro-6000-eks-ubuntu-inference-dynamo  (Dynamo platform, DRA, perf)
- rtx-pro-6000-eks-ubuntu-inference-nim     (NIM Operator platform, DRA)

The deployment-phase floor (gpu-operator >= v25.10.0) is inherited from
the rtx-pro-6000-any.yaml wildcard for the plain inference leaves; the
Dynamo and NIM leaves self-declare the same floor since their own
deployment block replaces the wildcard contribution.

Extend two existing invariant tables to cover the new leaves so future
regressions are caught at PR time without a cluster:

- pkg/recipe/metadata_test.go: TestNFDTopologyUpdater_OverlayCoverage gains
  4 entries asserting topologyUpdater.enable=true on all 4 EKS leaves.
- pkg/recipe/conformance_test.go: TestConformanceRecipeInvariants gains
  4 entries asserting required components and conformance checks resolve
  on all 4 EKS leaves (15/15/17/16 components; 8/8/11/11 checks).

knownGaps for the missing inference-NIM performance phase is intentionally
NOT added here: AICR_VALIDATION_FLOOR_STRICT is not set in any CI workflow
today, so the missing phase logs WARN, not FAIL, and the floor test's
stale-entry hygiene check would fail a knownGaps entry that doesn't
downgrade an actual failure. Same trade-off the floor test documents for
the GB200 OKE training trio. Entries for both H100 NIM and the new
RTX PRO 6000 NIM should land in the same PR that flips strict mode
(tracked under NVIDIA#1003 / NVIDIA#969 / NVIDIA#1007).

Reference cluster: AWS g7e.48xlarge (8x RTX PRO 6000 Blackwell Server
Edition, internal codename B40) per the av-teststudio-40-prod-1
cluster spec on internal GitLab.

Closes NVIDIA#1045
@yuanchen8911 yuanchen8911 force-pushed the feat/rtx-pro-6000-eks-inference-overlays-1045 branch from f052ff5 to bc103d1 Compare May 28, 2026 21:52
@yuanchen8911 yuanchen8911 marked this pull request as ready for review May 28, 2026 21:52
@yuanchen8911 yuanchen8911 requested a review from mchmarny May 28, 2026 22:05
@yuanchen8911 yuanchen8911 enabled auto-merge (squash) May 28, 2026 22:07

@mchmarny mchmarny left a comment

Member

Clean overlay addition. The 4 new leaves mirror h100-eks-*-inference{,-dynamo,-nim}.yaml structurally, with the one correctness-load-bearing adjustment — Deployment.gpu-operator.version >= v25.10.0 — aligned to the rtx-pro-6000-any.yaml wildcard so the per-phase replace semantics keep the Blackwell floor intact in the self-declared deployment: blocks. DRA constraint (K8s.server.version >= 1.34) correctly gated to the Dynamo/NIM leaves that actually need it, and TestNFDTopologyUpdater_OverlayCoverage + TestConformanceRecipeInvariants were extended explicitly rather than relying on auto-enumeration.

The deferred knownGaps reasoning is well-justified — landing the NIM performance entry now would be stale-flagged by the hygiene check until strict-mode CI exists. Tracking that across H100 NIM / RTX PRO 6000 NIM / GB200 OKE training in one PR (under #1003 / #969 / #1007) is the right batching.

CI: all Tier 1 deploy variants passing; gke-cos-training (helm) still in progress at review time but unrelated to the RTX overlay surface. LGTM.

@yuanchen8911 yuanchen8911 merged commit f385af2 into NVIDIA:main May 28, 2026
203 of 205 checks passed

Development

Successfully merging this pull request may close these issues.

feat(recipes): add concrete RTX PRO 6000 Blackwell service-bound overlays for EKS

2 participants