feat: GB200 EKS NET/NVLS NCCL validation and driver bump by njhensley · Pull Request #668 · NVIDIA/aicr

njhensley · 2026-04-24T14:19:02Z

Summary

Wire the GB200/EKS overlays (training and inference) to self-fulfill the NVreg_GrdmaPciTopoCheckOverride=1 kernel module flag, pin the driver to the NVIDIA-recommended floor for GB200+EFA (580.126.20), and adopt the transport-specific nccl-all-reduce-bw-net / -nvls performance checks that were introduced in #640.

Motivation / Context

After #640 landed, the NET + NVLS NCCL variants existed in the validator catalog but no shipped overlay referenced them — GB200/EKS users still ran the legacy auto-detect nccl-all-reduce-bw and got no transport-specific signal. Separately, GB200/EKS has two operator-side prerequisites that the existing overlay left as manual configuration:

NVreg_GrdmaPciTopoCheckOverride=1 — required on the NVIDIA driver so EFA can attach dma-buf to the Grace PCI topology. Without it, NCCL silently falls back to Socket on NET.
Driver version floor — NVIDIA's recommended driver for GB200+EFA is 580.126.20; the repo-wide default stayed on 580.105.08 for H100/B200.

This PR closes both gaps declaratively in the overlay, so GB200/EKS recipes come up correctly out of the box, and both the training and inference intents assert the two fabrics (EFA + MNNVL) actually carry traffic.

Fixes: N/A
Related: #640 (validator-side NET/NVLS implementation)

Type of Change

New feature (non-breaking change that adds functionality)
Documentation update

Component(s) Affected

Recipe engine / data (pkg/recipe)
Docs/examples (docs/, examples/)

Implementation Notes

Four commits, each independently revertable:

feat(recipes/gb200-eks): self-fulfill NVreg_GrdmaPciTopoCheckOverride=1 — Adds manifestFiles: [components/gpu-operator/manifests/kernel-module-params.yaml] (existing template, already in the component directory) and points ClusterPolicy.spec.driver.kernelModuleConfig.name at the resulting ConfigMap. The NVIDIA driver DaemonSet then mounts nvidia.conf at load time and the kernel comes up with the flag already set. The existing NVreg preflight check stays in place as a belt-and-suspenders guard for operators who override the module config at a higher layer.
feat(recipes/gb200-eks): bump driver to 580.126.20 — Override gpu-operator.driver.version at the GB200/EKS overlay layer only. H100/B200 and non-EKS GB200 (OKE, GKE, AKS) keep the global 580.105.08 default from components/gpu-operator/values.yaml. The version recommendation is GB200+EFA-specific on EKS; narrower blast radius than a global bump.
feat(recipes/gb200-eks): adopt nccl-all-reduce-bw-net and -nvls constraints — Replaces the inherited nccl-all-reduce-bw >= 720 from gb200-any-training with nccl-all-reduce-bw-net >= 40 and nccl-all-reduce-bw-nvls >= 500 on gb200-eks-training. ValidationPhase replaces rather than merges, so this is a clean swap on GB200/EKS recipes only; non-EKS GB200 and non-GB200 accelerators keep the legacy entry unchanged. Thresholds are deliberately conservative, sized for a 2-node GB200 pair — will be raised once production NVL72 data is available.
feat(recipes/gb200-eks): extend NCCL variants + NVreg fulfillment to inference — Mirrors the above into gb200-eks-inference.yaml. NCCL all-reduce-bw is fabric-health, not training-specific: multi-node inference (tensor-parallel serving, MoE expert parallelism) crosses the same EFA + MNNVL fabrics and has the same dma-buf attach requirement. Also updates docs/user/validation.md with a 3-variant table documenting when each check is selected.
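
The replace-rather-than-merge behavior described for ValidationPhase can be pictured with a minimal Python sketch. This is not the recipe engine's code: the function name `resolve_phase` and the dictionary layout are assumptions made for illustration. Only the check names and thresholds come from this PR.

```python
from typing import Optional

# Hypothetical in-memory form of a validation phase: check name -> minimum GB/s.
Phase = dict[str, float]


def resolve_phase(inherited: Optional[Phase], overlay: Optional[Phase]) -> Phase:
    """Return the overlay's phase if it declares one, else the inherited phase.

    Models "replace, not merge": once the overlay declares the phase,
    no entry from the parent survives.
    """
    if overlay is not None:
        return dict(overlay)
    return dict(inherited or {})


# Values quoted in the PR description.
gb200_any_training = {"nccl-all-reduce-bw": 720.0}
gb200_eks_training = {
    "nccl-all-reduce-bw-net": 40.0,
    "nccl-all-reduce-bw-nvls": 500.0,
}

eks = resolve_phase(gb200_any_training, gb200_eks_training)  # overlay declares the phase
oke = resolve_phase(gb200_any_training, None)                # no overlay phase, parent wins
print(eks)
print(oke)
```

Under these assumptions the legacy `nccl-all-reduce-bw` entry disappears from the GB200/EKS result without an explicit delete, while a recipe whose overlay declares no phase keeps the inherited entry.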

Mixin alternative, rejected. Initially tried extracting the shared GB200/EKS GPU-operator block into a mixin, but the mixin system is strictly additive — it can introduce new componentRef names but cannot extend a componentRef already declared upstream in the inheritance chain (gpu-operator comes from eks-training / eks-inference). Per-leaf duplication with a "keep in sync" comment is the pragmatic choice until the mixin system gains extension semantics; concretely that's ~10 duplicated lines across two files vs. a much larger refactor.

Testing

# Static verification
make qualify                                    # passes, 0 lint issues

# Recipe hydration
aicr query --service eks --accelerator gb200 --intent training --os ubuntu --platform kubeflow \
  --selector components.gpu-operator.values.driver            # → version 580.126.20, kernelModuleConfig.name nvidia-kernel-module-params
aicr query --service eks --accelerator gb200 --intent inference --os ubuntu \
  --selector components.gpu-operator.values.driver            # → same
aicr query --service oke --accelerator gb200 --intent training --os ubuntu \
  --selector components.gpu-operator.values.driver.version    # → 580.105.08 (unchanged)
aicr query --service eks --accelerator h100 --intent training --os ubuntu \
  --selector components.gpu-operator.values.driver.version    # → 580.105.08 (unchanged)

End-to-end on real GB200/EKS hardware (EKS 1.34, Ubuntu 24.04, 2× p6e-gb200.36xlarge, ASG-terminated before redeploy for a clean-boot driver rollout):

Check	Measured	Threshold
`nccl-all-reduce-bw-net`	329.59 GB/s	≥ 40 GB/s
`nccl-all-reduce-bw-nvls`	841.49 GB/s	≥ 500 GB/s
8 conformance checks	all pass	n/a
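
As a rough illustration of how conservative these thresholds are, the sketch below evaluates the measured numbers against the two minimums and computes the headroom ratio. The `Constraint` class and `evaluate` function are hypothetical names for this example and do not reflect the validator's actual types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraint:
    check: str
    minimum_gbps: float


def evaluate(constraints, measured):
    """Return {check: passed}; a missing measurement counts as a failure."""
    results = {}
    for c in constraints:
        value = measured.get(c.check)
        results[c.check] = value is not None and value >= c.minimum_gbps
    return results


# Thresholds and measurements as reported in this PR.
constraints = [
    Constraint("nccl-all-reduce-bw-net", 40.0),
    Constraint("nccl-all-reduce-bw-nvls", 500.0),
]
measured = {
    "nccl-all-reduce-bw-net": 329.59,
    "nccl-all-reduce-bw-nvls": 841.49,
}

results = evaluate(constraints, measured)
# Ratio of measured bandwidth to the minimum, per check.
headroom = {c.check: measured[c.check] / c.minimum_gbps for c in constraints}

for check, passed in results.items():
    print(f"{check}: passed={passed} headroom={headroom[check]:.2f}x")
```

The NET check has far more headroom than the NVLS check on this 2-node run, which fits the stated intent that the thresholds catch clear misconfigurations rather than modest regressions.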

On-cluster verification of the NVreg self-fulfillment:

nvidia-kernel-module-params ConfigMap present in gpu-operator namespace with options nvidia NVreg_GrdmaPciTopoCheckOverride=1
Driver DaemonSet image: nvcr.io/nvidia/driver:580.126.20-ubuntu24.04; ConfigMap mounted at /drivers/nvidia.conf
/proc/driver/nvidia/params on both GB200 nodes reports GrdmaPciTopoCheckOverride: 1 — flag is live in the loaded kernel module, not just declared
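
The declared-versus-live distinction above can be checked mechanically. The following Python sketch assumes the ConfigMap content is a modprobe-style `options` line and that `/proc/driver/nvidia/params` lists `Key: value` pairs, as the items above suggest. The function names are hypothetical, and the sample strings are illustrative rather than captured from a node.

```python
def parse_driver_params(text: str) -> dict[str, str]:
    """Parse 'Key: value' lines, the format shown for /proc/driver/nvidia/params."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            params[key.strip()] = value.strip()
    return params


def modprobe_options(conf_text: str, module: str = "nvidia") -> dict[str, str]:
    """Parse 'options <module> k=v ...' lines from a modprobe-style config."""
    options = {}
    for line in conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "options" and fields[1] == module:
            for field in fields[2:]:
                key, sep, value = field.partition("=")
                if sep:
                    options[key] = value
    return options


def flag_is_live(conf_text: str, params_text: str,
                 name: str = "GrdmaPciTopoCheckOverride") -> bool:
    """True when the declared NVreg_<name> value matches the loaded module's value."""
    declared = modprobe_options(conf_text).get("NVreg_" + name)
    loaded = parse_driver_params(params_text).get(name)
    return declared is not None and declared == loaded


# Illustrative inputs; on a node, params_text would come from
# open("/proc/driver/nvidia/params").read().
conf_text = "options nvidia NVreg_GrdmaPciTopoCheckOverride=1\n"
params_text = "ModifyDeviceFiles: 1\nGrdmaPciTopoCheckOverride: 1\n"
print(flag_is_live(conf_text, params_text))
```

Comparing both sources catches the case where the ConfigMap exists but the module was loaded before it was mounted.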

Risk Assessment

Low — Isolated to two GB200/EKS overlays + one user-doc page. No CLI, API, or validator-engine code changes.

Rollout notes:

The three behavior changes (driver 580.126.20, NVreg ConfigMap, NCCL variants) are scoped to GB200/EKS via overlay path, verified by aicr query against OKE/H100 recipes showing no drift.
The driver bump redeploys the nvidia-driver DaemonSet on GB200/EKS clusters that regenerate their bundle; plan this alongside a rolling-replace of GPU nodes (ASG terminate with --no-should-decrement-desired-capacity) so the new driver lands on clean-boot replacements rather than reinstalling over running kernel state.
No migration needed for non-GB200 or non-EKS recipes.

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint — golangci-lint run ./... → 0 issues)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality (validator-side tests for the new NCCL variants landed in #640, "feat(performance): add GB200 EKS support for NCCL all-reduce bandwidth check"; this PR is recipe/doc only)
I updated docs if user-facing behavior changed (docs/user/validation.md 3-variant table)
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S)

Wire the existing kernel-module-params ConfigMap template into the GB200/EKS overlay and point gpu-operator ClusterPolicy at it via driver.kernelModuleConfig.name. The NVIDIA driver DaemonSet now mounts nvidia.conf at load time and the kernel comes up with the flag set, which is required on GB200+EFA for EFA dma-buf attach to the Grace PCI topology. Without the flag, NCCL silently falls back to the Socket transport.

The existing NVreg preflight check stays in place as a belt-and-suspenders guard: it keeps its actionable error message for operators who disable the override at a higher layer or ship a cluster with a different module config.

Scope: GB200/EKS only. The PCIe-topology issue is EKS+EFA specific; OKE, GKE, and AKS GB200 overlays are unaffected.

Verified by bundling eks/gb200/ubuntu/training and inspecting gpu-operator/manifests/kernel-module-params.yaml + values.yaml; h100/eks bundle does NOT render the ConfigMap.

…floor)

Override gpu-operator.driver.version at the GB200/EKS overlay layer so GB200+EFA recipes ship with the NVIDIA-recommended driver floor while H100/B200 and non-EKS GB200 stay on the global 580.105.08 default in components/gpu-operator/values.yaml.

Narrower blast radius than a global bump: the version recommendation is specific to GB200+EFA dma-buf topology on EKS, and Skyhook compatibility already diverges between accelerators (see the GB200 no-op comment in this same overlay).

Verified with aicr query --selector components.gpu-operator.values.driver.version:

gb200/eks -> 580.126.20
h100/eks -> 580.105.08 (unchanged)
gb200/oke -> 580.105.08 (unchanged)
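
One way to picture an overlay-level override of a single component value is a recursive merge of the overlay's values over the global defaults. The Python sketch below assumes that behavior for illustration only; how the recipe engine actually layers component values is not shown in this PR, and `deep_merge` is a hypothetical helper. The version strings and the ConfigMap name are the ones quoted in the PR.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where override wins key by key; nested dicts are merged."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Global default, as described for components/gpu-operator/values.yaml.
global_defaults = {"driver": {"version": "580.105.08"}}

# Hypothetical shape of the GB200/EKS overlay values.
gb200_eks_overlay = {
    "driver": {
        "version": "580.126.20",
        "kernelModuleConfig": {"name": "nvidia-kernel-module-params"},
    }
}

gb200_eks = deep_merge(global_defaults, gb200_eks_overlay)
h100_eks = deep_merge(global_defaults, {})  # no overlay override

print(gb200_eks["driver"])
print(h100_eks["driver"])
```

Because the merge copies rather than mutates, the global default stays untouched for every recipe that does not pass through the GB200/EKS overlay, which is the narrower blast radius the commit message describes.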

…raints

Default GB200/EKS training recipes to the two transport-specific NCCL variants introduced earlier on this branch series. The validator Catalog entries already exist; no overlay referenced them until now.

NET exercises EFA and NVLS exercises MNNVL across the NVL72 IMEX domain. Each variant asserts its transport actually carried traffic (via the verifyTransportFromLogs check in validators/performance), so a silent fallback to Socket or NET cannot masquerade as a pass, a failure mode the legacy nccl-all-reduce-bw check cannot distinguish.

Thresholds are deliberately conservative (NET >= 40 GB/s, NVLS >= 500 GB/s), sized for a 2-node GB200 pair. They catch clear misconfigurations today and will be raised once production NVL72 data is available.

Merge behavior: ValidationPhase replaces rather than merges, so this block replaces the inherited nccl-all-reduce-bw >= 720 from gb200-any-training on GB200/EKS recipes only. Non-EKS GB200 (OKE, etc.) and non-GB200 accelerators keep the legacy entry unchanged.

Verified by resolving recipes for gb200/eks (NET+NVLS), gb200/oke (legacy 720), and h100/eks (legacy 300).
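
The idea behind asserting the transport from logs can be sketched in Python as follows. This is not the `verifyTransportFromLogs` implementation, which this PR does not show. The regular expressions and sample log lines are illustrative assumptions, since the exact NCCL log wording depends on the NCCL and network plugin versions in use.

```python
import re

# Hypothetical markers; adjust to the log format your NCCL build emits.
TRANSPORT_MARKERS = {
    "net": re.compile(r"via NET/(?!Socket)"),   # any NET provider except Socket
    "nvls": re.compile(r"\bNVLS\b"),
    "socket": re.compile(r"via NET/Socket"),
}


def transports_seen(log_text: str) -> set[str]:
    """Return the set of transport names whose marker appears in the log."""
    return {name for name, pattern in TRANSPORT_MARKERS.items()
            if pattern.search(log_text)}


def verify_transport(log_text: str, expected: str) -> bool:
    """Pass only if the expected transport appears and no Socket fallback does."""
    seen = transports_seen(log_text)
    return expected in seen and "socket" not in seen


# Illustrative log lines, not captured output.
efa_log = "NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/Libfabric/0/GDRDMA\n"
socket_log = "NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/Socket/0\n"

print(verify_transport(efa_log, "net"))
print(verify_transport(socket_log, "net"))
```

A bandwidth number alone cannot tell these two runs apart if the threshold is low enough, which is why a log-based assertion complements the `>=` constraint.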

…inference

NCCL all-reduce-bw-net / -nvls measure fabric health (EFA inter-node + MNNVL intra-NVL72), not anything training-specific. Multi-node inference on GB200/EKS (tensor-parallel serving for large models, MoE expert parallelism) crosses the same fabrics as training all-reduce and has the same NVreg_GrdmaPciTopoCheckOverride=1 dma-buf attach requirement.

Tried an extraction into a gb200-eks-gpuops mixin first, but the mixin system is strictly additive: a mixin can only introduce new componentRef names, not extend one already defined in the inheritance chain (and eks-training / eks-inference both declare gpu-operator with a valuesFile). Falling back to per-leaf duplication with "keep in sync" comments: 34 added lines on the inference side, 0 meaningful change on training.

Changes:
- gb200-eks-inference.yaml: gpu-operator componentRef gains the same kernel-module-params manifestFile + driver.kernelModuleConfig.name + driver.version:580.126.20 + cdi/gdrcopy overrides that landed for training in c162888/3c32e9ed. Also adds the nccl-all-reduce-bw-net (>=40) and -nvls (>=500) performance constraints.
- gb200-eks-training.yaml: comment updated to flag the training/inference sync relationship; content unchanged.
- docs/user/validation.md: documents all three NCCL variants in a table with platform→variant selection rules, replacing the single-variant description. Closes the "docs/user/validation.md still only documents nccl-all-reduce-bw" follow-up now that an overlay adopts the variants.

Verified via `aicr query`:
- eks/gb200/training and eks/gb200/inference both hydrate driver.version=580.126.20 and kernelModuleConfig.name=nvidia-kernel-module-params.
- Both carry nccl-all-reduce-bw-net/-nvls under validation.performance.constraints.
- oke/gb200 and eks/h100 still hydrate driver.version=580.105.08 (the global default), with no collateral impact.

coderabbitai · 2026-04-24T14:25:26Z

Walkthrough

The changes update NCCL all-reduce bandwidth validation documentation and configuration for GB200 EKS deployments. Documentation in validation.md is revised to describe how validation checks are selected by recipe and platform fabric rather than using a single default approach. Two recipe files—gb200-eks-inference.yaml and gb200-eks-training.yaml—are updated with GPU operator kernel module parameter configuration and new performance validation sections that specify NCCL all-reduce checks over NET and NVLS transports with corresponding minimum bandwidth thresholds.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main changes: GB200 EKS driver version bump and adoption of NET/NVLS NCCL validation checks, matching the core modifications in both recipe overlays and documentation.
Description check	✅ Passed	The description provides detailed context about wiring GB200/EKS overlays for kernel module flag self-fulfillment, driver version pinning, and transport-specific NCCL checks, with testing results and rollout guidance directly related to the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/user/validation.md`:
- Around line 43-57: Update the opening sentence to state that the NCCL
all-reduce benchmark and its three check variants apply to both training and
inference recipes (not just training), and ensure the subsequent sentence
explicitly notes that GB200/EKS recipes for both the "training" and "inference"
intents enable the `-net` and `-nvls` variants together; reference the check
names `nccl-all-reduce-bw`, `nccl-all-reduce-bw-net`, and
`nccl-all-reduce-bw-nvls` and keep the existing table and explanatory sentences
but change wording where needed so the scope clearly covers both intents.

📥 Commits

Reviewing files that changed from the base of the PR and between 4e158cf and 84d5746.

📒 Files selected for processing (3)

docs/user/validation.md
recipes/overlays/gb200-eks-inference.yaml
recipes/overlays/gb200-eks-training.yaml

njhensley added 4 commits April 24, 2026 07:11

njhensley requested review from a team as code owners April 24, 2026 14:19

github-actions Bot added area/recipes area/docs size/M labels Apr 24, 2026

njhensley requested a review from mchmarny April 24, 2026 14:19

njhensley self-assigned this Apr 24, 2026

coderabbitai Bot reviewed Apr 24, 2026

Comment thread docs/user/validation.md

mchmarny approved these changes Apr 24, 2026

mchmarny merged commit 306b785 into NVIDIA:main Apr 24, 2026
72 checks passed

lockwobr pushed a commit that referenced this pull request Apr 28, 2026

feat: GB200 EKS NET/NVLS NCCL validation and driver bump (#668)

8bc7794

njhensley mentioned this pull request May 12, 2026

bundler: manifestFiles always emitted as -post, breaking prereq ConfigMaps (GB200/EKS deploy hangs) #859

Closed

mchmarny mentioned this pull request May 13, 2026

fix(recipes): migrate GB200 kernel-module-params to preManifestFiles #868

Merged

25 tasks

njhensley deleted the feat/gb200-eks-adoption-and-driver-bump branch June 23, 2026 16:21
