
feat: GB200 EKS NET/NVLS NCCL validation and driver bump #668

Merged: mchmarny merged 4 commits into NVIDIA:main from njhensley:feat/gb200-eks-adoption-and-driver-bump on Apr 24, 2026


@njhensley


Summary

Wire the GB200/EKS overlays (training and inference) to self-fulfill the NVreg_GrdmaPciTopoCheckOverride=1 kernel module flag, pin the driver to the NVIDIA-recommended floor for GB200+EFA (580.126.20), and adopt the transport-specific nccl-all-reduce-bw-net / -nvls performance checks that were introduced in #640.

Motivation / Context

After #640 landed, the NET + NVLS NCCL variants existed in the validator catalog but no shipped overlay referenced them — GB200/EKS users still ran the legacy auto-detect nccl-all-reduce-bw and got no transport-specific signal. Separately, GB200/EKS has two operator-side prerequisites that the existing overlay left as manual configuration:

  1. NVreg_GrdmaPciTopoCheckOverride=1 — required on the NVIDIA driver so EFA can attach dma-buf to the Grace PCI topology. Without it, NCCL silently falls back to Socket on NET.
  2. Driver version floor — NVIDIA's recommended driver for GB200+EFA is 580.126.20; the repo-wide default stayed on 580.105.08 for H100/B200.

This PR closes both gaps declaratively in the overlay, so GB200/EKS recipes come up correctly out of the box, and both the training and inference intents assert the two fabrics (EFA + MNNVL) actually carry traffic.
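
The idea of asserting that a fabric actually carried traffic can be sketched as a scan of the NCCL debug log for a required transport marker and for the absence of a fallback marker. This is only an illustration, not the validator's `verifyTransportFromLogs` implementation: the function name, the marker patterns, and the sample log lines are assumptions, and real NCCL / aws-ofi-nccl log text varies by version.

```python
import re

# Assumed markers for the NET (EFA) variant. These are illustrative guesses at
# NCCL_DEBUG=INFO output, not the patterns the validator actually uses.
NET_REQUIRED = [r"NET/OFI"]
NET_FORBIDDEN = [r"NET/Socket"]


def transport_carried_traffic(log_text: str, required: list[str], forbidden: list[str]) -> bool:
    """Return True only if every required marker appears and no forbidden one does."""
    has_required = all(re.search(pattern, log_text) for pattern in required)
    has_forbidden = any(re.search(pattern, log_text) for pattern in forbidden)
    return has_required and not has_forbidden


# Hand-written sample lines for illustration, not captured output.
efa_log = "node-a:1:1 [0] NCCL INFO NET/OFI Selected Provider is efa\n"
socket_log = "node-a:1:1 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.1<0>\n"

print(transport_carried_traffic(efa_log, NET_REQUIRED, NET_FORBIDDEN))
print(transport_carried_traffic(socket_log, NET_REQUIRED, NET_FORBIDDEN))
```

A bandwidth threshold alone cannot make this distinction, which is why a log-based check of this kind complements the numeric floor.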

Fixes: N/A
Related: #640 (validator-side NET/NVLS implementation)

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Documentation update

Component(s) Affected

  • Recipe engine / data (pkg/recipe)
  • Docs/examples (docs/, examples/)

Implementation Notes

Four commits, each independently revertable:

  1. feat(recipes/gb200-eks): self-fulfill NVreg_GrdmaPciTopoCheckOverride=1 — Adds manifestFiles: [components/gpu-operator/manifests/kernel-module-params.yaml] (existing template, already in the component directory) and points ClusterPolicy.spec.driver.kernelModuleConfig.name at the resulting ConfigMap. The NVIDIA driver DaemonSet then mounts nvidia.conf at load time and the kernel comes up with the flag already set. The existing NVreg preflight check stays in place as a belt-and-suspenders guard for operators who override the module config at a higher layer.

  2. feat(recipes/gb200-eks): bump driver to 580.126.20 — Override gpu-operator.driver.version at the GB200/EKS overlay layer only. H100/B200 and non-EKS GB200 (OKE, GKE, AKS) keep the global 580.105.08 default from components/gpu-operator/values.yaml. The version recommendation is GB200+EFA-specific on EKS; narrower blast radius than a global bump.

  3. feat(recipes/gb200-eks): adopt nccl-all-reduce-bw-net and -nvls constraints — Replaces the inherited nccl-all-reduce-bw >= 720 from gb200-any-training with nccl-all-reduce-bw-net >= 40 and nccl-all-reduce-bw-nvls >= 500 on gb200-eks-training. ValidationPhase replaces rather than merges, so this is a clean swap on GB200/EKS recipes only; non-EKS GB200 and non-GB200 accelerators keep the legacy entry unchanged. Thresholds are deliberately conservative, sized for a 2-node GB200 pair — will be raised once production NVL72 data is available.

  4. feat(recipes/gb200-eks): extend NCCL variants + NVreg fulfillment to inference — Mirrors the above into gb200-eks-inference.yaml. NCCL all-reduce-bw is fabric-health, not training-specific: multi-node inference (tensor-parallel serving, MoE expert parallelism) crosses the same EFA + MNNVL fabrics and has the same dma-buf attach requirement. Also updates docs/user/validation.md with a 3-variant table documenting when each check is selected.
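
The replace-rather-than-merge behavior in item 3 can be illustrated with a small sketch. This is not the recipe engine's code: `resolve_phase` and the dictionary layout are hypothetical, and only the check names and thresholds are taken from this PR.

```python
from typing import Optional

# Hypothetical in-memory shape for a validation phase: check name -> minimum GB/s.
Phase = dict[str, float]


def resolve_phase(chain: list[Optional[Phase]]) -> Phase:
    """Return the phase from the last overlay in the chain that declares one.

    Replace semantics: a later declaration discards the inherited block
    entirely instead of merging its keys into it.
    """
    resolved: Phase = {}
    for phase in chain:
        if phase is not None:
            resolved = dict(phase)  # replace, do not update()
    return resolved


gb200_any_training = {"nccl-all-reduce-bw": 720.0}
gb200_eks_training = {"nccl-all-reduce-bw-net": 40.0, "nccl-all-reduce-bw-nvls": 500.0}

eks = resolve_phase([gb200_any_training, gb200_eks_training])
oke = resolve_phase([gb200_any_training, None])  # this overlay declares no phase

print(eks)  # only the NET and NVLS entries survive
print(oke)  # the inherited legacy entry is kept
```

Under merge semantics the EKS result would still carry the legacy `nccl-all-reduce-bw >= 720` entry, which is exactly what the swap avoids.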

Mixin alternative, rejected. Initially tried extracting the shared GB200/EKS GPU-operator block into a mixin, but the mixin system is strictly additive — it can introduce new componentRef names but cannot extend a componentRef already declared upstream in the inheritance chain (gpu-operator comes from eks-training / eks-inference). Per-leaf duplication with a "keep in sync" comment is the pragmatic choice until the mixin system gains extension semantics; concretely that's ~10 duplicated lines across two files vs. a much larger refactor.
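
A minimal sketch of the additive constraint, under stated assumptions: `apply_mixin` and the dictionary shapes are hypothetical, the `valuesFile` value is a placeholder, and whether the real engine rejects or silently ignores a clashing componentRef is not stated in this PR (the sketch raises).

```python
def apply_mixin(inherited: dict[str, dict], mixin: dict[str, dict]) -> dict[str, dict]:
    """Apply a strictly additive mixin: it may only add new componentRef names."""
    clashes = sorted(set(inherited) & set(mixin))
    if clashes:
        raise ValueError(f"mixin cannot extend existing componentRef(s): {clashes}")
    return {**inherited, **mixin}


# gpu-operator is already declared upstream (eks-training / eks-inference).
inherited = {"gpu-operator": {"valuesFile": "values.yaml"}}  # placeholder value

# The attempted mixin tries to extend that same componentRef.
attempted = {
    "gpu-operator": {
        "manifestFiles": ["components/gpu-operator/manifests/kernel-module-params.yaml"]
    }
}

try:
    apply_mixin(inherited, attempted)
except ValueError as err:
    print(f"rejected: {err}")
```

Because the extension has to land on an existing componentRef, the overlay leaves carry the block directly instead.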

Testing

# Static verification
make qualify                                    # passes, 0 lint issues

# Recipe hydration
aicr query --service eks --accelerator gb200 --intent training --os ubuntu --platform kubeflow \
  --selector components.gpu-operator.values.driver            # → version 580.126.20, kernelModuleConfig.name nvidia-kernel-module-params
aicr query --service eks --accelerator gb200 --intent inference --os ubuntu \
  --selector components.gpu-operator.values.driver            # → same
aicr query --service oke --accelerator gb200 --intent training --os ubuntu \
  --selector components.gpu-operator.values.driver.version    # → 580.105.08 (unchanged)
aicr query --service eks --accelerator h100 --intent training --os ubuntu \
  --selector components.gpu-operator.values.driver.version    # → 580.105.08 (unchanged)
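
For readers unfamiliar with the `--selector` paths above, the following sketch shows how a dotted path could be resolved against a hydrated recipe. It is an assumption about the behavior, not the `aicr` implementation: `select` and the `recipe` dictionary are hypothetical, and only the path segments and values come from the expected results noted in the commands.

```python
from typing import Any


def select(document: dict[str, Any], selector: str) -> Any:
    """Walk a nested dict along a dot-separated path and return the value found."""
    node: Any = document
    for key in selector.split("."):
        if not isinstance(node, dict) or key not in node:
            raise KeyError(f"selector {selector!r} has no key {key!r}")
        node = node[key]
    return node


# Hypothetical hydrated recipe for eks/gb200/training, reduced to the driver block.
recipe = {
    "components": {
        "gpu-operator": {
            "values": {
                "driver": {
                    "version": "580.126.20",
                    "kernelModuleConfig": {"name": "nvidia-kernel-module-params"},
                }
            }
        }
    }
}

print(select(recipe, "components.gpu-operator.values.driver.version"))
print(select(recipe, "components.gpu-operator.values.driver"))
```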

End-to-end on real GB200/EKS hardware (EKS 1.34, Ubuntu 24.04, 2× p6e-gb200.36xlarge, ASG-terminated before redeploy for a clean-boot driver rollout):

| Check | Measured | Threshold |
| --- | --- | --- |
| nccl-all-reduce-bw-net | 329.59 GB/s | ≥ 40 |
| nccl-all-reduce-bw-nvls | 841.49 GB/s | ≥ 500 |
| 8 conformance checks | all pass | |
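
The pass/fail arithmetic behind these results is a simple floor comparison. The sketch below uses the measured values and thresholds reported here; `evaluate` is a hypothetical helper, not the validator's code.

```python
# Values reported in this PR, in GB/s.
measured = {"nccl-all-reduce-bw-net": 329.59, "nccl-all-reduce-bw-nvls": 841.49}
thresholds = {"nccl-all-reduce-bw-net": 40.0, "nccl-all-reduce-bw-nvls": 500.0}


def evaluate(measured: dict[str, float], thresholds: dict[str, float]) -> dict[str, tuple[bool, float]]:
    """Return (passed, headroom ratio) per check, where headroom = measured / threshold."""
    results = {}
    for check, floor in thresholds.items():
        value = measured[check]
        results[check] = (value >= floor, value / floor)
    return results


results = evaluate(measured, thresholds)
for check, (passed, headroom) in results.items():
    print(f"{check}: passed={passed} headroom={headroom:.2f}x")
```

On this 2-node pair the NET result sits at roughly 8.2x its floor and NVLS at roughly 1.7x, which matches the description of the thresholds as deliberately conservative.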

On-cluster verification of the NVreg self-fulfillment:

  • nvidia-kernel-module-params ConfigMap present in gpu-operator namespace with options nvidia NVreg_GrdmaPciTopoCheckOverride=1
  • Driver DaemonSet image: nvcr.io/nvidia/driver:580.126.20-ubuntu24.04; ConfigMap mounted at /drivers/nvidia.conf
  • /proc/driver/nvidia/params on both GB200 nodes reports GrdmaPciTopoCheckOverride: 1 — flag is live in the loaded kernel module, not just declared
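
The last check can be scripted. A minimal sketch, assuming `/proc/driver/nvidia/params` lists one `Name: value` pair per line as reported above; the helper names and the sample text are illustrative, not captured from the cluster.

```python
from pathlib import Path


def parse_nvidia_params(text: str) -> dict[str, str]:
    """Parse 'Name: value' lines into a dict, skipping lines without a colon."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            params[key.strip()] = value.strip()
    return params


def override_enabled(text: str) -> bool:
    """True if the loaded module reports GrdmaPciTopoCheckOverride as 1."""
    return parse_nvidia_params(text).get("GrdmaPciTopoCheckOverride") == "1"


def check_node(path: str = "/proc/driver/nvidia/params") -> bool:
    """Read the params file on a GPU node and report whether the flag is live."""
    return override_enabled(Path(path).read_text())


# Illustrative sample; a real file contains many more parameters.
SAMPLE = "EnableMSI: 1\nGrdmaPciTopoCheckOverride: 1\n"
print(override_enabled(SAMPLE))
```

Running `check_node()` on each GB200 node confirms the flag in the loaded module, as opposed to only confirming that the ConfigMap declares it.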

Risk Assessment

  • Low — Isolated to two GB200/EKS overlays + one user-doc page. No CLI, API, or validator-engine code changes.

Rollout notes:

  • The three behavior changes (driver 580.126.20, NVreg ConfigMap, NCCL variants) are scoped to GB200/EKS via overlay path, verified by aicr query against OKE/H100 recipes showing no drift.
  • The driver bump redeploys the nvidia-driver DaemonSet on GB200/EKS clusters that regenerate their bundle; plan this alongside a rolling-replace of GPU nodes (ASG terminate with --no-should-decrement-desired-capacity) so the new driver lands on clean-boot replacements rather than reinstalling over running kernel state.
  • No migration needed for non-GB200 or non-EKS recipes.
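
The ASG terminate step in the rollout notes could be scripted along these lines. This is a sketch under assumptions: the helper names and instance IDs are placeholders, it defaults to a dry run, and it omits the wait for each replacement node to become ready that a real rolling replace needs between terminations.

```python
import subprocess


def terminate_command(instance_id: str) -> list[str]:
    """Build the AWS CLI call that terminates one instance without shrinking the ASG."""
    return [
        "aws", "autoscaling", "terminate-instance-in-auto-scaling-group",
        "--instance-id", instance_id,
        "--no-should-decrement-desired-capacity",
    ]


def rolling_replace(instance_ids: list[str], dry_run: bool = True) -> list[list[str]]:
    """Print (dry run) or execute one terminate call per instance, in order."""
    commands = [terminate_command(instance_id) for instance_id in instance_ids]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
    return commands


# Placeholder instance IDs for illustration only.
planned = rolling_replace(["i-0123456789abcdef0", "i-0fedcba9876543210"])
```

Keeping the desired capacity unchanged is what makes the ASG launch a clean-boot replacement for each terminated node.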

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint / golangci-lint run ./... → 0 issues)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality (validator-side tests for the new NCCL variants landed in #640, "feat(performance): add GB200 EKS support for NCCL all-reduce bandwidth check"; this PR is recipe/doc only)
  • I updated docs if user-facing behavior changed (docs/user/validation.md 3-variant table)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

Wire the existing kernel-module-params ConfigMap template into the GB200/EKS
overlay and point gpu-operator ClusterPolicy at it via
driver.kernelModuleConfig.name. The NVIDIA driver DaemonSet now mounts
nvidia.conf at load time and the kernel comes up with the flag set, which is
required on GB200+EFA for EFA dma-buf attach to the Grace PCI topology.
Without the flag, NCCL silently falls back to the Socket transport.

The existing NVreg preflight check stays in place as a belt-and-suspenders
guard: it keeps its actionable error message for operators who disable the
override at a higher layer or ship a cluster with a different module config.

Scope: GB200/EKS only. The PCIe-topology issue is EKS+EFA specific; OKE,
GKE, and AKS GB200 overlays are unaffected.

Verified by bundling eks/gb200/ubuntu/training and inspecting
gpu-operator/manifests/kernel-module-params.yaml + values.yaml; h100/eks
bundle does NOT render the ConfigMap.
…floor)

Override gpu-operator.driver.version at the GB200/EKS overlay layer so
GB200+EFA recipes ship with the NVIDIA-recommended driver floor while
H100/B200 and non-EKS GB200 stay on the global 580.105.08 default in
components/gpu-operator/values.yaml.

Narrower blast radius than a global bump: the version recommendation is
specific to GB200+EFA dma-buf topology on EKS, and Skyhook compatibility
already diverges between accelerators (see the GB200 no-op comment in this
same overlay).

Verified with aicr query --selector components.gpu-operator.values.driver.version:
  gb200/eks -> 580.126.20
  h100/eks  -> 580.105.08 (unchanged)
  gb200/oke -> 580.105.08 (unchanged)
…raints

Default GB200/EKS training recipes to the two transport-specific NCCL
variants introduced earlier on this branch series. The validator Catalog
entries already exist; no overlay referenced them until now.

NET exercises EFA and NVLS exercises MNNVL across the NVL72 IMEX domain.
Each variant asserts its transport actually carried traffic (via the
verifyTransportFromLogs check in validators/performance), so a silent
fallback to Socket or NET cannot masquerade as a pass — a failure mode the
legacy nccl-all-reduce-bw check cannot distinguish.

Thresholds are deliberately conservative (NET >= 40 GB/s, NVLS >= 500 GB/s),
sized for a 2-node GB200 pair. They catch clear misconfigurations today and
will be raised once production NVL72 data is available.

Merge behavior: ValidationPhase replaces rather than merges, so this block
replaces the inherited nccl-all-reduce-bw >= 720 from gb200-any-training on
GB200/EKS recipes only. Non-EKS GB200 (OKE, etc.) and non-GB200 accelerators
keep the legacy entry unchanged.

Verified by resolving recipes for gb200/eks (NET+NVLS), gb200/oke (legacy
720), and h100/eks (legacy 300).
…inference

NCCL all-reduce-bw-net / -nvls measure fabric health (EFA inter-node +
MNNVL intra-NVL72), not anything training-specific. Multi-node inference
on GB200/EKS — tensor-parallel serving for large models, MoE expert
parallelism — crosses the same fabrics as training all-reduce and has
the same NVreg_GrdmaPciTopoCheckOverride=1 dma-buf attach requirement.

Tried an extraction into a gb200-eks-gpuops mixin first, but the mixin
system is strictly additive: a mixin can only introduce new componentRef
names, not extend one already defined in the inheritance chain (and
eks-training / eks-inference both declare gpu-operator with a valuesFile).
Falling back to per-leaf duplication with "keep in sync" comments — 34
added lines on the inference side, 0 meaningful change on training.

Changes:
 - gb200-eks-inference.yaml: gpu-operator componentRef gains the same
   kernel-module-params manifestFile + driver.kernelModuleConfig.name +
   driver.version:580.126.20 + cdi/gdrcopy overrides that landed for
   training in c162888/3c32e9ed. Also adds the nccl-all-reduce-bw-net
   (>=40) and -nvls (>=500) performance constraints.
 - gb200-eks-training.yaml: comment updated to flag the training/inference
   sync relationship; content unchanged.
 - docs/user/validation.md: documents all three NCCL variants in a table
   with platform→variant selection rules, replacing the single-variant
   description. Closes the "docs/user/validation.md still only documents
   nccl-all-reduce-bw" follow-up now that an overlay adopts the variants.

Verified via `aicr query`:
 - eks/gb200/training and eks/gb200/inference both hydrate
   driver.version=580.126.20 and kernelModuleConfig.name=
   nvidia-kernel-module-params.
 - Both carry nccl-all-reduce-bw-net/-nvls under
   validation.performance.constraints.
 - oke/gb200 and eks/h100 still hydrate driver.version=580.105.08
   (the global default) — no collateral impact.
@coderabbitai

coderabbitai Bot commented Apr 24, 2026

Walkthrough

The changes update NCCL all-reduce bandwidth validation documentation and configuration for GB200 EKS deployments. Documentation in validation.md is revised to describe how validation checks are selected by recipe and platform fabric rather than using a single default approach. Two recipe files—gb200-eks-inference.yaml and gb200-eks-training.yaml—are updated with GPU operator kernel module parameter configuration and new performance validation sections that specify NCCL all-reduce checks over NET and NVLS transports with corresponding minimum bandwidth thresholds.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main changes: GB200 EKS driver version bump and adoption of NET/NVLS NCCL validation checks, matching the core modifications in both recipe overlays and documentation. |
| Description check | ✅ Passed | The description provides detailed context about wiring GB200/EKS overlays for kernel module flag self-fulfillment, driver version pinning, and transport-specific NCCL checks, with testing results and rollout guidance directly related to the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/user/validation.md`:
- Around line 43-57: Update the opening sentence to state that the NCCL
all-reduce benchmark and its three check variants apply to both training and
inference recipes (not just training), and ensure the subsequent sentence
explicitly notes that GB200/EKS recipes for both the "training" and "inference"
intents enable the `-net` and `-nvls` variants together; reference the check
names `nccl-all-reduce-bw`, `nccl-all-reduce-bw-net`, and
`nccl-all-reduce-bw-nvls` and keep the existing table and explanatory sentences
but change wording where needed so the scope clearly covers both intents.


📥 Commits

Reviewing files that changed from the base of the PR and between 4e158cf and 84d5746.

📒 Files selected for processing (3)
  • docs/user/validation.md
  • recipes/overlays/gb200-eks-inference.yaml
  • recipes/overlays/gb200-eks-training.yaml

Comment thread docs/user/validation.md
@mchmarny mchmarny merged commit 306b785 into NVIDIA:main Apr 24, 2026
72 checks passed
@njhensley njhensley deleted the feat/gb200-eks-adoption-and-driver-bump branch June 23, 2026 16:21