feat: GB200 EKS NET/NVLS NCCL validation and driver bump#668
Conversation
Wire the existing kernel-module-params ConfigMap template into the GB200/EKS overlay and point gpu-operator ClusterPolicy at it via driver.kernelModuleConfig.name. The NVIDIA driver DaemonSet now mounts nvidia.conf at load time and the kernel comes up with the flag set, which is required on GB200+EFA for EFA dma-buf attach to the Grace PCI topology. Without the flag, NCCL silently falls back to the Socket transport. The existing NVreg preflight check stays in place as a belt-and-suspenders guard: it keeps its actionable error message for operators who disable the override at a higher layer or ship a cluster with a different module config. Scope: GB200/EKS only. The PCIe-topology issue is EKS+EFA specific; OKE, GKE, and AKS GB200 overlays are unaffected. Verified by bundling eks/gb200/ubuntu/training and inspecting gpu-operator/manifests/kernel-module-params.yaml + values.yaml; h100/eks bundle does NOT render the ConfigMap.
…floor) Override gpu-operator.driver.version at the GB200/EKS overlay layer so GB200+EFA recipes ship with the NVIDIA-recommended driver floor while H100/B200 and non-EKS GB200 stay on the global 580.105.08 default in components/gpu-operator/values.yaml. Narrower blast radius than a global bump: the version recommendation is specific to GB200+EFA dma-buf topology on EKS, and Skyhook compatibility already diverges between accelerators (see the GB200 no-op comment in this same overlay). Verified with aicr query --selector components.gpu-operator.values.driver.version: gb200/eks -> 580.126.20 h100/eks -> 580.105.08 (unchanged) gb200/oke -> 580.105.08 (unchanged)
…raints Default GB200/EKS training recipes to the two transport-specific NCCL variants introduced earlier on this branch series. The validator Catalog entries already exist; no overlay referenced them until now. NET exercises EFA and NVLS exercises MNNVL across the NVL72 IMEX domain. Each variant asserts its transport actually carried traffic (via the verifyTransportFromLogs check in validators/performance), so a silent fallback to Socket or NET cannot masquerade as a pass — a failure mode the legacy nccl-all-reduce-bw check cannot distinguish. Thresholds are deliberately conservative (NET >= 40 GB/s, NVLS >= 500 GB/s), sized for a 2-node GB200 pair. They catch clear misconfigurations today and will be raised once production NVL72 data is available. Merge behavior: ValidationPhase replaces rather than merges, so this block replaces the inherited nccl-all-reduce-bw >= 720 from gb200-any-training on GB200/EKS recipes only. Non-EKS GB200 (OKE, etc.) and non-GB200 accelerators keep the legacy entry unchanged. Verified by resolving recipes for gb200/eks (NET+NVLS), gb200/oke (legacy 720), and h100/eks (legacy 300).
…inference NCCL all-reduce-bw-net / -nvls measure fabric health (EFA inter-node + MNNVL intra-NVL72), not anything training-specific. Multi-node inference on GB200/EKS — tensor-parallel serving for large models, MoE expert parallelism — crosses the same fabrics as training all-reduce and has the same NVreg_GrdmaPciTopoCheckOverride=1 dma-buf attach requirement. Tried an extraction into a gb200-eks-gpuops mixin first, but the mixin system is strictly additive: a mixin can only introduce new componentRef names, not extend one already defined in the inheritance chain (and eks-training / eks-inference both declare gpu-operator with a valuesFile). Falling back to per-leaf duplication with "keep in sync" comments — 34 added lines on the inference side, 0 meaningful change on training. Changes: - gb200-eks-inference.yaml: gpu-operator componentRef gains the same kernel-module-params manifestFile + driver.kernelModuleConfig.name + driver.version:580.126.20 + cdi/gdrcopy overrides that landed for training in c162888/3c32e9ed. Also adds the nccl-all-reduce-bw-net (>=40) and -nvls (>=500) performance constraints. - gb200-eks-training.yaml: comment updated to flag the training/inference sync relationship; content unchanged. - docs/user/validation.md: documents all three NCCL variants in a table with platform→variant selection rules, replacing the single-variant description. Closes the "docs/user/validation.md still only documents nccl-all-reduce-bw" follow-up now that an overlay adopts the variants. Verified via `aicr query`: - eks/gb200/training and eks/gb200/inference both hydrate driver.version=580.126.20 and kernelModuleConfig.name= nvidia-kernel-module-params. - Both carry nccl-all-reduce-bw-net/-nvls under validation.performance.constraints. - oke/gb200 and eks/h100 still hydrate driver.version=580.105.08 (the global default) — no collateral impact.
📝 WalkthroughWalkthroughThe changes update NCCL all-reduce bandwidth validation documentation and configuration for GB200 EKS deployments. Documentation in Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/user/validation.md`:
- Around line 43-57: Update the opening sentence to state that the NCCL
all-reduce benchmark and its three check variants apply to both training and
inference recipes (not just training), and ensure the subsequent sentence
explicitly notes that GB200/EKS recipes for both the "training" and "inference"
intents enable the `-net` and `-nvls` variants together; reference the check
names `nccl-all-reduce-bw`, `nccl-all-reduce-bw-net`, and
`nccl-all-reduce-bw-nvls` and keep the existing table and explanatory sentences
but change wording where needed so the scope clearly covers both intents.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 52bb8e9e-e13c-45ab-b571-7b68db16fde5
📒 Files selected for processing (3)
docs/user/validation.mdrecipes/overlays/gb200-eks-inference.yamlrecipes/overlays/gb200-eks-training.yaml
Summary
Wire the GB200/EKS overlays (
trainingandinference) to self-fulfill theNVreg_GrdmaPciTopoCheckOverride=1kernel module flag, pin the driver to the NVIDIA-recommended floor for GB200+EFA (580.126.20), and adopt the transport-specificnccl-all-reduce-bw-net/-nvlsperformance checks that were introduced in #640.Motivation / Context
After #640 landed, the NET + NVLS NCCL variants existed in the validator catalog but no shipped overlay referenced them — GB200/EKS users still ran the legacy auto-detect
nccl-all-reduce-bwand got no transport-specific signal. Separately, GB200/EKS has two operator-side prerequisites that the existing overlay left as manual configuration:NVreg_GrdmaPciTopoCheckOverride=1— required on the NVIDIA driver so EFA can attach dma-buf to the Grace PCI topology. Without it, NCCL silently falls back to Socket on NET.This PR closes both gaps declaratively in the overlay, so GB200/EKS recipes come up correctly out of the box, and both the training and inference intents assert the two fabrics (EFA + MNNVL) actually carry traffic.
Fixes: N/A
Related: #640 (validator-side NET/NVLS implementation)
Type of Change
Component(s) Affected
pkg/recipe)docs/,examples/)Implementation Notes
Four commits, each independently revertable:
feat(recipes/gb200-eks): self-fulfill NVreg_GrdmaPciTopoCheckOverride=1— AddsmanifestFiles: [components/gpu-operator/manifests/kernel-module-params.yaml](existing template, already in the component directory) and pointsClusterPolicy.spec.driver.kernelModuleConfig.nameat the resulting ConfigMap. The NVIDIA driver DaemonSet then mountsnvidia.confat load time and the kernel comes up with the flag already set. The existing NVreg preflight check stays in place as a belt-and-suspenders guard for operators who override the module config at a higher layer.feat(recipes/gb200-eks): bump driver to 580.126.20— Overridegpu-operator.driver.versionat the GB200/EKS overlay layer only. H100/B200 and non-EKS GB200 (OKE, GKE, AKS) keep the global 580.105.08 default fromcomponents/gpu-operator/values.yaml. The version recommendation is GB200+EFA-specific on EKS; narrower blast radius than a global bump.feat(recipes/gb200-eks): adopt nccl-all-reduce-bw-net and -nvls constraints— Replaces the inheritednccl-all-reduce-bw >= 720fromgb200-any-trainingwithnccl-all-reduce-bw-net >= 40andnccl-all-reduce-bw-nvls >= 500ongb200-eks-training.ValidationPhasereplaces rather than merges, so this is a clean swap on GB200/EKS recipes only; non-EKS GB200 and non-GB200 accelerators keep the legacy entry unchanged. Thresholds are deliberately conservative, sized for a 2-node GB200 pair — will be raised once production NVL72 data is available.feat(recipes/gb200-eks): extend NCCL variants + NVreg fulfillment to inference— Mirrors the above intogb200-eks-inference.yaml. NCCL all-reduce-bw is fabric-health, not training-specific: multi-node inference (tensor-parallel serving, MoE expert parallelism) crosses the same EFA + MNNVL fabrics and has the same dma-buf attach requirement. Also updatesdocs/user/validation.mdwith a 3-variant table documenting when each check is selected.Mixin alternative, rejected. Initially tried extracting the shared GB200/EKS GPU-operator block into a mixin, but the mixin system is strictly additive — it can introduce new
componentRefnames but cannot extend acomponentRefalready declared upstream in the inheritance chain (gpu-operatorcomes fromeks-training/eks-inference). Per-leaf duplication with a "keep in sync" comment is the pragmatic choice until the mixin system gains extension semantics; concretely that's ~10 duplicated lines across two files vs. a much larger refactor.Testing
End-to-end on real GB200/EKS hardware (EKS 1.34, Ubuntu 24.04, 2×
p6e-gb200.36xlarge, ASG-terminated before redeploy for a clean-boot driver rollout):nccl-all-reduce-bw-netnccl-all-reduce-bw-nvlsOn-cluster verification of the NVreg self-fulfillment:
nvidia-kernel-module-paramsConfigMap present ingpu-operatornamespace withoptions nvidia NVreg_GrdmaPciTopoCheckOverride=1nvcr.io/nvidia/driver:580.126.20-ubuntu24.04; ConfigMap mounted at/drivers/nvidia.conf/proc/driver/nvidia/paramson both GB200 nodes reportsGrdmaPciTopoCheckOverride: 1— flag is live in the loaded kernel module, not just declaredRisk Assessment
Rollout notes:
aicr queryagainst OKE/H100 recipes showing no drift.nvidia-driverDaemonSet on GB200/EKS clusters that regenerate their bundle; plan this alongside a rolling-replace of GPU nodes (ASG terminate with--no-should-decrement-desired-capacity) so the new driver lands on clean-boot replacements rather than reinstalling over running kernel state.Checklist
make testwith-race)make lint—golangci-lint run ./...→ 0 issues)docs/user/validation.md3-variant table)git commit -S)