**Dual 200G QSFP Setup Achieving Only 208 Gbps - GPU Direct RDMA Issues on ARM64**

**TL;DR:** It appears we can get a 400G aggregate link by connecting two DGX Spark systems with two cables. Can GPU Direct RDMA be enabled in the next firmware update so we can get closer to the full 400 Gbps (50 GB/s) aggregate bandwidth?
This is as close as I got:
Environment Details:
Hardware:
- 2x NVIDIA DGX Spark (Blackwell GB10 GPUs)
- 2x 200G QSFP56 passive DAC cables (MCP1650-V00AE30 + Q56-200G-CU0-5)
- NVIDIA ConnectX-7 200G NICs (mlx5_core driver)
Software:
- OS: Ubuntu 24.04 LTS (Noble)
- Kernel: 6.14.0-1013-nvidia (ARM64/aarch64)
- NVIDIA Driver: 580.95.05 (open-source nvidia driver)
- CUDA: 13.0
- NCCL: 2.28.3+cuda13.0 (built with Blackwell compute_121 support)
- InfiniBand: mlx5_ib, ib_core loaded and working
Current Setup
Successfully configured dual 200G cables between two DGX Spark systems:
- Cable 1: `enp1s0f0np0` - 192.168.100.x subnet
- Cable 2: `enp1s0f1np1` - 192.168.101.x subnet
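For reference, the per-cable addressing corresponds roughly to the following (a minimal sketch; the actual configuration method, e.g. netplan, may differ and the host addresses are illustrative):
`# Give each cable its own point-to-point /24 so traffic stays on its link`
`sudo ip addr add 192.168.100.1/24 dev enp1s0f0np0`
`sudo ip addr add 192.168.101.1/24 dev enp1s0f1np1`
`sudo ip link set enp1s0f0np0 up`
`sudo ip link set enp1s0f1np1 up`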
NCCL correctly detects and bonds both cables:
- NET/IB : Made virtual device [4] name=rocep1s0f0+rocep1s0f1 speed=400000 ndevs=2
- NET/IB : Made virtual device [5] name=roceP2p1s0f0+roceP2p1s0f1 speed=400000 ndevs=2
Performance Results
NCCL all_gather_perf test (8GB buffer):
- Achieved: 25.67 GB/s (~208 Gbps)
- Theoretical max: 50 GB/s (400 Gbps)
- Efficiency: ~52%
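For context, the number comes from the standard nccl-tests all_gather_perf binary; the two-node launch looked roughly like this (hostnames, binary path, and the MPI launcher are assumptions, one GPU per node):
`# 8 GB buffer, single size, one GPU per rank; spark1/spark2 are placeholder hostnames`
`mpirun -np 2 -H spark1,spark2 -x NCCL_IB_HCA=rocep1s0f0,rocep1s0f1 -x NCCL_IB_GID_INDEX=3 ./build/all_gather_perf -b 8G -e 8G -g 1`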
NCCL reports GPU Direct RDMA is disabled:
NET/IB : GPU Direct RDMA Disabled for HCA 0 'rocep1s0f0'
NET/IB : GPU Direct RDMA Disabled for HCA 1 'rocep1s0f1'
Problem: nvidia-peermem Module Won’t Load
The nvidia-peermem.ko module exists but fails to load:
$ sudo modprobe nvidia-peermem
modprobe: ERROR: could not insert 'nvidia_peermem': Invalid argument
Module information:
$ modinfo nvidia-peermem
filename:       /lib/modules/6.14.0-1013-nvidia/kernel/nvidia-580-open/nvidia-peermem.ko
version:        580.95.05
vermagic:       6.14.0-1013-nvidia SMP preempt mod_unload modversions aarch64
depends:
InfiniBand modules are loaded:
`$ lsmod | grep -E "ib_core|mlx5"`
`mlx5_ib 503808 0`
`ib_uverbs 200704 2 rdma_ucm,mlx5_ib`
`ib_core 524288 11 rdma_cm,ib_ipoib,rpcrdma,ib_srp,iw_cm,ib_iser,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm`
`mlx5_core 2678784 1 mlx5_ib`
Unfortunately, kernel logs don’t provide specific error details (dmesg access restricted).
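Two checks that might still surface the underlying cause (a sketch; `ib_register_peer_memory_client` is the MLNX_OFED-style peer-memory hook that nvidia-peermem has traditionally registered against):
`# Read the kernel ring buffer via journald when plain dmesg is restricted`
`sudo journalctl -k --since "1 hour ago" | grep -i -E "peermem|peer_mem|nvidia"`
`# Check whether the running ib_core exports the peer-memory client interface`
`sudo grep ib_register_peer_memory_client /proc/kallsyms`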
Questions
1. GPU Direct RDMA on ARM64 DGX Spark
Is GPU Direct RDMA supported on ARM64 architecture with DGX Spark? Should it work with the open-source NVIDIA driver (580.95.05)?
2. nvidia-peermem Module Loading
Why would nvidia-peermem.ko fail with “Invalid argument”?
Are there:
- Specific kernel configuration requirements?
- Dependencies on proprietary NVIDIA drivers?
- Known issues with ARM64 peer memory support?
3. Expected Performance Without GPU Direct RDMA
Is 208 Gbps (52% efficiency) reasonable for dual 200G links without GPU Direct RDMA? Or should we expect better performance even without it?
4. Alternative Optimization Methods
Without GPU Direct RDMA, are there other ways to improve dual-link performance (candidate settings sketched after this list):
- Specific NCCL environment variables?
- Different collective algorithms (all_reduce vs all_gather)?
- PCIe or IOMMU optimizations?
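These are the kinds of knobs being considered; the values below are guesses for discussion, not validated settings for DGX Spark:
`export NCCL_DEBUG=INFO               # confirm which transports/algorithms are actually selected`
`export NCCL_IB_QPS_PER_CONNECTION=4  # spread each connection across more queue pairs`
`export NCCL_ALGO=Ring                # pin the algorithm to compare Ring vs Tree`
`export NCCL_PROTO=Simple             # rule out protocol selection as the bottleneck`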
5. Module Parameters
The module exposes these parameters:
parm: peerdirect_support: 0 [default] or 1 [legacy, for MLNX_OFED 4.9 LTS]
parm: persistent_api_support: 0 [legacy] or 1 [default]
Neither `peerdirect_support=1` nor `persistent_api_support=0` helps load the module. Are there other parameters to try?
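For reference, the parameter attempts look roughly like this (the modprobe.d file is just one way to persist the option; neither variant changes the "Invalid argument" failure):
`sudo modprobe nvidia-peermem peerdirect_support=1`
`sudo modprobe nvidia-peermem persistent_api_support=0`
`# Optional: persist the parameter for future load attempts`
`echo "options nvidia-peermem peerdirect_support=1" | sudo tee /etc/modprobe.d/nvidia-peermem.conf`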
What’s been attempted:
- ✅ Verified physical cables connected and link up (verification commands sketched after this list)
- ✅ Configured both interfaces with separate subnets
- ✅ Set NCCL environment variables for dual interfaces:
  `export UCX_NET_DEVICES=enp1s0f0np0,enp1s0f1np1`
  `export NCCL_SOCKET_IFNAME=enp1s0f0np0,enp1s0f1np1`
  `export NCCL_IB_GID_INDEX=3`
  `export NCCL_IB_HCA=rocep1s0f0,rocep1s0f1`
- ✅ NCCL successfully bonds both cables (400G virtual device)
- ❌ Cannot load nvidia-peermem module
- ❌ GPU Direct RDMA remains disabled
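The link verification mentioned above was done along these lines (a minimal sketch using standard tools):
`# Confirm 200G link rate on each port`
`ethtool enp1s0f0np0 | grep -i speed`
`ethtool enp1s0f1np1 | grep -i speed`
`# Map RDMA devices to netdevs and check port state`
`rdma link show`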




