Enabling GPU Direct RDMA for DGX Spark Clustering

**TL;DR: Dual 200G QSFP setup achieving only 208 Gbps - GPU Direct RDMA issues on ARM64**

It appears we can achieve a 400G aggregate link by connecting two DGX Sparks with two cables.
Can GPU Direct RDMA be enabled in the next firmware update so we can get closer to the full 400 Gbps (50 GB/s) aggregate bandwidth?

This is as close as I got:

Environment Details:

Hardware:

  • 2x NVIDIA DGX Spark (Blackwell GB10 GPUs)
  • 2x 200G QSFP56 passive DAC cables (MCP1650-V00AE30 + Q56-200G-CU0-5)
  • NVIDIA ConnectX-7 200G NICs (mlx5_core driver)

Software:

  • OS: Ubuntu 24.04 LTS (Noble)
  • Kernel: 6.14.0-1013-nvidia (ARM64/aarch64)
  • NVIDIA Driver: 580.95.05 (open-source nvidia driver)
  • CUDA: 13.0
  • NCCL: 2.28.3+cuda13.0 (built with Blackwell compute_121 support)
  • InfiniBand: mlx5_ib, ib_core loaded and working

Current Setup

Successfully configured dual 200G cables between two DGX Spark systems:

  • Cable 1: enp1s0f0np0 - 192.168.100.x subnet
  • Cable 2: enp1s0f1np1 - 192.168.101.x subnet

NCCL correctly detects and bonds both cables:

  • NET/IB : Made virtual device [4] name=rocep1s0f0+rocep1s0f1 speed=400000 ndevs=2
  • NET/IB : Made virtual device [5] name=roceP2p1s0f0+roceP2p1s0f1 speed=400000 ndevs=2

Performance Results

NCCL all_gather_perf test (8GB buffer):

  • Achieved: 25.67 GB/s (~208 Gbps)
  • Theoretical max: 50 GB/s (400 Gbps)
  • Efficiency: ~52%

NCCL reports GPU Direct RDMA is disabled:

NET/IB : GPU Direct RDMA Disabled for HCA 0 'rocep1s0f0'
NET/IB : GPU Direct RDMA Disabled for HCA 1 'rocep1s0f1'

Problem: nvidia-peermem Module Won’t Load

The nvidia-peermem.ko module exists but fails to load:

$ sudo modprobe nvidia-peermem

modprobe: ERROR: could not insert 'nvidia_peermem': Invalid argument

Module information:

$ modinfo nvidia-peermem
filename:  /lib/modules/6.14.0-1013-nvidia/kernel/nvidia-580-open/nvidia-peermem.ko
version:   580.95.05
vermagic:  6.14.0-1013-nvidia SMP preempt mod_unload modversions aarch64
depends:

InfiniBand modules are loaded:

$ lsmod | grep -E "ib_core|mlx5"
mlx5_ib               503808  0
ib_uverbs             200704  2 rdma_ucm,mlx5_ib
ib_core               524288  11 rdma_cm,ib_ipoib,rpcrdma,ib_srp,iw_cm,ib_iser,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
mlx5_core            2678784  1 mlx5_ib

Unfortunately, kernel logs don’t provide specific error details (dmesg access restricted).

Questions

1. GPU Direct RDMA on ARM64 DGX Spark

Is GPU Direct RDMA supported on ARM64 architecture with DGX Spark? Should it work with the open-source NVIDIA driver (580.95.05)?

2. nvidia-peermem Module Loading

Why would nvidia-peermem.ko fail with “Invalid argument”?

Are there:

  • Specific kernel configuration requirements?
  • Dependencies on proprietary NVIDIA drivers?
  • Known issues with ARM64 peer memory support?

3. Expected Performance Without GPU Direct RDMA

Is 208 Gbps (52% efficiency) reasonable for dual 200G links without GPU Direct RDMA? Or should we expect better performance even without it?

4. Alternative Optimization Methods

Without GPU Direct RDMA, are there other ways to improve dual-link performance:

  • Specific NCCL environment variables?
  • Different collective algorithms (all_reduce vs all_gather)?
  • PCIe or IOMMU optimizations?

5. Module Parameters

The module has these parameters:

parm: peerdirect_support: 0 [default] or 1 [legacy, for MLNX_OFED 4.9 LTS]
parm: persistent_api_support: 0 [legacy] or 1 [default]

Neither peerdirect_support=1 nor persistent_api_support=0 helps load the module. Are there other parameters to try?

What’s been attempted:

  • ✅ Verified physical cables connected and link up
  • ✅ Configured both interfaces with separate subnets
  • ✅ Set NCCL environment variables for dual interfaces
export UCX_NET_DEVICES=enp1s0f0np0,enp1s0f1np1
export NCCL_SOCKET_IFNAME=enp1s0f0np0,enp1s0f1np1
export NCCL_IB_GID_INDEX=3
export NCCL_IB_HCA=rocep1s0f0,rocep1s0f1
  • ✅ NCCL successfully bonds both cables (400G virtual device)
  • ❌ Cannot load nvidia-peermem module
  • ❌ GPU Direct RDMA remains disabled

Q: Is GPUDirect RDMA supported on DGX Spark?

DGX Spark SoC is characterized by a unified memory architecture.

For performance reasons, specifically for CUDA contexts associated to the iGPU, the system memory returned by the pinned device memory allocators (e.g. cudaMalloc) cannot be coherently accessed by the CPU complex nor by I/O peripherals like PCI Express devices.

Hence the GPUDirect RDMA technology is not supported, and the mechanisms for direct I/O based on that technology, for example nvidia-peermem (for DOCA-Host), dma-buf or GDRCopy, do not work.

A compliant application should programmatically introspect the relevant platform capabilities, e.g. by querying CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_SUPPORTED (related to nv-p2p kernel APIs) or CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORT (related to dma-buf), and leverage an appropriate fallback.
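For concreteness, here is a minimal sketch of that introspection using the CUDA driver API (attribute spellings as they appear in recent cuda.h headers; per the note above, both queries are expected to return 0 on DGX Spark):

```c
// check_gdr.c - sketch; build with: nvcc check_gdr.c -o check_gdr -lcuda
#include <cuda.h>
#include <stdio.h>

int main(void) {
    CUdevice dev;
    int gdr = 0, dmabuf = 0;

    if (cuInit(0) != CUDA_SUCCESS || cuDeviceGet(&dev, 0) != CUDA_SUCCESS) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }

    // 1 => nv-p2p based GPUDirect RDMA is available (expected 0 on DGX Spark)
    cuDeviceGetAttribute(&gdr, CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_SUPPORTED, dev);

    // 1 => dma-buf export of device memory is available (also expected 0 here)
    cuDeviceGetAttribute(&dmabuf, CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED, dev);

    printf("GPUDirect RDMA supported: %d\n", gdr);
    printf("dma-buf supported:        %d\n", dmabuf);
    return 0;
}
```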

For example, for Linux RDMA applications based on the ib verbs library, we suggest allocating the communication buffers with the cudaHostAlloc API and registering them with the ibv_reg_mr function.
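A minimal sketch of that fallback is below, assuming the first verbs device is the one you want and omitting the queue-pair setup and error handling a real application needs (link against -libverbs and the CUDA runtime):

```c
// host_buf_rdma.c - sketch: cudaHostAlloc buffer registered for RDMA via ibv_reg_mr
#include <cuda_runtime.h>
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void) {
    const size_t len = 64UL << 20;                 /* 64 MB communication buffer */
    void *buf = NULL;

    /* Pinned host allocation; on DGX Spark the GPU can consume this pointer directly. */
    cudaHostAlloc(&buf, len, cudaHostAllocDefault);

    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    struct ibv_context *ctx  = ibv_open_device(devs[0]);   /* e.g. rocep1s0f0 */
    struct ibv_pd      *pd   = ibv_alloc_pd(ctx);

    /* Register the host buffer instead of a cudaMalloc'd device buffer. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);

    /* ... exchange rkey/address with the peer and post RDMA work requests here ... */

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    cudaFreeHost(buf);
    return 0;
}
```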


So, it’s not really a true unified memory then?

Could this also be a reason for the terrible mmap performance? It kind of makes sense now: I suspect that instead of an NVMe → DMA → RAM path, it goes NVMe → DMA → RAM → memcpy → (V)RAM, or something like that?


Before I dropped the note above, that was one of the assumptions Claude Code made; it assumed the current path looks like this, not factoring in the unified memory of each individual Spark:

GPU Memory → PCIe → CPU Memory → PCIe → Network Card → Network
^^^^^^^^^^
BOTTLENECK!

I find this to have been misleading. The whole point of the DGX Spark was to provide an environment that simulates the hardware characteristics of a full system.

Are there other options for zero-copy from the NIC to GPU memory? Would something like this be possible with future firmware updates?

The "Failed to insert module 'nvidia_peermem': Invalid argument" error is not specific to DGX Spark. I get those log entries on a Dell (non-Spark) system with an RTX A4000:

Nov 23 12:02:08 texas systemd-modules-load[463]: Failed to insert module 'nvidia_peermem': Invalid argument
Nov 23 12:02:09 texas systemd-modules-load[1008]: Failed to insert module 'nvidia_peermem': Invalid argument
Nov 23 12:02:10 texas systemd-modules-load[1041]: Failed to insert module 'nvidia_peermem': Invalid argument

Some info on why is at DGX Spark GPUDirect RDMA.

There are no other options for zero-copy from the NIC to GPU memory, and from what I can tell a firmware update won't add it. This is due to the architecture of the system. Unlike its bigger brothers, the GB10 has no memory attached to the GPU (normally HBM in the larger systems), so there is no GPU-local memory for the NIC to load data into directly. There is just LPDDR5 memory attached to the CPU, so as far as I can tell the GPU is always accessing data across the NVLink C2C (chip-to-chip) link that connects the CPU and GPU.

From my personal testing, this means you can RDMA data into a host-allocated buffer and then just pass that pointer to CUDA without an H2D copy, since the virtual addressing appears to be the same. But your mileage may vary. I do agree, though, that the marketing around this device could have been better, as this is really just 3/4 of a larger DGX system and may not be a great representation of how one of the larger systems operates.

While a little dated, here is a good diagram that represents what I think the architecture of the full-fat Grace Blackwell systems looks like: grace-hopper-overiew

And here is a terrible Photoshop of that picture that I think roughly depicts the architecture of the Spark:

TLDR: No GPU memory, no GPUDirect RDMA.


Chris, thanks for the reply. I'm still wrapping my head around this unified memory. After investigating this further: I can still write data from the NIC to memory as normal and then access that same memory from CUDA without an additional copy, correct?

So if I were to use something like DPDK, I could poll the NIC and write my data into a buffer allocated with cudaMallocManaged. This would make it available to both the CPU and GPU, correct?

It makes sense that we don't need the RDMA/GPUDirect stuff, because it's all in the same memory space. But ideally, we want to be able to access that memory space without an additional copy.

First off, I am still trying to wrap my head around all of this as well, so take everything I say with a grain of salt: all of it is from my own testing, and NVIDIA could come along and tell me I am wrong.

I have no personal experience with DPDK, so that may be different. But in my own testing I have used cudaHostAlloc and just plain C malloc, as both can be registered for RDMA. The pointers returned by either of those methods can be passed to a CUDA kernel directly and seem to have similar performance (thus no additional memory copy, just RDMA → host memory → kernel). cudaHostAlloc may have some benefit since it is page-locked. But I don't seem to need the unified memory from cudaMallocManaged, nor do I seem to need to specifically allocate GPU buffers with cudaMalloc and do an H2D copy. Also, I am not sure whether the pointer returned from cudaMallocManaged can be registered with RDMA, as I have not tested that.

Host buffers would typically not work in a CUDA kernel on a standard system with a dedicated GPU, since the memory addressing isn't the same: the GPU has its own dedicated memory and can't directly access CPU memory without an H2D copy (cudaMallocManaged handles the H2D and D2H copying automatically, from what I understand). That is why I think they say the Spark is characterized as a unified memory system: the virtual address space is the same, but the GPU doesn't have direct access to the memory like a truly unified memory system would. Instead it has to access it through the CPU.

Edit: Some further details. I actually do think a system with a dedicated GPU might be able to access CPU host memory directly through PCIe, at PCIe speeds (so much slower), when using cudaHostAlloc, but I can't remember at this point. Also, the "unified memory" of cudaMallocManaged is a bit misleading, as it has no relation to the unified memory on the Spark. cudaMallocManaged works on x86 systems with a dedicated GPU, and they call it unified because it manages the copying back and forth (D2H, H2D) automatically for you. I think calling that unified was a bad idea, as it is easily confused with actual unified memory.
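To illustrate the "no extra copy" part of this, here is a minimal sketch assuming a pinned cudaHostAlloc buffer; the plain malloc case depends on the pageable-memory-access behaviour described above and is not portable to discrete-GPU systems:

```cuda
// host_ptr_kernel.cu - sketch: hand a pinned host pointer straight to a kernel
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *p, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= s;
}

int main(void) {
    const int n = 1 << 20;
    float *buf = NULL;

    cudaHostAlloc(&buf, n * sizeof(float), cudaHostAllocDefault);
    for (int i = 0; i < n; i++) buf[i] = 1.0f;     // pretend this arrived via RDMA

    // No cudaMemcpy: the host pointer is passed to the kernel as-is.
    scale<<<(n + 255) / 256, 256>>>(buf, n, 2.0f);
    cudaDeviceSynchronize();

    printf("buf[0] = %f\n", buf[0]);               // expect 2.0, read back by the CPU
    cudaFreeHost(buf);
    return 0;
}
```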


Sorry for the blunt question, but if GDS and P2Pmem don’t work at all on Spark, what was the intended use case for such a high-end 200Gb dual-port ConnectX-7 in the first place?

It's still an ultra-low-latency InfiniBand connection. The intended use is to connect two Sparks into a cluster.
Would it be better if it supported GDS? Yes. Is it usable as is? Also yes. For large-ish dense models it increases inference performance 2x in vLLM. Training gains are even better.

You don’t benefit from GPU Direct if you’re writing your own code because of unified memory. The memory pointers for host and device are identical. So if a GPU writes to itself, and the CPU just pulls that memory pointer directly without using cudaMemcpy, you get maximum bandwidth. Then over 200GbE, you’re basically saturating the network connection already and can push data from GPU2 to GPU1 at >22GB/sec.

If you're using legacy software that falls back to cudaMemcpy when it doesn't detect GPUDirect, it's potentially slower. Even though cudaMemcpy has more bandwidth than 200 GbE, you are adding latency.

I wrote a benchmark to show this. It will only work for 2 nodes and has zero security features. It assumes that if you can SSH to a machine, you’re allowed to copy and execute code. The host copies the benchmark to ~/nccl_benchmark upon deployment.

NCCLBenchmark-DGXSpark-AlanBCDang.zip (35.5 MB)

┌─ GPU→GPU DIRECT (Rank 1 → Rank 0, unidirectional) ──────────┐
Tests: GPU2 buffer → network → GPU1 buffer (no CPU copies)
This is the path GPUDirect RDMA optimizes on discrete GPUs.
On unified memory, we get equivalent performance without it.

   Size    Bandwidth     Latency     Eff.   OK
   8 MB   20.39 GB/s    411.5 µs    81.5%   ✓
  16 MB   21.45 GB/s    782.3 µs    85.8%   ✓
  32 MB   21.48 GB/s   1562.4 µs    85.9%   ✓
  64 MB   22.31 GB/s   3007.6 µs    89.3%   ✓
