DGX Spark GPUDirect RDMA

Does the DGX Spark support GPUDirect RDMA? On a normal x86 system with a GPU and a ConnectX card, the nvidia-peermem module has to be installed and loaded before calling ibv_reg_mr on a GPU buffer, otherwise the call seg faults. There does not appear to be a peermem module pre-installed on the Spark, and I do get a seg fault when calling ibv_reg_mr with a GPU buffer. Does peermem need to be installed, or does some other module need to be installed or activated (I have seen an nvidia-p2p module mentioned for the Orin)? Or do I need an entirely different approach on the Spark, or is this just not currently supported?
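
For reference, this is roughly the registration pattern I mean (a minimal sketch with a made-up helper name; queue-pair setup and error handling are omitted):

#include <infiniband/verbs.h>
#include <cuda_runtime.h>

/* Minimal sketch of the pattern in question (hypothetical helper, error
 * handling trimmed). On x86 with a ConnectX NIC this only works once
 * nvidia-peermem is loaded, since the HCA has to register the raw device
 * pointer. */
int register_gpu_buffer(struct ibv_pd *pd, size_t len, struct ibv_mr **mr_out)
{
    void *gpu_buf = NULL;

    if (cudaMalloc(&gpu_buf, len) != cudaSuccess)
        return -1;

    /* With GPUDirect RDMA, the device pointer is registered directly. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        cudaFree(gpu_buf);
        return -1;
    }

    *mr_out = mr;
    return 0;
}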

Specifically, I get this error when trying to load the peermem module:

modprobe: ERROR: could not insert 'nvidia_peermem': Invalid argument

So maybe the module does exist, but just isn't loading for some reason.

Same problem here. Some debug:

sonata@spark-05de:~$ sudo modprobe nvidia-peermem
[sudo] password for sonata:
modprobe: ERROR: could not insert 'nvidia_peermem': Invalid argument
sonata@spark-05de:~$ sudo modinfo nvidia-peermem
filename:       /lib/modules/6.11.0-1016-nvidia/kernel/nvidia-580-open/nvidia-peermem.ko
version:        580.95.05
license:        Dual BSD/GPL
description:    NVIDIA GPU memory plug-in
author:         Yishai Hadas
srcversion:     890AFFA635D55BDFFC7CFAE
depends:
name:           nvidia_peermem
vermagic:       6.11.0-1016-nvidia SMP preempt mod_unload modversions aarch64
sig_id:         PKCS#7
signer:         Canonical Ltd. Kernel Module Signing
sig_key:        E9:DF:13:0F:92:92:A9:B7
sig_hashalgo:   sha512
signature:      9A:D7:8E:63:77:91:1B:A6:83:83:C0:E8:17:92:DF:2B:B9:9A:52:C4:
		54:45:6F:87:DF:03:F8:CE:C5:F3:A8:43:8B:D5:72:A2:BC:4A:D5:44:
		56:B7:2C:FD:F1:5F:1F:A7:43:9F:27:BF:9D:AE:53:A0:94:B5:3F:31:
		AC:84:07:6A:6C:A0:2D:B3:CE:F5:1E:AF:63:26:DF:93:FB:D8:06:C5:
		A6:52:DE:B4:F3:6E:1B:4C:AA:D7:D9:40:13:5A:2B:4D:0C:56:43:0F:
		7C:40:ED:4B:7C:DA:3A:97:17:8C:A9:58:69:94:CD:02:5E:A1:2E:3E:
		B5:16:10:22:BD:0F:26:8F:8A:D2:55:B4:21:BD:C4:D7:57:EC:AC:F6:
		FD:18:CA:F7:70:C8:26:E9:E7:86:F3:BF:F8:D3:74:EE:E1:04:AF:EF:
		ED:D2:AA:08:3B:17:F5:47:00:47:C4:B8:6C:B3:5C:B2:58:A0:BE:01:
		C2:55:0F:F9:90:B8:6E:F1:B6:4E:9C:C4:6E:B2:87:6C:D2:56:68:E8:
		8B:CB:70:51:4E:E4:ED:89:56:31:7F:66:26:60:53:BB:4A:0A:5D:C8:
		5E:26:8E:EE:C7:AC:84:2B:80:2A:B2:48:40:4E:7D:85:E7:71:BF:ED:
		BD:A9:A9:40:70:CA:BE:25:95:DD:39:38:A5:F3:29:E4:53:58:C3:E0:
		78:EA:7A:D5:30:1A:AC:7B:49:EF:08:AB:A1:19:EC:FD:4E:2D:0D:59:
		6E:39:71:BD:A0:DA:2D:33:5E:14:F1:7D:F2:2D:C0:C2:5B:A8:E0:FD:
		1C:E7:0A:40:39:7B:6A:64:FE:D7:10:51:D0:1F:35:68:72:F0:40:30:
		8A:05:FC:15:84:E1:96:09:99:2B:3C:D5:04:7D:50:B7:23:DE:07:AA:
		19:FC:3B:8F:94:AA:55:E2:AF:28:4C:13:96:04:8B:55:D7:66:3E:B5:
		6B:A8:11:AE:D3:C9:1D:F6:61:A8:29:57:7E:2F:44:A9:9E:78:15:0B:
		0F:9C:6D:D9:1E:5E:31:19:E1:20:AB:E5:3B:BE:F0:72:AA:F0:B3:63:
		33:FE:DA:DA:23:FC:87:A7:59:46:68:8B:DD:E2:87:EB:46:BE:78:3C:
		DD:BC:9B:F6:DA:78:13:2C:FC:0F:40:31:48:65:BF:38:BD:A1:92:F6:
		B4:51:68:17:E4:DA:F9:DB:5E:5C:94:E5:FA:4F:0F:78:6A:3E:71:66:
		C7:C0:A2:4E:60:0A:4F:05:51:71:B4:8E:29:B2:01:59:9E:4F:F0:1C:
		74:53:CD:54:4D:33:DD:3A:81:75:F8:38:EC:1D:92:AF:E2:D7:7A:21:
		D6:1F:F8:DB:99:74:D8:20:51:71:75:68
parm:           peerdirect_support:Set level of support for Peer-direct, 0 [default] or 1 [legacy, for example MLNX_OFED 4.9 LTS] (int)
parm:           persistent_api_support:Set level of support for persistent APIs, 0 [legacy] or 1 [default] (int)

dmesg is clean; there is nothing from peermem in it.

So, my dmesg was also clean, and modinfo for my peermem module looks the same. The interesting thing, though, is the depends: line. It is empty, whereas on a normal system with peermem it looks like:

depends: nvidia,ib_uverbs

This led me to investigate the module further, as I don't understand why the Spark module wouldn't also depend on the nvidia and ib_uverbs modules. So I ran the following:

objdump -d /lib/modules/6.11.0-1016-nvidia/kernel/nvidia-580-open/nvidia-peermem.ko 

And got:

/lib/modules/6.11.0-1016-nvidia/kernel/nvidia-580-open/nvidia-peermem.ko:     file format elf64-littleaarch64


Disassembly of section .init.text:

0000000000000000 <init_module-0x8>:
   0:	d503201f 	nop
   4:	d503201f 	nop

0000000000000008 <init_module>:
   8:	d503201f 	nop
   c:	d503201f 	nop
  10:	128002a0 	mov	w0, #0xffffffea            	// #-22
  14:	d65f03c0 	ret

Disassembly of section .exit.text:

0000000000000000 <cleanup_module>:
   0:	d65f03c0 	ret

Disassembly of section .plt:

0000000000000000 <.plt>:
	...

Disassembly of section .text.ftrace_trampoline:

0000000000000000 <.text.ftrace_trampoline>:
	...

This is why the module returns the invalid argument: it is, in essence, empty. However, the source for the peermem module does seem to be present on the Spark in /usr/src/nvidia-580.95.05/nvidia-peermem. Looking through the code, the module expects NV_MLNX_IB_PEER_MEM_SYMBOLS_PRESENT to be defined; otherwise its init function just returns -EINVAL, which becomes the 'Invalid argument' we see when running modprobe. See here:

https://github.com/NVIDIA/open-gpu-kernel-modules/blob/2b436058a616676ec888ef3814d1db6b2220f2eb/kernel-open/nvidia-peermem/nvidia-peermem.c#L641
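
To illustrate why an init function that just returns -EINVAL shows up as exactly this modprobe error, here is a minimal stand-alone module with the same shape (a hypothetical illustration, not NVIDIA's code):

/* peermem_stub.c - hypothetical illustration, not NVIDIA's code: a module
 * whose init collapses to "return -EINVAL" unless the symbol-check macro
 * was defined at build time. */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/errno.h>

static int __init stub_init(void)
{
#if defined(NV_MLNX_IB_PEER_MEM_SYMBOLS_PRESENT)
    /* the real module registers its peer-memory client with ib_core here */
    return 0;
#else
    /* modprobe reports this as "Invalid argument" */
    return -EINVAL;
#endif
}

static void __exit stub_exit(void)
{
}

module_init(stub_init);
module_exit(stub_exit);
MODULE_LICENSE("Dual BSD/GPL");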

Also, NV_MLNX_IB_PEER_MEM_SYMBOLS_PRESENT seems to get defined here when building the module:

https://github.com/NVIDIA/open-gpu-kernel-modules/blob/2b436058a616676ec888ef3814d1db6b2220f2eb/kernel-open/conftest.sh#L3277

I assume it was not defined when the module on the Spark was built, since there is no /usr/src/ofa_kernel directory or DKMS source for OFED on the Spark (or wherever the module was actually built). That is why I think the module is empty and does not load.

So maybe the DOCA-OFED drivers need to be installed and the module rebuilt or replaced? However, the DGX OS user guide specifically states that the Spark doesn't require DOCA-OFED.

So I'm not really sure what is going on. Maybe someone from NVIDIA could respond and clarify whether GPUDirect/peermem should work, will be supported at some point, isn't intended to work, or whether some other method should be used as an alternative.

The DGX Spark SoC is characterized by a unified memory architecture.

For performance reasons, specifically for CUDA contexts associated with the iGPU, the system memory returned by the pinned device memory allocators (e.g. cudaMalloc) cannot be coherently accessed by the CPU complex or by I/O peripherals such as PCI Express devices.

Hence the GPUDirect RDMA technology is not supported, and the mechanisms for direct I/O based on that technology, for example nvidia-peermem (for DOCA-Host), dma-buf or GDRCopy, do not work.

A compliant application should programmatically introspect the relevant platform capabilities, e.g. by querying CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_SUPPORTED (related to the nv-p2p kernel APIs) or CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED (related to dma-buf), and use an appropriate fallback.
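
A minimal probe of those two attributes could look like this (a sketch using the CUDA driver API; error checking omitted):

#include <cuda.h>
#include <stdio.h>

int main(void)
{
    CUdevice dev;
    int rdma = 0, dmabuf = 0;

    cuInit(0);
    cuDeviceGet(&dev, 0);

    /* Can the nv-p2p path (GPUDirect RDMA / nvidia-peermem) be used? */
    cuDeviceGetAttribute(&rdma,
        CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_SUPPORTED, dev);

    /* Can device memory be exported as a dma-buf for registration? */
    cuDeviceGetAttribute(&dmabuf,
        CU_DEVICE_ATTRIBUTE_DMA_BUF_SUPPORTED, dev);

    printf("GPUDirect RDMA supported: %d, dma-buf supported: %d\n",
           rdma, dmabuf);

    /* Per the note above, both are expected to be 0 on DGX Spark, so the
     * application should take the host-pinned fallback path. */
    return 0;
}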

For example, for Linux RDMA applications based on the ib verbs library, we suggest allocating the communication buffers with the cudaHostAlloc API and registering them with the ibv_reg_mr function.
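
A sketch of that fallback, assuming the rest of the verbs setup (device, protection domain, queue pair) already exists and with error handling trimmed:

#include <infiniband/verbs.h>
#include <cuda_runtime.h>

/* Hypothetical helper: allocate a pinned host buffer and register it with
 * verbs. On the Spark's unified memory the GPU can also access this
 * allocation directly, so no device-pointer registration is needed. */
int alloc_and_register_fallback(struct ibv_pd *pd, size_t len,
                                void **buf_out, struct ibv_mr **mr_out)
{
    void *buf = NULL;

    /* Page-locked, host-resident allocation visible to CPU, GPU and NIC. */
    if (cudaHostAlloc(&buf, len, cudaHostAllocDefault) != cudaSuccess)
        return -1;

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        cudaFreeHost(buf);
        return -1;
    }

    *buf_out = buf;
    *mr_out = mr;
    return 0;
}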
