GPUDirect Storage (GDS) to SSD with RDMA

Hello, I’m trying to set up GPUDirect Storage to an SSD, but I think something went wrong in the installation.
When I run this command: /usr/local/cuda-12.4/gds/tools/gdscheck.py -p, I get the following in my terminal:

 GDS release version: 1.9.1.3
 nvidia_fs version:  2.17 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Unsupported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Enabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384 
 properties.posix_pool_slab_count : 128 64 32 
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 4
 execution.max_io_queue_depth : 128
 execution.parallel_io : true
 execution.min_io_threshold_size_kb : 8192
 execution.max_request_parallelism : 4
 properties.force_odirect_mode : false
 properties.prefer_iouring : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA RTX 6000 Ada Generation bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Nvidia Driver Info Status: Supported only on (nvidia-fs version <= 2.17.4)
 Cuda Driver Version Installed:  12040
 Platform: Precision 7960 Rack, Arch: x86_64(Linux 6.2.0-060200-generic)
 Platform verification succeeded

In the output above you can see that NVMe is not supported, the rdma library is not loaded, the rdma devices are not configured, and the rdma device status is not Up: 1. Moreover, I’m not sure about the line "Nvidia Driver Info Status: Supported only on (nvidia-fs version <= 2.17.4)". Is this a warning, or do I have to change the version of nvidia-fs?
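For what it’s worth, this is how I compared the loaded module against that notice (modinfo is standard, and the /proc file is created by nvidia_fs itself; its first lines include the driver version, as shown further below):

modinfo -F version nvidia_fs
cat /proc/driver/nvidia-fs/stats | head -n 4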

I also checked that nvidia_fs and nvidia_peermem are loaded:

celis@celis-bb-rpr-1:~$ lsmod | grep nvidia_peermem
nvidia_peermem         16384  0
nvidia              54411264  32 nvidia_uvm,nvidia_peermem,nvidia_modeset
ib_uverbs             196608  3 nvidia_peermem,rdma_ucm,mlx5_ib
(base) celis@celis-bb-rpr-1:~$ lsmod | grep nvidia_fs
nvidia_fs             278528  0

Versions installed:
My kernel version is: 6.2.0-060200-generic
CUDA version: release 12.4, V12.4.131
Nvidia Driver version: 550.163.01
DOCA-OFED version: OFED-internal-25.01-0.6.0:
GPU NVIDIA: RTX 6000 Ada
Nvidia-fs version: 2.17.4
GDS version: 1.9.1.3

Looks like you have the proprietary RM driver installed. nvidia-fs.ko is a GPL v2 module and needs OpenRM for newer Linux versions.

Please try installing the open RM driver, and also install the required NVMe support with DOCA.
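For example, with the classic MLNX_OFED installer that looks roughly like the sketch below. The driver version and installer flags are illustrative; with a DOCA-packaged OFED the equivalent is choosing the appropriate installation profile, so check the DOCA documentation for your release:

sudo apt install nvidia-driver-560-open        # open kernel modules flavor of the driver
sudo ./mlnxofedinstall --with-nvmf --enable-gds --add-kernel-support
sudo update-initramfs -u && sudo reboot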


Perfect, now NVMe is supported, but Mellanox PeerDirect is disabled, the rdma library is not loaded, the rdma devices are not configured, and the rdma device status shows Up: 0. We installed the NVIDIA open driver 560, and the kernel is 6.8.0-52-generic.

GDS release version: 1.9.1.3
nvidia_fs version:  2.25 libcufile version: 2.12
Platform: x86_64
============
ENVIRONMENT:
============
=====================
DRIVER CONFIGURATION:
=====================
NVMe               : Supported
NVMeOF             : Unsupported
SCSI               : Unsupported
ScaleFlux CSD      : Unsupported
NVMesh             : Unsupported
DDN EXAScaler      : Unsupported
IBM Spectrum Scale : Unsupported
NFS                : Unsupported
BeeGFS             : Unsupported
WekaFS             : Unsupported
Userspace RDMA     : Unsupported
--Mellanox PeerDirect : Disabled
--rdma library        : Not Loaded (libcufile_rdma.so)
--rdma devices        : Not configured
--rdma_device_status  : Up: 0 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_compat_mode : true
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384 
properties.posix_pool_slab_count : 128 64 32 
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.beegfs.posix_gds_min_kb: 0
fs.weka.rdma_write_support: false
fs.gpfs.gds_write_support: false
profile.nvtx : false
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : false
execution.max_io_threads : 4
execution.max_io_queue_depth : 128
execution.parallel_io : true
execution.min_io_threshold_size_kb : 8192
execution.max_request_parallelism : 4
properties.force_odirect_mode : false
properties.prefer_iouring : false
=========
GPU INFO:
=========
GPU index 0 NVIDIA RTX 6000 Ada Generation bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
GPU index 1 NVIDIA RTX 6000 Ada Generation bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
==============
PLATFORM INFO:
==============
IOMMU: disabled
Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
Cuda Driver Version Installed:  12060
Platform: Precision 7960 Rack, Arch: x86_64(Linux 6.8.0-52-generic)
Platform verification succeeded

cat /proc/driver/nvidia-fs/stats
GDS Version: 1.14.0.28 
NVFS statistics(ver: 4.0)
NVFS Driver(version: 2.25.6)
Mellanox PeerDirect Supported: False
IO stats: Disabled, peer IO stats: Disabled
Logging level: info
 
Active Shadow-Buffer (MiB): 0
Active Process: 0
Reads				: err=0 io_state_err=0
Sparse Reads		        : n=0 io=0 holes=0 pages=0 
Writes				: err=0 io_state_err=0 pg-cache=0 pg-cache-fail=0 pg-cache-eio=0
Mmap				: n=0 ok=0 err=0 munmap=0
Bar1-map			: n=0 ok=0 err=0 free=0 callbacks=0 active=0 delay-frees=0
Error				: cpu-gpu-pages=0 sg-ext=0 dma-map=0 dma-ref=0
Ops				: Read=0 Write=0 BatchIO=0

Also, the ofed_info command shows the different versions installed, for your information.
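In case it helps, these are the commands I use to check the OFED version and how the ConnectX ports map to network interfaces (both ship with MLNX_OFED/DOCA-OFED; output omitted here):

ofed_info -s
ibdev2netdev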

Thank you!!


You don’t need nvidia_peermem or dmabuf support if you only need local SSD support.

nvidia_peermem or dmabuf support is only needed for userspace RDMA, WekaFS, or GPFS.

If you need to configure userspace RDMA for GPFS or WEKA, please see the individual sections in the NVIDIA GPUDirect Storage Installation and Troubleshooting Guide.

Yes, but in my case I need RDMA with peermem, as I have a ConnectX NIC through which we transfer data to NVMe disks and GPUs hosted on other servers.
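So I will try listing the ConnectX interfaces in /etc/cufile.json, which as far as I understand from the installation guide is where the RDMA NICs for cuFile are configured (the addresses below are placeholders for our ports’ IPs):

{
  "properties": {
    "rdma_dev_addr_list": [ "192.168.0.12", "192.168.1.12" ]
  }
}

After that, gdscheck -p should report the rdma devices as configured.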