Hello, I'm trying to set up GPUDirect Storage to an SSD, but I think something is wrong with the installation.
When I run this command: /usr/local/cuda-12.4/gds/tools/gdscheck.py -p I get the following output in my terminal:
GDS release version: 1.9.1.3
nvidia_fs version: 2.17 libcufile version: 2.12
Platform: x86_64
============
ENVIRONMENT:
============
=====================
DRIVER CONFIGURATION:
=====================
NVMe : Unsupported
NVMeOF : Unsupported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
BeeGFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Enabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_compat_mode : true
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 32
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.beegfs.posix_gds_min_kb: 0
fs.weka.rdma_write_support: false
fs.gpfs.gds_write_support: false
profile.nvtx : false
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : false
execution.max_io_threads : 4
execution.max_io_queue_depth : 128
execution.parallel_io : true
execution.min_io_threshold_size_kb : 8192
execution.max_request_parallelism : 4
properties.force_odirect_mode : false
properties.prefer_iouring : false
=========
GPU INFO:
=========
GPU index 0 NVIDIA RTX 6000 Ada Generation bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
==============
PLATFORM INFO:
==============
IOMMU: disabled
Nvidia Driver Info Status: Supported only on (nvidia-fs version <= 2.17.4)
Cuda Driver Version Installed: 12040
Platform: Precision 7960 Rack, Arch: x86_64(Linux 6.2.0-060200-generic)
Platform verification succeeded
In the output above you can see that NVMe is unsupported, the rdma library is not loaded, the rdma devices are not configured, and rdma_device_status shows Up: 0. I'm also not sure about the line "Nvidia Driver Info Status: Supported only on (nvidia-fs version <= 2.17.4)": is this a warning, or do I have to change the version of nvidia-fs?
I also checked that nvidia_fs and nvidia_peermem are loaded:
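To compare runs without reading the whole report by eye, the DRIVER CONFIGURATION section can be pulled into a dictionary. This is only a rough sketch: `parse_driver_config` and `SAMPLE_REPORT` are names made up for illustration, and the parsing assumes the `name : status` layout seen in the output pasted in this thread (real gdscheck output may pad the columns differently, which the regular expression tolerates).

```python
import re

# One "name : status" entry, optionally prefixed with "--" (RDMA sub-items).
ENTRY = re.compile(r"^(?:--)?(?P<name>.+?)\s*:\s+(?P<status>.+)$")


def parse_driver_config(report: str) -> dict:
    """Return {driver name: status} for the DRIVER CONFIGURATION section."""
    statuses = {}
    in_section = False
    for raw in report.splitlines():
        line = raw.strip()
        # Skip blank lines and the "=====" rulers around section titles.
        if not line or set(line) == {"="}:
            continue
        # A line ending in ":" is a section title such as "GPU INFO:".
        if line.endswith(":"):
            in_section = line == "DRIVER CONFIGURATION:"
            continue
        if in_section:
            match = ENTRY.match(line)
            if match:
                statuses[match.group("name")] = match.group("status")
    return statuses


# Excerpt in the same layout as the gdscheck output shown in this thread.
SAMPLE_REPORT = """\
=====================
DRIVER CONFIGURATION:
=====================
NVMe : Supported
NVMeOF : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Disabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_compat_mode : true
"""

if __name__ == "__main__":
    for name, status in parse_driver_config(SAMPLE_REPORT).items():
        print(f"{name}: {status}")
```

On a live system the same function could be fed the captured output of `gdscheck.py -p` instead of the sample string.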
celis@celis-bb-rpr-1:~$ lsmod | grep nvidia_peermem
nvidia_peermem 16384 0
nvidia 54411264 32 nvidia_uvm,nvidia_peermem,nvidia_modeset
ib_uverbs 196608 3 nvidia_peermem,rdma_ucm,mlx5_ib
(base) celis@celis-bb-rpr-1:~$ lsmod | grep nvidia_fs
nvidia_fs 278528 0
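The same module check can be scripted. The sketch below is an illustration only: `loaded_modules` and `SAMPLE_LSMOD` are hypothetical names, and the sample text is copied from the `lsmod` lines above rather than read from a live system.

```python
def loaded_modules(lsmod_text: str) -> set:
    """Return the module names found in `lsmod`-style text."""
    modules = set()
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Data rows look like "name size refcount [users]"; the header row
        # ("Module Size Used by") has no numeric second field and is skipped.
        if len(fields) >= 3 and fields[1].isdigit():
            modules.add(fields[0])
    return modules


# Lines taken from the lsmod output shown in this post.
SAMPLE_LSMOD = """\
Module                  Size  Used by
nvidia_peermem 16384 0
nvidia 54411264 32 nvidia_uvm,nvidia_peermem,nvidia_modeset
ib_uverbs 196608 3 nvidia_peermem,rdma_ucm,mlx5_ib
nvidia_fs 278528 0
"""

if __name__ == "__main__":
    present = loaded_modules(SAMPLE_LSMOD)
    for wanted in ("nvidia_fs", "nvidia_peermem"):
        print(wanted, "loaded" if wanted in present else "missing")
```

To run it against the real machine, the text could come from `subprocess.run(["lsmod"], capture_output=True, text=True).stdout`.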
Versions installed:
My kernel version is: 6.2.0-060200-generic
CUDA version: release 12.4, V12.4.131
Nvidia Driver version: 550.163.01
DOCA-OFED version: OFED-internal-25.01-0.6.0
GPU NVIDIA: RTX 6000 Ada
Nvidia-fs version: 2.17.4
GDS version: 1.9.1.3
It looks like you have the proprietary RM driver installed. nvidia-fs.ko is a GPL v2 module and needs OpenRM on newer Linux versions.
Please try installing the open RM driver, and also install the required NVMe support with DOCA.
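One way to tell which driver flavor is installed is the license string reported by `modinfo nvidia`. The sketch below rests on an assumption that is not stated in this thread: that the open kernel modules report a license containing "GPL" (for example "Dual MIT/GPL") while the proprietary module reports "NVIDIA". The function name `driver_flavor` and both sample strings are made up for illustration.

```python
def driver_flavor(modinfo_text: str) -> str:
    """Classify `modinfo nvidia` text as 'open', 'proprietary' or 'unknown'."""
    for line in modinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "license":
            license_name = value.strip()
            # Assumption: the open kernel modules carry a GPL-compatible license.
            if "GPL" in license_name:
                return "open"
            # Assumption: the proprietary module reports the license "NVIDIA".
            if license_name == "NVIDIA":
                return "proprietary"
            return "unknown"
    return "unknown"


# Hypothetical modinfo excerpts, not captured from the machine in this thread.
OPEN_SAMPLE = "filename: nvidia.ko\nlicense:        Dual MIT/GPL\n"
PROPRIETARY_SAMPLE = "filename: nvidia.ko\nlicense:        NVIDIA\n"

if __name__ == "__main__":
    print(driver_flavor(OPEN_SAMPLE))
    print(driver_flavor(PROPRIETARY_SAMPLE))
```

The "Nvidia Driver Info Status" line in the gdscheck output remains the authoritative check; this only gives a quick answer before GDS is reinstalled.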
Perfect, NVMe is now supported, but Mellanox PeerDirect is disabled, the rdma library is not loaded, the rdma devices are not configured, and rdma_device_status shows Up: 0. We installed the open NVIDIA driver 560, and the kernel is 6.8.0-52-generic.
GDS release version: 1.9.1.3
nvidia_fs version: 2.25 libcufile version: 2.12
Platform: x86_64
============
ENVIRONMENT:
============
=====================
DRIVER CONFIGURATION:
=====================
NVMe : Supported
NVMeOF : Unsupported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
BeeGFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Disabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_compat_mode : true
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 16384
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 33554432
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 32
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.beegfs.posix_gds_min_kb: 0
fs.weka.rdma_write_support: false
fs.gpfs.gds_write_support: false
profile.nvtx : false
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : false
execution.max_io_threads : 4
execution.max_io_queue_depth : 128
execution.parallel_io : true
execution.min_io_threshold_size_kb : 8192
execution.max_request_parallelism : 4
properties.force_odirect_mode : false
properties.prefer_iouring : false
=========
GPU INFO:
=========
GPU index 0 NVIDIA RTX 6000 Ada Generation bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
GPU index 1 NVIDIA RTX 6000 Ada Generation bar:1 bar size (MiB):256 supports GDS, IOMMU State: Disabled
==============
PLATFORM INFO:
==============
IOMMU: disabled
Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
Cuda Driver Version Installed: 12060
Platform: Precision 7960 Rack, Arch: x86_64(Linux 6.8.0-52-generic)
Platform verification succeeded
cat /proc/driver/nvidia-fs/stats
GDS Version: 1.14.0.28
NVFS statistics(ver: 4.0)
NVFS Driver(version: 2.25.6)
Mellanox PeerDirect Supported: False
IO stats: Disabled, peer IO stats: Disabled
Logging level: info
Active Shadow-Buffer (MiB): 0
Active Process: 0
Reads : err=0 io_state_err=0
Sparse Reads : n=0 io=0 holes=0 pages=0
Writes : err=0 io_state_err=0 pg-cache=0 pg-cache-fail=0 pg-cache-eio=0
Mmap : n=0 ok=0 err=0 munmap=0
Bar1-map : n=0 ok=0 err=0 free=0 callbacks=0 active=0 delay-frees=0
Error : cpu-gpu-pages=0 sg-ext=0 dma-map=0 dma-ref=0
Ops : Read=0 Write=0 BatchIO=0
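The fields of interest in that stats file can be extracted with a few lines of Python. This is a sketch under the assumption that the file keeps the layout shown above; `parse_nvfs_stats` and `SAMPLE_STATS` are illustrative names, and the sample is copied from the output in this post.

```python
def parse_nvfs_stats(text: str) -> dict:
    """Extract version fields and the PeerDirect flag from nvidia-fs stats text."""
    info = {}
    driver_prefix = "NVFS Driver(version:"
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("GDS Version:"):
            info["gds_version"] = line.split(":", 1)[1].strip()
        elif line.startswith(driver_prefix):
            # "NVFS Driver(version: 2.25.6)" -> "2.25.6"
            info["nvfs_driver"] = line[len(driver_prefix):].rstrip(")").strip()
        elif line.startswith("Mellanox PeerDirect Supported:"):
            value = line.split(":", 1)[1].strip()
            info["peerdirect_supported"] = value.lower() == "true"
    return info


# First lines of the /proc/driver/nvidia-fs/stats output shown in this post.
SAMPLE_STATS = """\
GDS Version: 1.14.0.28
NVFS statistics(ver: 4.0)
NVFS Driver(version: 2.25.6)
Mellanox PeerDirect Supported: False
IO stats: Disabled, peer IO stats: Disabled
"""

if __name__ == "__main__":
    print(parse_nvfs_stats(SAMPLE_STATS))
```

On the machine itself the text would come from reading `/proc/driver/nvidia-fs/stats`, which only exists while the nvidia_fs module is loaded.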
For your information, the installed versions can also be seen with the ofed_info command.
Thank you!!
> Perfect, now i have the NVMe supported, but i have the mellanox peerdirect disabled, rdma library is not loaded, rdma devices is not configured and rdma device status is 0 in up.
You don't need nvidia_peermem or dmabuf support if you only need local SSD support.
nvidia_peermem or dmabuf support is only needed for userspace RDMA, WekaFS, or GPFS.
If you need to configure userspace RDMA for GPFS or WekaFS, please see the individual sections in the NVIDIA GPUDirect Storage Installation and Troubleshooting Guide.
Yes, but in my case I need RDMA with peermem, because we have a ConnectX adapter through which we transfer data to NVMe disks and GPUs hosted on other servers.