
Segfault when testing GPU RDMA on ConnectX-7 IB on Ubuntu 22.04 #326

@qoofyk

Description


Hardware:
GPU: 8x H100-80G-SXM with 4x NVSwitch
Device Type: ConnectX-7

  • Description: NVIDIA ConnectX-7 Single Port InfiniBand NDR OSFP Adapter
  • FW: 28.41.1000
  • PXE: 3.7.0400
  • UEFI: 14.34.0012

Software Stacks:

  • Datacenter driver: 570.133.20 (open kernel modules)

  • CUDA toolkit: 12.8

  • MOFED: MLNX_OFED_LINUX-24.10-2.1.8.0-ubuntu22.04-x86_64

  • Ubuntu: 22.04.3 on bare metal

  • Kernel: 5.15.0-139-generic

  • gdrcopy: 2.5 (release)

  • UCX: 1.18.1

  • perftest: 25.01.0 (release)

  • ACS has been disabled in the OS (a scripted check across all PCIe bridges is sketched after this list):
    sudo lspci -vvv | grep ACSCtl
    ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

  • ATS has been disabled on all 8x CX-7
for i in $(seq 0 7); do mlxconfig -d mlx5_$i query | grep ATS_ENABLED; done
    ATS_ENABLED False(0)
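
For completeness, here is a quick scripted check that no PCI bridge in the path still has ACS source validation enabled. This is a minimal sketch using only standard pciutils (the ::0604 filter selects PCI bridge class devices); adjust it to your topology as needed:

    # Flag any PCI bridge that still has ACS source validation enabled
    for bdf in $(lspci -d ::0604 | awk '{print $1}'); do
        sudo lspci -s "$bdf" -vvv | grep -q 'ACSCtl:.*SrcValid+' \
            && echo "ACS still enabled on $bdf"
    done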

Sanity Tests:

  1. gdrcopy sanity has passed (a minimal sketch of the API it exercises follows this list)
    $ gdrcopy_sanity
    Total: 36, Passed: 31, Failed: 0, Waived: 5
    List of waived tests:
    basic_v2_forcepci_cumemalloc
    basic_v2_forcepci_vmmalloc
    basic_with_tokens
    data_validation_mix_mappings_cumemalloc
    data_validation_v2_forcepci_cumemalloc

  2. perftest without CUDA works, reaching the expected 396 Gb/s at all message sizes

  3. perftest with CUDA segfaults in both cases:
    with --use_cuda_dmabuf
    without it (i.e., falling back to nvidia_peermem)
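
For reference, the gdrcopy_sanity cases above exercise roughly the pin/map/copy round trip below. This is a minimal sketch of the public gdrapi.h API with error handling trimmed, not the test's actual code:

    #include <cuda.h>
    #include <gdrapi.h>

    #define COPY_SIZE (64 * 1024)  /* one 64 KiB GPU page */

    int gdr_roundtrip(void)
    {
        CUdevice dev; CUcontext ctx; CUdeviceptr d_buf;
        char host[COPY_SIZE] = {0};
        void *bar_ptr;
        gdr_mh_t mh;

        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);

        /* Over-allocate so the pinned range can be GPU-page aligned,
         * as the gdrcopy tests themselves do. */
        cuMemAlloc(&d_buf, COPY_SIZE + GPU_PAGE_SIZE);
        CUdeviceptr aligned = (d_buf + GPU_PAGE_SIZE - 1) & GPU_PAGE_MASK;

        gdr_t g = gdr_open();
        gdr_pin_buffer(g, aligned, COPY_SIZE, 0, 0, &mh);  /* pin via gdrdrv */
        gdr_map(g, mh, &bar_ptr, COPY_SIZE);               /* map BAR1 window */

        gdr_copy_to_mapping(mh, bar_ptr, host, COPY_SIZE);   /* CPU -> GPU */
        gdr_copy_from_mapping(mh, host, bar_ptr, COPY_SIZE); /* GPU -> CPU */

        gdr_unmap(g, mh, bar_ptr, COPY_SIZE);
        gdr_unpin_buffer(g, mh);
        gdr_close(g);
        cuMemFree(d_buf);
        return 0;
    }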

$ gdb -q --args ./ib_send_bw -a -q 4 --report_gbits -d mlx5_0 --use_cuda=0 --use_cuda_dmabuf
Reading symbols from ./ib_send_bw...
(gdb) run
Starting program: /home/vmware/perftest-25.01.0/ib_send_bw -a -q 4 --report_gbits -d mlx5_0 --use_cuda=0 --use_cuda_dmabuf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
WARNING: BW peak won't be measured in this run.
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0

Waiting for client to connect... *
initializing CUDA
[New Thread 0x7ffff359d640 (LWP 81912)]
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 19:00
CUDA device 1: PCIe address is 3B:00
CUDA device 2: PCIe address is 4C:00
CUDA device 3: PCIe address is 5D:00
CUDA device 4: PCIe address is 9B:00
CUDA device 5: PCIe address is BB:00
CUDA device 6: PCIe address is CB:00
CUDA device 7: PCIe address is DB:00

Picking device No. 0
[pid = 81871, dev = 0] device name = [NVIDIA H100 80GB HBM3]
creating CUDA Ctx
[New Thread 0x7ffde1339640 (LWP 81939)]
making it the current CUDA Ctx
CUDA device integrated: 0
cuMemAlloc() of a 67108864 bytes GPU buffer
allocated GPU buffer address at 00007ffdb2000000 pointer=0x7ffdb2000000
using DMA-BUF for GPU buffer address at 0x7ffdb2000000 aligned at 0x7ffdb2000000 with aligned size 67108864
Calling ibv_reg_dmabuf_mr(offset=0, size=67108864, addr=0x7ffdb2000000, fd=67) for QP #0
Send BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 4 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
RX depth : 512
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
local address: LID 0x09 QPN 0x0067 PSN 0xbd0ea
local address: LID 0x09 QPN 0x0068 PSN 0x39998c
local address: LID 0x09 QPN 0x0069 PSN 0xfba9e6
local address: LID 0x09 QPN 0x006a PSN 0xfa9a3d
remote address: LID 0x08 QPN 0x0071 PSN 0x92add
remote address: LID 0x08 QPN 0x0072 PSN 0xed3803
remote address: LID 0x08 QPN 0x0073 PSN 0x890c91
remote address: LID 0x08 QPN 0x0074 PSN 0x773f0c
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
Thread 1 "ib_send_bw" received signal SIGSEGV, Segmentation fault.
__memmove_avx512_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:373
373 ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
(gdb) bt
#0 __memmove_avx512_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:373
#1 0x00007ffff7e82b06 in ?? () from /lib/x86_64-linux-gnu/libmlx5.so.1
#2 0x00007ffff7e5c34e in ?? () from /lib/x86_64-linux-gnu/libmlx5.so.1
#3 0x0000555555579ce6 in ibv_poll_cq (wc=0x5555555ddfa0, num_entries=16, cq=<optimized out>) at /usr/include/infiniband/verbs.h:2927
#4 run_iter_bw_server (ctx=ctx@entry=0x7fffffffcd80, user_param=user_param@entry=0x7fffffffcfc0) at src/perftest_resources.c:3832
#5 0x000055555555c4e3 in main (argc=<optimized out>, argv=<optimized out>) at src/send_bw.c:458
(gdb) frame 3
#3 0x0000555555579ce6 in ibv_poll_cq (wc=0x5555555ddfa0, num_entries=16, cq=<optimized out>) at /usr/include/infiniband/verbs.h:2927
2927 return cq->context->ops.poll_cq(cq, num_entries, wc);
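
For readers following the backtrace: frame #4 is perftest's server-side receive loop, and frame #3 is the static inline ibv_poll_cq in verbs.h, which dispatches into the mlx5 provider; the provider then copies completed CQEs into the caller's ibv_wc array, which is the memmove in frame #0. A minimal sketch of that loop shape (illustrative only; POLL_BATCH is a made-up name, and this is not perftest's exact code):

    #include <infiniband/verbs.h>

    #define POLL_BATCH 16   /* matches num_entries=16 in frame #3 */

    static int drain_cq(struct ibv_cq *cq)
    {
        struct ibv_wc wc[POLL_BATCH];
        int ne, i;

        do {
            /* verbs.h:2927 dispatches into the provider:
             *   cq->context->ops.poll_cq(cq, num_entries, wc);
             * libmlx5 copies completed CQEs into wc[] — that copy is the
             * memmove in frame #0 where the crash fires. */
            ne = ibv_poll_cq(cq, POLL_BATCH, wc);
            for (i = 0; i < ne; i++)
                if (wc[i].status != IBV_WC_SUCCESS)
                    return -1;   /* completion with error */
        } while (ne > 0);

        return ne < 0 ? -1 : 0;
    }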

GDB shows the fault inside libmlx5.so, reached through cq->context->ops.poll_cq, so I suspect the root cause lies in MLNX_OFED.
Could you suggest a specific, known-working combination of datacenter driver, MLNX_OFED / DOCA-OFED, CUDA, NCCL, gdrcopy, and perftest versions that makes GDR validation work on Ubuntu? For context, a sketch of the DMA-BUF registration path being exercised follows.
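
The --use_cuda_dmabuf run follows roughly the path below: a minimal sketch using the CUDA >= 11.7 driver API and rdma-core's ibv_reg_dmabuf_mr, simplified rather than perftest's exact code. Without the flag, perftest instead registers the GPU VA with plain ibv_reg_mr and relies on the nvidia_peermem kernel module to pin the pages:

    #include <cuda.h>
    #include <infiniband/verbs.h>
    #include <stdint.h>

    /* Allocate a GPU buffer, export it as a dma-buf fd (requires the open
     * kernel modules), and register it with the HCA — mirroring the
     * "Calling ibv_reg_dmabuf_mr(offset=0, size=..., addr=..., fd=...)"
     * line in the log above. */
    static struct ibv_mr *reg_cuda_dmabuf(struct ibv_pd *pd, size_t size)
    {
        CUdeviceptr dptr;
        int fd = -1;

        if (cuMemAlloc(&dptr, size) != CUDA_SUCCESS)
            return NULL;

        if (cuMemGetHandleForAddressRange(&fd, dptr, size,
                CU_MEM_RANGE_HANDLE_TYPE_DMA_BUF_FD, 0) != CUDA_SUCCESS)
            return NULL;

        /* offset 0 into the dma-buf, iova set to the GPU VA */
        return ibv_reg_dmabuf_mr(pd, 0, size, (uint64_t)dptr, fd,
                                 IBV_ACCESS_LOCAL_WRITE |
                                 IBV_ACCESS_REMOTE_READ |
                                 IBV_ACCESS_REMOTE_WRITE);
    }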
