
Segfault when testing GPU RDMA on ConnectX-7 IB on Ubuntu 22.04 #326

@qoofyk

Description


Hardware:
GPU: 8x H100-80G-SXM with 4x NVSwitch
Device Type: ConnectX-7

  • Description: NVIDIA ConnectX-7 Single Port InfiniBand NDR OSFP Adapter
  • FW: 28.41.1000
  • PXE: 3.7.0400
  • UEFI: 14.34.0012

Software Stacks:

  • Datacenter driver: 570.133.20 (open kernel modules)

  • CUDA toolkit: 12.8

  • MOFED: MLNX_OFED_LINUX-24.10-2.1.8.0-ubuntu22.04-x86_64

  • Ubuntu: 22.04.3 on bare metal

  • Kernel: 5.15.0-139-generic

  • gdrcopy: 2.5 (release)

  • UCX: 1.18.1

  • perftest: 25.01.0 (release)

  • ACS has been disabled in the OS (a scripted check across all PCIe bridges is sketched after this list):
    sudo lspci -vvv | grep ACSCtl
    ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-

  • ATS has been disabled on all 8x CX-7
for i in $(seq 0 7); do mlxconfig -d mlx5_$i query | grep ATS_ENABLED; done
    ATS_ENABLED False(0)
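
For completeness, here is a quick scripted check that no PCI bridge in the path still has ACS source validation enabled. This is a minimal sketch using only standard pciutils (the ::0604 filter selects PCI bridge class devices); adjust it to your topology as needed:

    # Flag any PCI bridge that still has ACS source validation enabled
    for bdf in $(lspci -d ::0604 | awk '{print $1}'); do
        sudo lspci -s "$bdf" -vvv | grep -q 'ACSCtl:.*SrcValid+' \
            && echo "ACS still enabled on $bdf"
    done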

Sanity Tests:

  1. gdrcopy sanity has passed (a minimal sketch of the API it exercises follows this list)
    $ gdrcopy_sanity
    Total: 36, Passed: 31, Failed: 0, Waived: 5
    List of waived tests:
    basic_v2_forcepci_cumemalloc
    basic_v2_forcepci_vmmalloc
    basic_with_tokens
    data_validation_mix_mappings_cumemalloc
    data_validation_v2_forcepci_cumemalloc

  2. perftest without CUDA works, reaching the expected 396 Gb/s at all message sizes

  3. perftest with CUDA segfaults in both cases:
    with --use_cuda_dmabuf
    without it (i.e., falling back to nvidia_peermem)
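
For reference, the gdrcopy_sanity cases above exercise roughly the pin/map/copy round trip below. This is a minimal sketch of the public gdrapi.h API with error handling trimmed, not the test's actual code:

    #include <cuda.h>
    #include <gdrapi.h>

    #define COPY_SIZE (64 * 1024)  /* one 64 KiB GPU page */

    int gdr_roundtrip(void)
    {
        CUdevice dev; CUcontext ctx; CUdeviceptr d_buf;
        char host[COPY_SIZE] = {0};
        void *bar_ptr;
        gdr_mh_t mh;

        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);

        /* Over-allocate so the pinned range can be GPU-page aligned,
         * as the gdrcopy tests themselves do. */
        cuMemAlloc(&d_buf, COPY_SIZE + GPU_PAGE_SIZE);
        CUdeviceptr aligned = (d_buf + GPU_PAGE_SIZE - 1) & GPU_PAGE_MASK;

        gdr_t g = gdr_open();
        gdr_pin_buffer(g, aligned, COPY_SIZE, 0, 0, &mh);  /* pin via gdrdrv */
        gdr_map(g, mh, &bar_ptr, COPY_SIZE);               /* map BAR1 window */

        gdr_copy_to_mapping(mh, bar_ptr, host, COPY_SIZE);   /* CPU -> GPU */
        gdr_copy_from_mapping(mh, host, bar_ptr, COPY_SIZE); /* GPU -> CPU */

        gdr_unmap(g, mh, bar_ptr, COPY_SIZE);
        gdr_unpin_buffer(g, mh);
        gdr_close(g);
        cuMemFree(d_buf);
        return 0;
    }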

$ gdb -q --args ./ib_send_bw -a -q 4 --report_gbits -d mlx5_0 --use_cuda=0 --use_cuda_dmabuf
Reading symbols from ./ib_send_bw...
(gdb) run
Starting program: /home/vmware/perftest-25.01.0/ib_send_bw -a -q 4 --report_gbits -d mlx5_0 --use_cuda=0 --use_cuda_dmabuf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
WARNING: BW peak won't be measured in this run.
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0

Waiting for client to connect... *
initializing CUDA
[New Thread 0x7ffff359d640 (LWP 81912)]
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 19:00
CUDA device 1: PCIe address is 3B:00
CUDA device 2: PCIe address is 4C:00
CUDA device 3: PCIe address is 5D:00
CUDA device 4: PCIe address is 9B:00
CUDA device 5: PCIe address is BB:00
CUDA device 6: PCIe address is CB:00
CUDA device 7: PCIe address is DB:00

Picking device No. 0
[pid = 81871, dev = 0] device name = [NVIDIA H100 80GB HBM3]
creating CUDA Ctx
[New Thread 0x7ffde1339640 (LWP 81939)]
making it the current CUDA Ctx
CUDA device integrated: 0
cuMemAlloc() of a 67108864 bytes GPU buffer
allocated GPU buffer address at 00007ffdb2000000 pointer=0x7ffdb2000000
using DMA-BUF for GPU buffer address at 0x7ffdb2000000 aligned at 0x7ffdb2000000 with aligned size 67108864
Calling ibv_reg_dmabuf_mr(offset=0, size=67108864, addr=0x7ffdb2000000, fd=67) for QP #0
Send BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 4 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
RX depth : 512
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
local address: LID 0x09 QPN 0x0067 PSN 0xbd0ea
local address: LID 0x09 QPN 0x0068 PSN 0x39998c
local address: LID 0x09 QPN 0x0069 PSN 0xfba9e6
local address: LID 0x09 QPN 0x006a PSN 0xfa9a3d
remote address: LID 0x08 QPN 0x0071 PSN 0x92add
remote address: LID 0x08 QPN 0x0072 PSN 0xed3803
remote address: LID 0x08 QPN 0x0073 PSN 0x890c91
remote address: LID 0x08 QPN 0x0074 PSN 0x773f0c
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
Thread 1 "ib_send_bw" received signal SIGSEGV, Segmentation fault.
__memmove_avx512_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:373
373 ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
(gdb) bt
#0 __memmove_avx512_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:373
#1 0x00007ffff7e82b06 in ?? () from /lib/x86_64-linux-gnu/libmlx5.so.1
#2 0x00007ffff7e5c34e in ?? () from /lib/x86_64-linux-gnu/libmlx5.so.1
#3 0x0000555555579ce6 in ibv_poll_cq (wc=0x5555555ddfa0, num_entries=16, cq=<optimized out>) at /usr/include/infiniband/verbs.h:2927
#4 run_iter_bw_server (ctx=ctx@entry=0x7fffffffcd80, user_param=user_param@entry=0x7fffffffcfc0) at src/perftest_resources.c:3832
#5 0x000055555555c4e3 in main (argc=<optimized out>, argv=<optimized out>) at src/send_bw.c:458
(gdb) frame 3
#3 0x0000555555579ce6 in ibv_poll_cq (wc=0x5555555ddfa0, num_entries=16, cq=<optimized out>) at /usr/include/infiniband/verbs.h:2927
2927 return cq->context->ops.poll_cq(cq, num_entries, wc);
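
For readers following the backtrace: frame #4 is perftest's server-side receive loop, and frame #3 is the static inline ibv_poll_cq in verbs.h, which dispatches into the mlx5 provider; the provider then copies completed CQEs into the caller's ibv_wc array, which is the memmove in frame #0. A minimal sketch of that loop shape (illustrative only; POLL_BATCH is a made-up name, and this is not perftest's exact code):

    #include <infiniband/verbs.h>

    #define POLL_BATCH 16   /* matches num_entries=16 in frame #3 */

    static int drain_cq(struct ibv_cq *cq)
    {
        struct ibv_wc wc[POLL_BATCH];
        int ne, i;

        do {
            /* verbs.h:2927 dispatches into the provider:
             *   cq->context->ops.poll_cq(cq, num_entries, wc);
             * libmlx5 copies completed CQEs into wc[] — that copy is the
             * memmove in frame #0 where the crash fires. */
            ne = ibv_poll_cq(cq, POLL_BATCH, wc);
            for (i = 0; i < ne; i++)
                if (wc[i].status != IBV_WC_SUCCESS)
                    return -1;   /* completion with error */
        } while (ne > 0);

        return ne < 0 ? -1 : 0;
    }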

GDB shows the fault inside libmlx5.so, reached through cq->context->ops.poll_cq, so I suspect the root cause lies in MLNX_OFED.
Could you suggest a specific, known-working combination of datacenter driver, MLNX_OFED / DOCA-OFED, CUDA, NCCL, gdrcopy, and perftest versions that makes GDR validation work on Ubuntu? For context, a sketch of the DMA-BUF registration path being exercised follows.
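
The --use_cuda_dmabuf run follows roughly the path below: a minimal sketch using the CUDA >= 11.7 driver API and rdma-core's ibv_reg_dmabuf_mr, simplified rather than perftest's exact code. Without the flag, perftest instead registers the GPU VA with plain ibv_reg_mr and relies on the nvidia_peermem kernel module to pin the pages:

    #include <cuda.h>
    #include <infiniband/verbs.h>
    #include <stdint.h>

    /* Allocate a GPU buffer, export it as a dma-buf fd (requires the open
     * kernel modules), and register it with the HCA — mirroring the
     * "Calling ibv_reg_dmabuf_mr(offset=0, size=..., addr=..., fd=...)"
     * line in the log above. */
    static struct ibv_mr *reg_cuda_dmabuf(struct ibv_pd *pd, size_t size)
    {
        CUdeviceptr dptr;
        int fd = -1;

        if (cuMemAlloc(&dptr, size) != CUDA_SUCCESS)
            return NULL;

        if (cuMemGetHandleForAddressRange(&fd, dptr, size,
                CU_MEM_RANGE_HANDLE_TYPE_DMA_BUF_FD, 0) != CUDA_SUCCESS)
            return NULL;

        /* offset 0 into the dma-buf, iova set to the GPU VA */
        return ibv_reg_dmabuf_mr(pd, 0, size, (uint64_t)dptr, fd,
                                 IBV_ACCESS_LOCAL_WRITE |
                                 IBV_ACCESS_REMOTE_READ |
                                 IBV_ACCESS_REMOTE_WRITE);
    }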
