Description
I am struggling to get distributed CuPy to work on one of my systems. It works on one machine (a 2x A6000 system) but not on another (2x GH200). I'm not sure whether it's an architecture difference, but I get this error:
Traceback (most recent call last):
  File "/u/priyamm2/MyTorch/tests/all_reduce_test_wo_launch.py", line 27, in <module>
    main(args.rank, args.world_size)
  File "/u/priyamm2/MyTorch/tests/all_reduce_test_wo_launch.py", line 8, in main
    comm = NCCLBackend(n_devices=world_size, rank=rank, host="127.0.0.1", port=13333)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/u/priyamm2/.conda/envs/sidetorch/lib/python3.12/site-packages/cupyx/distributed/_nccl_comm.py", line 82, in __init__
    self._init_with_tcp_store(n_devices, rank, host, port)
  File "/u/priyamm2/.conda/envs/sidetorch/lib/python3.12/site-packages/cupyx/distributed/_nccl_comm.py", line 105, in _init_with_tcp_store
    shifted_nccl_id = bytes([b + 128 for b in nccl_id])
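The traceback is cut off before the exception type, so the following is only a guess at what goes wrong on the last line. It is a plain-Python sketch, not CuPy's code: `shift_nccl_id` is a hypothetical name that copies the expression from the traceback, and the two input tuples are made-up values. The assumption being illustrated is that the NCCL ID elements might arrive as signed values (-128..127) on one platform and as unsigned values (0..255) on another, in which case the same expression would only succeed for the signed case, because `bytes()` rejects anything outside 0..255.

```python
def shift_nccl_id(nccl_id):
    """Copy of the failing expression from the traceback (hypothetical wrapper)."""
    return bytes([b + 128 for b in nccl_id])


# Assumed signed-style values (-128..127): every shifted value fits in one byte.
signed_style = (-128, -1, 0, 127)
print(shift_nccl_id(signed_style))

# Assumed unsigned-style values (0..255): anything above 127 ends up above 255.
unsigned_style = (0, 127, 128, 255)
try:
    shift_nccl_id(unsigned_style)
except ValueError as exc:
    print(f"ValueError: {exc}")
```

If this is what happens, printing `min(nccl_id)` and `max(nccl_id)` on both machines just before that line would confirm or rule it out.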
Here is the system information:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GH200 120GB             On  |   00000019:01:00.0 Off |                    0 |
| N/A   23C    P0            140W /  900W |     677MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GH200 120GB             On  |   00000029:01:00.0 Off |                    0 |
| N/A   22C    P0            122W /  900W |     675MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    311576      C   python                                        554MiB |
|    1   N/A  N/A    311577      C   python                                        554MiB |
+-----------------------------------------------------------------------------------------+
To Reproduce
import argparse

import cupy as cp
from cupyx.distributed import NCCLBackend


def main(rank, world_size):
    # Bind this process to the GPU matching its rank
    cp.cuda.Device(rank).use()
    comm = NCCLBackend(n_devices=world_size, rank=rank, host="127.0.0.1", port=13333)

    # Each rank creates its own tensor
    x = cp.ones(4, dtype=cp.float32) * (rank + 1)
    y = cp.zeros_like(x)
    print(f"[Rank {rank}] Before allreduce: {x}")

    # out_array must be passed in (y will store the result)
    comm.all_reduce(x, y, op="sum")
    print(f"[Rank {rank}] After allreduce: {y}")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--rank", type=int, required=True)
    parser.add_argument("--world_size", type=int, required=True)
    args = parser.parse_args()
    main(args.rank, args.world_size)
And I run it with:
CUPYX_DISTRIBUTED_HOST=127.0.0.1 CUPYX_DISTRIBUTED_PORT=13333 \
python tests/all_reduce_test_wo_launch.py --rank 0 --world_size 2 &
CUPYX_DISTRIBUTED_HOST=127.0.0.1 CUPYX_DISTRIBUTED_PORT=13333 \
python tests/all_reduce_test_wo_launch.py --rank 1 --world_size 2
Again, this works fine on my other system, but it won't work here.
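For anyone who wants to reproduce this without backgrounding a shell job, the two commands can also be started from one Python launcher. This is only a sketch: `build_commands`, `launch` and the `timeout` parameter are hypothetical names of mine, and it assumes it is run from the repository root so that the relative script path resolves. The host, port and environment variable names are the same ones used in the shell commands.

```python
import os
import subprocess
import sys

HOST = "127.0.0.1"  # same values as in the shell commands
PORT = "13333"
SCRIPT = "tests/all_reduce_test_wo_launch.py"


def build_commands(world_size, script=SCRIPT):
    """Return one argv list per rank (hypothetical helper)."""
    return [
        [sys.executable, script, "--rank", str(rank), "--world_size", str(world_size)]
        for rank in range(world_size)
    ]


def launch(world_size=2, timeout=120):
    """Start every rank, wait for each one, and return the exit codes."""
    env = dict(os.environ, CUPYX_DISTRIBUTED_HOST=HOST, CUPYX_DISTRIBUTED_PORT=PORT)
    procs = [subprocess.Popen(cmd, env=env) for cmd in build_commands(world_size)]
    codes = []
    for proc in procs:
        try:
            codes.append(proc.wait(timeout=timeout))
        except subprocess.TimeoutExpired:
            # A rank that hangs during setup is killed so the launcher can return.
            proc.kill()
            codes.append(proc.wait())
    return codes
```

Calling `launch(2)` starts both ranks and returns their exit codes in rank order, which makes it easier to see which rank failed.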
Installation
Conda-Forge (conda install ...)
Environment
OS : Linux-5.14.21-150500.55.65_13.0.73-cray_shasta_c_64k-aarch64-with-glibc2.31
Python Version : 3.12.12
CuPy Version : 13.6.0
CuPy Platform : NVIDIA CUDA
NumPy Version : 2.3.3
SciPy Version : None
Cython Build Version : 3.1.3
Cython Runtime Version : None
CUDA Root : /u/priyamm2/.conda/envs/sidetorch
nvcc PATH : /u/priyamm2/.conda/envs/sidetorch/bin/nvcc
CUDA Build Version : 12090
CUDA Driver Version : 12040
CUDA Runtime Version : 12090 (linked to CuPy) / 12030 (locally installed)
CUDA Extra Include Dirs : ['/u/priyamm2/.conda/envs/sidetorch/targets/sbsa-linux/include', '/u/priyamm2/.conda/envs/sidetorch/include']
cuBLAS Version : 120304
cuFFT Version : 11012
cuRAND Version : 10304
cuSOLVER Version : (11, 5, 4)
cuSPARSE Version : 12200
NVRTC Version : (12, 3)
Thrust Version : 200802
CUB Build Version : 200802
Jitify Build Version : <unknown>
cuDNN Build Version : None
cuDNN Version : None
NCCL Build Version : 22707
NCCL Runtime Version : 22707
cuTENSOR Version : 20301
cuSPARSELt Build Version : None
Device 0 Name : NVIDIA GH200 120GB
Device 0 Compute Capability : 90
Device 0 PCI Bus ID : 0009:01:00.0
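The version fields above are packed integers, which makes them hard to compare at a glance. The helper below is a small sketch that decodes them, assuming the usual encodings (CUDA: 1000 * major + 10 * minor; NCCL 2.9 and later: 10000 * major + 100 * minor + patch). The `decode_*` names are hypothetical, and I have not established whether any version gap is related to the error.

```python
def decode_cuda_version(code):
    """Decode a CUDA version integer, assumed to be 1000 * major + 10 * minor."""
    return code // 1000, (code % 1000) // 10


def decode_nccl_version(code):
    """Decode an NCCL version integer, assumed to be 10000 * major + 100 * minor + patch."""
    return code // 10000, (code % 10000) // 100, code % 100


# Values copied from the environment listing.
cuda_fields = [
    ("CUDA Build Version", 12090),
    ("CUDA Driver Version", 12040),
    ("CUDA Runtime Version (locally installed)", 12030),
]
for label, code in cuda_fields:
    major, minor = decode_cuda_version(code)
    print(f"{label}: {major}.{minor}")

print("NCCL Runtime Version:", ".".join(str(part) for part in decode_nccl_version(22707)))
```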
Additional Information
No response