NCCL Errors on _init_with_tcp_store #9430

### Description

I am struggling to get distributed CuPy (`cupyx.distributed`) working on one of my systems. It works on one machine (a 2x A6000 system), but I cannot get it to work on another (2x GH200). I'm not sure whether this is an architecture difference, but I get the following error:

```bash
Traceback (most recent call last):
  File "/u/priyamm2/MyTorch/tests/all_reduce_test_wo_launch.py", line 27, in <module>
    main(args.rank, args.world_size)
  File "/u/priyamm2/MyTorch/tests/all_reduce_test_wo_launch.py", line 8, in main
    comm = NCCLBackend(n_devices=world_size, rank=rank, host="127.0.0.1", port=13333)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/u/priyamm2/.conda/envs/sidetorch/lib/python3.12/site-packages/cupyx/distributed/_nccl_comm.py", line 82, in __init__
    self._init_with_tcp_store(n_devices, rank, host, port)
  File "/u/priyamm2/.conda/envs/sidetorch/lib/python3.12/site-packages/cupyx/distributed/_nccl_comm.py", line 105, in _init_with_tcp_store
    shifted_nccl_id = bytes([b + 128 for b in nccl_id])
```
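To narrow this down, here is a minimal sketch of what the failing line does, in plain Python with no CuPy involved. The traceback above is cut off before the exception message, so this is only a guess at the cause: I am assuming that the elements of `nccl_id` come back as unsigned values (0..255) on aarch64, where C `char` is unsigned by default, instead of the signed values (-128..127) I would expect on x86_64. The helper `shift_id` and the sample tuples are hypothetical and are not part of CuPy.

```python
def shift_id(nccl_id):
    """Apply the same expression as the failing line: add 128 to each element."""
    return bytes([b + 128 for b in nccl_id])


# Signed inputs (-128..127): every shifted value fits into a byte.
signed_id = (-128, -1, 0, 127)
print(shift_id(signed_id))

# Unsigned inputs (0..255): ASSUMPTION about what an aarch64 build may return.
# Any element >= 128 is shifted past 255, which bytes() rejects.
unsigned_id = (0, 127, 128, 255)
try:
    shift_id(unsigned_id)
except ValueError as exc:
    print(f"ValueError: {exc}")
```

If the real `nccl_id` on the GH200 machine contains values of 128 or more, that would be consistent with the failure happening on this line, but I have not confirmed it.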

Here is the system information:

```bash
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GH200 120GB             On  |   00000019:01:00.0 Off |                    0 |
| N/A   23C    P0            140W /  900W |     677MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GH200 120GB             On  |   00000029:01:00.0 Off |                    0 |
| N/A   22C    P0            122W /  900W |     675MiB /  97871MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    311576      C   python                                        554MiB |
|    1   N/A  N/A    311577      C   python                                        554MiB |
+-----------------------------------------------------------------------------------------+
```


### To Reproduce

```py
import cupy as cp
from cupyx.distributed import NCCLBackend
import argparse

def main(rank, world_size):
    cp.cuda.Device(rank).use()

    comm = NCCLBackend(n_devices=world_size, rank=rank, host="127.0.0.1", port=13333)

    # Each rank creates its own tensor
    x = cp.ones(4, dtype=cp.float32) * (rank + 1)
    y = cp.zeros_like(x)

    print(f"[Rank {rank}] Before allreduce: {x}")

    # out_array must be passed in (y will store the result)
    comm.all_reduce(x, y, op="sum")

    print(f"[Rank {rank}] After allreduce: {y}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--rank", type=int, required=True)
    parser.add_argument("--world_size", type=int, required=True)
    args = parser.parse_args()

    main(args.rank, args.world_size)
```

And I run it with:

```bash
CUPYX_DISTRIBUTED_HOST=127.0.0.1 CUPYX_DISTRIBUTED_PORT=13333 \
python tests/all_reduce_test_wo_launch.py --rank 0 --world_size 2 &

CUPYX_DISTRIBUTED_HOST=127.0.0.1 CUPYX_DISTRIBUTED_PORT=13333 \
python tests/all_reduce_test_wo_launch.py --rank 1 --world_size 2
```
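For convenience, the same two-process launch can be sketched as a small Python launcher that starts one process per rank and collects the exit codes. This is only an illustration: `build_commands`, `build_env`, and `launch` are hypothetical helper names, and the script path, host, and port are the ones from the commands above.

```python
import os
import subprocess
import sys


def build_commands(script, world_size):
    """Return one command line per rank, mirroring the shell commands above."""
    return [
        [sys.executable, script, "--rank", str(rank), "--world_size", str(world_size)]
        for rank in range(world_size)
    ]


def build_env(host="127.0.0.1", port=13333):
    """Copy the current environment and set the same variables as the shell commands."""
    env = dict(os.environ)
    env["CUPYX_DISTRIBUTED_HOST"] = host
    env["CUPYX_DISTRIBUTED_PORT"] = str(port)
    return env


def launch(script, world_size):
    """Start all ranks concurrently, then wait for each and return their exit codes."""
    env = build_env()
    procs = [subprocess.Popen(cmd, env=env) for cmd in build_commands(script, world_size)]
    return [proc.wait() for proc in procs]


# Example call (not executed here):
# exit_codes = launch("tests/all_reduce_test_wo_launch.py", world_size=2)
```

A non-zero entry in the returned list shows which rank failed.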

Again, this works fine on my other system, but it does not work here.




### Installation

Conda-Forge (`conda install ...`)

### Environment

```
OS                           : Linux-5.14.21-150500.55.65_13.0.73-cray_shasta_c_64k-aarch64-with-glibc2.31
Python Version               : 3.12.12
CuPy Version                 : 13.6.0
CuPy Platform                : NVIDIA CUDA
NumPy Version                : 2.3.3
SciPy Version                : None
Cython Build Version         : 3.1.3
Cython Runtime Version       : None
CUDA Root                    : /u/priyamm2/.conda/envs/sidetorch
nvcc PATH                    : /u/priyamm2/.conda/envs/sidetorch/bin/nvcc
CUDA Build Version           : 12090
CUDA Driver Version          : 12040
CUDA Runtime Version         : 12090 (linked to CuPy) / 12030 (locally installed)
CUDA Extra Include Dirs      : ['/u/priyamm2/.conda/envs/sidetorch/targets/sbsa-linux/include', '/u/priyamm2/.conda/envs/sidetorch/include']
cuBLAS Version               : 120304
cuFFT Version                : 11012
cuRAND Version               : 10304
cuSOLVER Version             : (11, 5, 4)
cuSPARSE Version             : 12200
NVRTC Version                : (12, 3)
Thrust Version               : 200802
CUB Build Version            : 200802
Jitify Build Version         : <unknown>
cuDNN Build Version          : None
cuDNN Version                : None
NCCL Build Version           : 22707
NCCL Runtime Version         : 22707
cuTENSOR Version             : 20301
cuSPARSELt Build Version     : None
Device 0 Name                : NVIDIA GH200 120GB
Device 0 Compute Capability  : 90
Device 0 PCI Bus ID          : 0009:01:00.0
```


### Additional Information

_No response_
