TL;DR:
Set the environment variable NCCL_ALGO=Tree if you encounter accuracy problems with NCCL on A800 hardware.
Hello,
We found a bug in all_reduce on A800 GPUs when NCCL_ALGO uses Ring, and we can provide minimal reproduction steps.
We conducted comparative experiments on the A100 and A800 platforms and found that the model converges on the A100 platform but does not converge on the A800 platform.
The minimal reproduction steps are as follows:
codebase: https://github.com/karpathy/nanoGPT
Reproduction steps:
1. Prepare one node with 8x A800 GPUs and one node with 8x A100 GPUs, and set the same seed=1024 on both.
2. torchrun --nnodes=1 --nproc_per_node=8 train.py config/train_shakespeare_char.py
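In addition to the nanoGPT run, a minimal standalone sketch of the same comparison may help isolate the collective itself. This is our own simplification, not part of the nanoGPT codebase; the file name allreduce_check.py, the tensor size, and the seeding scheme are placeholders. Launch it the same way on each node (torchrun --nnodes=1 --nproc_per_node=8 allreduce_check.py) and diff the printed checksums between the A100 and A800 nodes and across NCCL_ALGO settings:

```python
# allreduce_check.py -- hypothetical minimal all_reduce determinism check.
import os

import torch
import torch.distributed as dist


def main():
    # torchrun provides the rendezvous info via environment variables.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Deterministic, rank-dependent seed: inputs are identical across runs
    # but different across ranks, so the reduction is non-trivial.
    torch.manual_seed(1024 + rank)
    x = torch.randn(16 * 1024 * 1024, device="cuda", dtype=torch.float32)

    # The collective under suspicion.
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    # Print a checksum on rank 0; compare it between the A100 and the A800
    # node, and between NCCL_ALGO=Ring and NCCL_ALGO=Tree runs.
    if rank == 0:
        algo = os.environ.get("NCCL_ALGO", "<unset>")
        print(f"NCCL_ALGO={algo} checksum={x.double().sum().item():.10f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```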
The loss on A800 should match that on A100. With the gloo backend we obtain the same loss, but with the nccl backend the loss outputs are inconsistent.
Furthermore, we found that if NCCL_ALGO=Tree is set, the loss stays consistent. However, if NCCL_ALGO=Ring is set, or NCCL_ALGO is left unset, the loss is not consistent between A100 and A800.
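One way to apply the workaround is simply to prefix the launch command from step 2 with the variable, for example:
NCCL_ALGO=Tree torchrun --nnodes=1 --nproc_per_node=8 train.py config/train_shakespeare_char.py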
Additionally, when we use 8 nodes connected over InfiniBand, with one GPU per node and NCCL_ALGO=Ring set, the loss stays consistent.
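(A multi-node launch of this kind could look like the command below; the master address, port, and per-node rank are placeholders, not the values we actually used.)
NCCL_ALGO=Ring torchrun --nnodes=8 --nproc_per_node=1 --node_rank=<node-rank> --master_addr=<master-ip> --master_port=29500 train.py config/train_shakespeare_char.py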
Therefore, we suspect there may be a bug in the current all_reduce implementation when NCCL_ALGO=Ring on the A800 platform, and that this bug might somehow be related to the number of NVLink channels.
Note: The A800 is a restricted version of the A100 GPU. The only difference between the A100 and the A800 is the number of NVLink channels: the A100 has 24 channels; the A800 has 16.
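(The per-GPU NVLink link status on each node can be inspected with nvidia-smi, e.g. nvidia-smi nvlink --status, to confirm the difference.)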