Description
🐛 Bug
When running the fairscale unit tests with torch >= 1.8.0 and CUDA 11.1, I get many CUBLAS failures. This did not happen with 1.7.1. I've also tried the March 30 nightly (torch 1.9.0) and see the same error.
attn_output_weights = torch.bmm(q, k.transpose(1, 2))
E RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)
https://app.circleci.com/pipelines/github/facebookresearch/fairscale/2192/workflows/9d16d5d4-d104-4aee-971e-4eb329da80c8/jobs/14498
This does not occur with torch 1.7.1 and cuda 11.1
To Reproduce
Steps to reproduce the behavior:
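The failing line in the traceback is a single float32 `torch.bmm` call, so a standalone check along the following lines may help isolate it from the fairscale test suite. This is only a minimal sketch: the function name `run_bmm_check` and the tensor shapes are hypothetical (the report does not give the real shapes), and it has not been confirmed to trigger the error.

```python
import importlib.util


def run_bmm_check(batch=8, seq_len=16, head_dim=32):
    """Run the batched matmul from the traceback.

    Returns the output shape as a tuple, or None if torch is not installed.
    The default shapes are assumptions chosen for illustration only.
    """
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    # Use the GPU when one is visible; otherwise fall back to CPU so the
    # script still runs (the CPU path does not go through cuBLAS).
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # float32 inputs, matching the single-precision (Sgemm) cuBLAS routine
    # named in the error message.
    q = torch.randn(batch, seq_len, head_dim, device=device)
    k = torch.randn(batch, seq_len, head_dim, device=device)

    # Same expression as the failing line in the traceback.
    attn_output_weights = torch.bmm(q, k.transpose(1, 2))

    if device == "cuda":
        # CUDA kernels run asynchronously; synchronizing makes any
        # execution error surface here rather than at a later call.
        torch.cuda.synchronize()

    return tuple(attn_output_weights.shape)


if __name__ == "__main__":
    print(run_bmm_check())
```

Running the script with `CUDA_LAUNCH_BLOCKING=1` set in the environment should make the reported stack trace point at the actual failing call.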
Expected behavior
Environment
Please copy and paste the output from our environment collection script (or fill out the checklist below manually).
You can get the script and run it with:
wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
PyTorch version: 1.9.0.dev20210330+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 16.04.7 LTS (x86_64)
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Clang version: Could not collect
CMake version: version 3.5.1
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla M60
GPU 1: Tesla M60
GPU 2: Tesla M60
GPU 3: Tesla M60
Nvidia driver version: 455.32.00
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.20.2
[pip3] torch==1.9.0.dev20210330+cu111
[pip3] torchtext==0.6.0
[pip3] torchvision==0.10.0.dev20210330+cu111
[conda] Could not collect
- PyTorch Version (e.g., 1.0):
- OS (e.g., Linux):
- How you installed PyTorch (conda, pip, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information:
Additional context
cc @ngimel