
CUBLAS_STATUS_EXECUTION_FAILED error on torch >= 1.8.0 and CUDA 11.1 #54975

@msbaines


🐛 Bug

When running the fairscale unit tests with torch >= 1.8.0 and CUDA 11.1, I get many cuBLAS failures. This did not happen with 1.7.1. I also tried the March 30 nightly build of torch 1.9.0 and see the same error.

attn_output_weights = torch.bmm(q, k.transpose(1, 2))
E RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)
https://app.circleci.com/pipelines/github/facebookresearch/fairscale/2192/workflows/9d16d5d4-d104-4aee-971e-4eb329da80c8/jobs/14498

This does not occur with torch 1.7.1 and CUDA 11.1.
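To help isolate the failure from the fairscale test suite, here is a minimal sketch that runs the same `torch.bmm(q, k.transpose(1, 2))` pattern on the GPU and compares it with the CPU result. The function name `bmm_matches_cpu`, the tensor shapes, and the tolerance are assumptions chosen for illustration; they are not taken from the failing tests, and this script has not been confirmed to trigger the error.

```python
# Hypothetical standalone check; shapes and tolerance are assumptions,
# not values taken from the fairscale tests.
try:
    import torch
except ImportError:  # allow the script to report cleanly when torch is absent
    torch = None


def bmm_matches_cpu(batch=8, seq=16, dim=32, atol=1e-2):
    """Run torch.bmm(q, k.transpose(1, 2)) on the GPU and compare with the CPU.

    Returns None when torch or a CUDA device is unavailable, otherwise a bool.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    torch.manual_seed(0)
    q = torch.randn(batch, seq, dim)
    k = torch.randn(batch, seq, dim)
    expected = torch.bmm(q, k.transpose(1, 2))
    # Same call pattern as the line in the traceback, with float32 CUDA inputs.
    actual = torch.bmm(q.cuda(), k.cuda().transpose(1, 2))
    # CUDA work is asynchronous; synchronizing surfaces any pending error here.
    torch.cuda.synchronize()
    return torch.allclose(actual.cpu(), expected, atol=atol)


if __name__ == "__main__":
    if torch is not None:
        print("torch", torch.__version__, "built with CUDA", torch.version.cuda)
    print("bmm matches CPU:", bmm_matches_cpu())
```

If this passes on the affected machine, the failure may depend on the specific shapes or on earlier CUDA work in the test run. Setting the environment variable `CUDA_LAUNCH_BLOCKING=1` when running the tests can help attribute the error to the call that actually caused it.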

To Reproduce

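Steps to reproduce the behavior: run the fairscale unit tests with torch >= 1.8.0 and CUDA 11.1 (see the CircleCI job linked above).

Expected behavior

The tests pass, as they do with torch 1.7.1 and CUDA 11.1.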

Environment

Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py

PyTorch version: 1.9.0.dev20210330+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 16.04.7 LTS (x86_64)
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Clang version: Could not collect
CMake version: version 3.5.1

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: Tesla M60
GPU 1: Tesla M60
GPU 2: Tesla M60
GPU 3: Tesla M60

Nvidia driver version: 455.32.00
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.2
[pip3] torch==1.9.0.dev20210330+cu111
[pip3] torchtext==0.6.0
[pip3] torchvision==0.10.0.dev20210330+cu111
[conda] Could not collect


Additional context

cc @ngimel


Labels: module: cuda, needs reproduction, triaged
