Running CRF layer raises CUDA error: an illegal memory access was encountered #2098

@masadcv

Describe the bug
Running CRF layer raises "CUDA error: an illegal memory access was encountered"
I am running CRF on volumetric probability maps with 2 classes. Typical input shapes in my case are (1, 2, 512, 512, 96) and (1, 1, 512, 512, 96). I have also tried the smaller shapes (1, 2, 128, 128, 96) and (1, 1, 128, 128, 96); these fail with the same error after a few calls (see the example below).

To Reproduce
Steps to reproduce the behavior:

  1. Install using:
BUILD_MONAI=1 pip install git+https://github.com/Project-MONAI/MONAI#egg=monai
  2. Run the following code:
import torch
import monai

print('Pytorch version: {}'.format(torch.__version__))
print('MONAI version: {}\n\n'.format(monai.__version__))

from monai.networks.blocks import CRF

device = torch.device('cuda')

print('Total GPU memory: {}mb'.format(torch.cuda.get_device_properties(device).total_memory/(1024**2)))
sh = 512
in1 = torch.nn.functional.softmax(torch.rand((1, 2, sh, sh, 96)).cuda(), dim=1) # make probability
in2 = torch.rand((1, 1, sh, sh, 96)).cuda()
print('Allocated memory: {}mb'.format(torch.cuda.memory_allocated(device=device)/(1024**2))) # Allocated memory: 396.0mb
print('Input shapes: {} | {}'.format(tuple(in1.shape), tuple(in2.shape)))
out = CRF(3.0, 1.0, 5.0, 0.5, 5.0, 1, 5)(in1, in2) # RuntimeError: CUDA error: an illegal memory access was encountered

Alternatively, run with a smaller input shape multiple times (reusing the imports and device from the code above):

print('Total GPU memory: {}mb'.format(torch.cuda.get_device_properties(device).total_memory/(1024**2)))
ITR = 50
for itr in range(ITR):
  sh = 128
  in1 = torch.nn.functional.softmax(torch.rand((1, 2, sh, sh, 96)).cuda(), dim=1) # make probability
  in2 = torch.rand((1, 1, sh, sh, 96)).cuda()

  print('Iteration: {}/{} | Allocated memory: {}mb'.format(itr, ITR-1, torch.cuda.memory_allocated(device=device)/(1024**2))) # Allocated memory: 30.0mb
  print('Input shapes: {} | {}'.format(tuple(in1.shape), tuple(in2.shape)))
  out = CRF(3.0, 1.0, 5.0, 0.5, 5.0, 1, 5)(in1, in2) # After a few iterations: RuntimeError: CUDA error: an illegal memory access was encountered
  
  # double check to ensure memory is cleared
  del in1
  del in2

Both cases above produce the following error:

/usr/local/lib/python3.7/dist-packages/monai/networks/layers/filtering.py in forward(ctx, input, features, sigmas)
     91 
     92         ctx.save_for_backward(scaled_features)
---> 93         output_data = _C.phl_filter(input, scaled_features)
     94         return output_data
     95 

RuntimeError: CUDA error: an illegal memory access was encountered
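
CUDA kernels are launched asynchronously, so the Python line shown in a traceback like the one above is not necessarily where the fault occurred. A minimal sketch for narrowing this down, assuming the reproduction code is saved to a script, is to rerun it with the CUDA_LAUNCH_BLOCKING environment variable set so that each kernel launch completes before the next call. The helper names `blocking_env` and `run_repro` and the file name `repro_crf.py` are hypothetical and not part of MONAI or PyTorch.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    # CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so errors
    # are reported closer to the call that actually triggered them.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_repro(script_path):
    """Run a reproduction script in a fresh interpreter with the flag set."""
    # A fresh process is used because the variable has to be set before
    # the CUDA context is created.
    return subprocess.run(
        [sys.executable, script_path], env=blocking_env(), check=False
    )


# Example (hypothetical file name holding either snippet from this report):
# result = run_repro("repro_crf.py")
# print("exit code:", result.returncode)
```

This only changes where the error surfaces; it is a diagnostic aid and not a fix for the underlying illegal access.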

I have put the code along with error messages into this colab notebook:
https://colab.research.google.com/drive/18lSBxYjZgkcPM96bponiyKutwWoUKTUu?usp=sharing

Expected behavior
It should work with the given input shapes. This does not look like a GPU out-of-memory issue, as both cases in the example above allocate less than 500 MB on a GPU with 11.2 GB of total memory.
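
As a rough cross-check of the memory claim, the footprint of the input tensors can be estimated from their shapes alone. The sketch below assumes float32 elements (the default dtype of torch.rand) and uses a hypothetical helper, `tensor_mib`; it needs no GPU and does not account for anything the CRF layer allocates internally.

```python
BYTES_PER_FLOAT32 = 4  # assumption: inputs use torch's default float32 dtype


def tensor_mib(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Estimate the size in MiB of a dense tensor with the given shape."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element / 1024**2


# Shapes taken from the two reproduction cases in this report.
cases = {
    "large": [(1, 2, 512, 512, 96), (1, 1, 512, 512, 96)],
    "small": [(1, 2, 128, 128, 96), (1, 1, 128, 128, 96)],
}

for name, shapes in cases.items():
    sizes = [tensor_mib(s) for s in shapes]
    print("{}: {} MiB per tensor, {} MiB total".format(name, sizes, sum(sizes)))
```

Under these assumptions the two inputs come to 288 MiB for the large case and 18 MiB for the small case. The figures printed by torch.cuda.memory_allocated are somewhat higher, presumably because they count every live tensor on the device and not only these two inputs, but both remain far below the total GPU memory.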

Environment
See https://colab.research.google.com/drive/18lSBxYjZgkcPM96bponiyKutwWoUKTUu?usp=sharing
for the complete notebook with all messages. The output of monai.config.print_debug_info() is copied below:

================================
Printing system config...
================================
System: Linux
Linux version: Ubuntu 18.04.5 LTS
Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
Processor: x86_64
Machine: x86_64
Python version: 3.7.10
Process name: python3
Command: ['python3', '-c', 'import monai; monai.config.print_debug_info()']
Open files: []
Num physical CPUs: 1
Num logical CPUs: 2
Num usable CPUs: 2
CPU usage (%): [26.5, 100.0]
CPU freq. (MHz): UNKNOWN for given OS
Load avg. in last 1, 5, 15 mins (%): UNKNOWN for given OS
Disk usage (%): 57.9
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 12.7
Available memory (GB): 10.8
Used memory (GB): 3.1

================================
Printing GPU config...
================================
Num GPUs: 1
Has CUDA: True
CUDA version: 10.1
cuDNN enabled: True
cuDNN version: 7603
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_70']
GPU 0 Name: Tesla K80
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 13
GPU 0 Total memory (GB): 11.2
GPU 0 CUDA capability (maj.min): 3.7

Additional context
The toy example above is only meant to illustrate and reproduce the error. I first saw this error while working with real volumetric data, which I have omitted from this report because the error is reproducible without it.

cc: @charliebudd

Labels: bug (Something isn't working)