DynUNet crashes with DataParallel and DeepSupervision #6442
Describe the bug
DynUNet crashes in a torch.nn.DataParallel scenario, since a mutable list is used to get the supervision heads.
MONAI/monai/networks/nets/dynunet.py
Lines 212 to 219 in 5f344cc
    return DynUNetSkipLayer(
        index,
        downsample=downsamples[0],
        upsample=upsamples[0],
        next_layer=next_layer,
        heads=self.heads,
        super_head=superheads[0],
    )
MONAI/monai/networks/nets/dynunet.py
Line 51 in 5f344cc
    self.heads[self.index - 1] = self.super_head(upout)
This does not work for multiple GPUs in this scenario, because we end up with tensors in the list living on different CUDA devices. The code then crashes when stacking the tensors in the list at:
MONAI/monai/networks/nets/dynunet.py
Lines 271 to 275 in 5f344cc
    if self.training and self.deep_supervision:
        out_all = [out]
        for feature_map in self.heads:
            out_all.append(interpolate(feature_map, out.shape[2:]))
        return torch.stack(out_all, dim=1)
To Reproduce
Run torch.nn.DataParallel(DynUNet(..., deep_supervision=True), device_ids=[0, 1])
Expected behavior
DynUNet forward should be thread-safe. I know that DistributedDataParallel is superior and would avoid the problem; however, DataParallel should still work, by correctly passing results through block return values instead of using a "global" mutable list.
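One possible shape for such a fix, sketched with hypothetical names (not MONAI's actual API): each skip layer returns its supervision heads alongside its output, so every forward call builds its own list and replicas never share state.

```python
class SkipLayer:
    """Sketch of a thread-safe skip layer: heads flow via return values."""

    def __init__(self, super_head=None, next_layer=None):
        self.super_head = super_head
        self.next_layer = next_layer

    def forward(self, x):
        # recurse into the inner layer; each call owns its heads list,
        # so concurrent DataParallel replicas cannot interfere
        if self.next_layer is not None:
            inner, heads = self.next_layer.forward(x)
        else:
            inner, heads = x, []
        upout = inner  # stands in for upsample(block(downsample(x)), skip)
        if self.super_head is not None:
            heads = [self.super_head(upout)] + heads
        return upout, heads

# usage: toy "heads" on plain numbers instead of tensors
net = SkipLayer(super_head=lambda t: t * 2,
                next_layer=SkipLayer(super_head=lambda t: t + 1,
                                     next_layer=SkipLayer()))
out, heads = net.forward(3)
print(out, heads)  # -> 3 [6, 4]
```

The outer `forward` would then interpolate and stack `heads` locally, rather than reading a list mutated by other replicas.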
Environment
================================
Printing MONAI config...
================================
MONAI version: 1.1.0
Numpy version: 1.24.2
Pytorch version: 1.14.0a0+410ce96
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: a2ec3752f54bfc3b40e7952234fbeb5452ed63e3
MONAI __file__: /usr/local/lib/python3.8/dist-packages/monai/__init__.py
Optional dependencies:
Pytorch Ignite version: NOT INSTALLED or UNKNOWN VERSION.
Nibabel version: 3.2.2
scikit-image version: 0.19.3
Pillow version: 9.4.0
Tensorboard version: 2.12.0
gdown version: NOT INSTALLED or UNKNOWN VERSION.
TorchVision version: 0.15.0a0
tqdm version: 4.64.1
lmdb version: NOT INSTALLED or UNKNOWN VERSION.
psutil version: 5.9.4
pandas version: 1.5.3
einops version: NOT INSTALLED or UNKNOWN VERSION.
transformers version: NOT INSTALLED or UNKNOWN VERSION.
mlflow version: NOT INSTALLED or UNKNOWN VERSION.
pynrrd version: NOT INSTALLED or UNKNOWN VERSION.
For details about installing the optional dependencies, please visit:
https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies
================================
Printing system config...
================================
System: Linux
Linux version: Ubuntu 20.04.5 LTS
Platform: Linux-5.15.0-67-generic-x86_64-with-glibc2.29
Processor: x86_64
Machine: x86_64
Python version: 3.8.10
Process name: python
Command: ['python', '-c', 'import monai; monai.config.print_debug_info()']
Open files: []
Num physical CPUs: 48
Num logical CPUs: 48
Num usable CPUs: 48
CPU usage (%): [4.9, 5.4, 4.4, 4.4, 4.9, 4.4, 4.4, 4.4, 5.3, 4.9, 4.9, 4.9, 4.9, 5.3, 4.9, 5.3, 4.9, 4.4, 6.3, 4.9, 4.9, 4.4, 4.4, 5.3, 4.9, 4.9, 4.4, 4.9, 4.4, 4.9, 4.4, 4.9, 5.3, 4.9, 4.4, 4.9, 4.4, 4.9, 4.4, 4.9, 4.9, 4.4, 4.9, 4.4, 4.9, 4.4, 4.9, 99.5]
CPU freq. (MHz): 1646
Load avg. in last 1, 5, 15 mins (%): [0.2, 1.2, 6.7]
Disk usage (%): 60.9
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 1007.8
Available memory (GB): 991.8
Used memory (GB): 9.7
================================
Printing GPU config...
================================
Num GPUs: 2
Has CUDA: True
CUDA version: 11.8
cuDNN enabled: True
cuDNN version: 8700
Current device: 0
Library compiled for CUDA architectures: ['sm_52', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'compute_90']
GPU 0 Name: NVIDIA RTX A6000
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 84
GPU 0 Total memory (GB): 47.5
GPU 0 CUDA capability (maj.min): 8.6
GPU 1 Name: NVIDIA RTX A6000
GPU 1 Is integrated: False
GPU 1 Is multi GPU board: False
GPU 1 Multi processor count: 84
GPU 1 Total memory (GB): 47.5
GPU 1 CUDA capability (maj.min): 8.6