Skip to content

test_autorunner_gpu_customization assumes visible device gpu 0 #6247

@wyli

Description

@wyli
2023-03-27T17:04:09.2195753Z Traceback (most recent call last):
2023-03-27T17:04:09.2197307Z   File "/tmp/tmpz0_vc3a7/work_dir/segresnet2d_0/scripts/dummy_runner.py", line 214, in <module>
2023-03-27T17:04:09.2198657Z     fire.Fire(DummyRunnerSegResNet2D)
2023-03-27T17:04:09.2201046Z   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
2023-03-27T17:04:09.2202731Z     component_trace = _Fire(component, args, parsed_flag_args, context, name)
2023-03-27T17:04:09.2204431Z   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
2023-03-27T17:04:09.2205266Z     component, remaining_args = _CallAndUpdateTrace(
2023-03-27T17:04:09.2206396Z   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
2023-03-27T17:04:09.2207182Z     component = fn(*varargs, **kwargs)
2023-03-27T17:04:09.2208279Z   File "/tmp/tmpz0_vc3a7/work_dir/segresnet2d_0/scripts/dummy_runner.py", line 41, in __init__
2023-03-27T17:04:09.2209772Z     torch.cuda.set_device(self.device)
2023-03-27T17:04:09.2210963Z   File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 350, in set_device
2023-03-27T17:04:09.2211743Z     torch._C._cuda_setDevice(device)
2023-03-27T17:04:09.2212408Z RuntimeError: CUDA error: invalid device ordinal
2023-03-27T17:04:09.2213351Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2023-03-27T17:04:09.2214309Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2023-03-27T17:04:09.2215560Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

to reproduce this issue with a multi-gpu node:

export CUDA_VISBILE_DEVICES=1  # any value that doesn't include GPU 0
pip install -r requirements-dev.txt
python -m tests.test_integration_gpu_customization

more logs:
https://github.com/Project-MONAI/MONAI/actions/runs/4530824415/jobs/7980234392

Details
2023-03-27T17:03:12.9919532Z 2023-03-27 17:03:12,991 - INFO - Launching: torchrun --nnodes=1 --nproc_per_node=2 /tmp/tmpdckds8_9/work_dir/segresnet_0/scripts/train.py run --config_file='/tmp/tmpdckds8_9/work_dir/segresnet_0/configs/hyper_parameters.yaml' --num_images_per_batch=2 --num_epochs=2 --num_epochs_per_validation=1 --num_warmup_epochs=1 --use_pretrain=False --pretrained_path=
2023-03-27T17:03:15.0602131Z WARNING:torch.distributed.run:
2023-03-27T17:03:15.0602838Z *****************************************
2023-03-27T17:03:15.0604045Z Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
2023-03-27T17:03:15.0605563Z *****************************************
2023-03-27T17:03:21.1597153Z INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
2023-03-27T17:03:21.1598317Z INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
2023-03-27T17:03:21.1600158Z INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023-03-27T17:03:21.1601670Z Distributed: initializing multi-gpu env:// process group {'world_size': 2, 'rank': 0}
2023-03-27T17:03:21.1604317Z Segmenter 0 /tmp/tmpdckds8_9/work_dir/segresnet_0/configs/hyper_parameters.yaml {'num_images_per_batch': 2, 'num_epochs': 2, 'num_epochs_per_validation': 1, 'num_warmup_epochs': 1, 'use_pretrain': False, 'pretrained_path': '', 'mgpu': {'world_size': 2, 'rank': 0}}
2023-03-27T17:03:21.1702586Z INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023-03-27T17:03:21.1704025Z Distributed: initializing multi-gpu env:// process group {'world_size': 2, 'rank': 1}
2023-03-27T17:03:21.1894595Z _meta_: {}
2023-03-27T17:03:21.1895143Z acc: null
2023-03-27T17:03:21.1895584Z amp: true
2023-03-27T17:03:21.1896044Z batch_size: 2
2023-03-27T17:03:21.1896674Z bundle_root: /tmp/tmpdckds8_9/work_dir/segresnet_0
2023-03-27T17:03:21.1899375Z cache_rate: null
2023-03-27T17:03:21.1899997Z ckpt_path: /tmp/tmpdckds8_9/work_dir/segresnet_0/model
2023-03-27T17:03:21.1900608Z ckpt_save: true
2023-03-27T17:03:21.1901114Z class_index: null
2023-03-27T17:03:21.1901569Z class_names:
2023-03-27T17:03:21.1902997Z - label_class
2023-03-27T17:03:21.1903520Z crop_mode: rand
2023-03-27T17:03:21.1904001Z crop_ratios: null
2023-03-27T17:03:21.1904472Z cuda: true
2023-03-27T17:03:21.1905028Z data_file_base_dir: /tmp/tmpdckds8_9/dataroot
2023-03-27T17:03:21.1905747Z data_list_file_path: /tmp/tmpdckds8_9/work_dir/sim_input.json
2023-03-27T17:03:21.1907651Z determ: false
2023-03-27T17:03:21.1908137Z extra_modalities: {}
2023-03-27T17:03:21.1908647Z finetune:
2023-03-27T17:03:21.1909254Z   ckpt_name: /tmp/tmpdckds8_9/work_dir/segresnet_0/model/model.pt
2023-03-27T17:03:21.1909885Z   enabled: false
2023-03-27T17:03:21.1910337Z fold: 0
2023-03-27T17:03:21.1910737Z image_size:
2023-03-27T17:03:21.1911264Z - 24
2023-03-27T17:03:21.1911725Z - 24
2023-03-27T17:03:21.1912165Z - 24
2023-03-27T17:03:21.1912552Z infer:
2023-03-27T17:03:21.1913159Z   ckpt_name: /tmp/tmpdckds8_9/work_dir/segresnet_0/model/model.pt
2023-03-27T17:03:21.1913780Z   data_list_key: testing
2023-03-27T17:03:21.1914272Z   enabled: false
2023-03-27T17:03:21.1914956Z   output_path: /tmp/tmpdckds8_9/work_dir/segresnet_0/prediction_testing
2023-03-27T17:03:21.1915792Z input_channels: 1
2023-03-27T17:03:21.1916280Z intensity_bounds:
2023-03-27T17:03:21.1916863Z - 0.6477540681759516
2023-03-27T17:03:21.1917353Z - 1.0
2023-03-27T17:03:21.1917811Z learning_rate: 0.0002
2023-03-27T17:03:21.1918294Z loss:
2023-03-27T17:03:21.1918727Z   _target_: DiceCELoss
2023-03-27T17:03:21.1919261Z   include_background: true
2023-03-27T17:03:21.1919845Z   sigmoid: $@sigmoid
2023-03-27T17:03:21.1920454Z   smooth_dr: 1.0e-05
2023-03-27T17:03:21.1920912Z   smooth_nr: 0
2023-03-27T17:03:21.1921406Z   softmax: $not @sigmoid
2023-03-27T17:03:21.1921914Z   squared_pred: true
2023-03-27T17:03:21.1922408Z   to_onehot_y: $not @sigmoid
2023-03-27T17:03:21.1922890Z mgpu:
2023-03-27T17:03:21.1923317Z   rank: 0
2023-03-27T17:03:21.1923733Z   world_size: 2
2023-03-27T17:03:21.1924196Z modality: mri
2023-03-27T17:03:21.1924673Z multigpu: false
2023-03-27T17:03:21.1925128Z name: sim_data
2023-03-27T17:03:21.1925572Z network:
2023-03-27T17:03:21.1926047Z   _target_: SegResNetDS
2023-03-27T17:03:21.1926522Z   blocks_down:
2023-03-27T17:03:21.1927030Z   - 1
2023-03-27T17:03:21.1927496Z   - 2
2023-03-27T17:03:21.1927943Z   - 2
2023-03-27T17:03:21.1928405Z   - 4
2023-03-27T17:03:21.1928861Z   - 4
2023-03-27T17:03:21.1929414Z   dsdepth: 4
2023-03-27T17:03:21.1930068Z   in_channels: '@input_channels'
2023-03-27T17:03:21.1930589Z   init_filters: 32
2023-03-27T17:03:21.1931031Z   norm: BATCH
2023-03-27T17:03:21.1931662Z   out_channels: '@output_classes'
2023-03-27T17:03:21.1932204Z normalize_mode: meanstd
2023-03-27T17:03:21.1932669Z num_epochs: 2
2023-03-27T17:03:21.1933152Z num_epochs_per_saving: 1
2023-03-27T17:03:21.1933688Z num_epochs_per_validation: 1
2023-03-27T17:03:21.1934205Z num_images_per_batch: 2
2023-03-27T17:03:21.1934707Z num_warmup_epochs: 1
2023-03-27T17:03:21.1935186Z num_workers: 4
2023-03-27T17:03:21.1935620Z optimizer:
2023-03-27T17:03:21.1936244Z   _target_: torch.optim.AdamW
2023-03-27T17:03:21.1936917Z   lr: '@learning_rate'
2023-03-27T17:03:21.1937661Z   weight_decay: 1.0e-05
2023-03-27T17:03:21.1938186Z output_classes: 2
2023-03-27T17:03:21.1938725Z pretrained_ckpt_name: null
2023-03-27T17:03:21.1939303Z pretrained_path: ''
2023-03-27T17:03:21.1939795Z quick: false
2023-03-27T17:03:21.1940247Z rank: 0
2023-03-27T17:03:21.1940681Z resample: false
2023-03-27T17:03:21.1941191Z resample_resolution:
2023-03-27T17:03:21.1941743Z - 1.0
2023-03-27T17:03:21.1942202Z - 1.0
2023-03-27T17:03:21.1942685Z - 1.0
2023-03-27T17:03:21.1943100Z roi_size:
2023-03-27T17:03:21.1943567Z - 32
2023-03-27T17:03:21.1944043Z - 32
2023-03-27T17:03:21.1944511Z - 32
2023-03-27T17:03:21.1944913Z sigmoid: false
2023-03-27T17:03:21.1945398Z spacing_lower:
2023-03-27T17:03:21.1945915Z - 1.0
2023-03-27T17:03:21.1946367Z - 1.0
2023-03-27T17:03:21.1946840Z - 1.0
2023-03-27T17:03:21.1947278Z spacing_upper:
2023-03-27T17:03:21.1947768Z - 1.0
2023-03-27T17:03:21.1948240Z - 1.0
2023-03-27T17:03:21.1948712Z - 1.0
2023-03-27T17:03:21.1949248Z task: segmentation
2023-03-27T17:03:21.1949766Z use_pretrain: false
2023-03-27T17:03:21.1950258Z validate:
2023-03-27T17:03:21.1950879Z   ckpt_name: /tmp/tmpdckds8_9/work_dir/segresnet_0/model/model.pt
2023-03-27T17:03:21.1951508Z   enabled: false
2023-03-27T17:03:21.1951970Z   invert: true
2023-03-27T17:03:21.1952631Z   output_path: /tmp/tmpdckds8_9/work_dir/segresnet_0/prediction_validation
2023-03-27T17:03:21.1953293Z   save_mask: false
2023-03-27T17:03:21.1953768Z warmup_epochs: 13
2023-03-27T17:03:21.1954059Z 
2023-03-27T17:03:23.9846279Z Total parameters count 87164200 distributed True
2023-03-27T17:03:23.9976566Z monai.transforms.io.dictionary LoadImaged.__init__:image_only: Current default value of argument `image_only=False` has been deprecated since version 1.1. It will be changed to `image_only=True` in version 1.3.
2023-03-27T17:03:23.9977865Z Segmenter train called
2023-03-27T17:03:23.9978471Z train_files files 8 validation files 4
2023-03-27T17:03:23.9979803Z Calculating cache required 0GB, available RAM 722GB given avg image size [24, 24, 24].
2023-03-27T17:03:23.9980930Z Caching full dataset in RAM
2023-03-27T17:03:23.9982143Z monai.transforms.io.dictionary LoadImaged.__init__:image_only: Current default value of argument `image_only=False` has been deprecated since version 1.1. It will be changed to `image_only=True` in version 1.3.
2023-03-27T17:03:24.1386921Z Writing Tensorboard logs to  /tmp/tmpdckds8_9/work_dir/segresnet_0/model
2023-03-27T17:03:27.3459134Z Epoch 0/2 0/2 loss: 2.3806 acc [ 0.182] time 3.10s
2023-03-27T17:03:27.3501919Z INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
2023-03-27T17:03:27.3503085Z INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
2023-03-27T17:03:27.5169364Z Epoch 0/2 1/2 loss: 2.2985 acc [ 0.166] time 0.17s
2023-03-27T17:03:27.5181529Z Final training  0/1 loss: 2.2985 acc_avg: 0.1660 acc [ 0.166] time 3.28s
2023-03-27T17:03:28.4635734Z Val 0/2 0/2 loss: 1.0686 acc [ 0.369] time 0.94s
2023-03-27T17:03:28.4778698Z Val 0/2 1/2 loss: 1.0820 acc [ 0.351] time 0.01s
2023-03-27T17:03:28.4784515Z Final validation  0/1 loss: 1.0820 acc_avg: 0.3506 acc [ 0.351] time 0.96s
2023-03-27T17:03:28.4787965Z New best metric (-1.000000 --> 0.350615). 
2023-03-27T17:03:28.8740366Z Saving checkpoint process: /tmp/tmpdckds8_9/work_dir/segresnet_0/model/model.pt {'epoch': 0, 'best_metric': 0.35061508417129517} save_time 0.39s
2023-03-27T17:03:28.8759397Z Progress:  best_avg_dice_score_epoch: 0, best_avg_dice_score: 0.35061508417129517, save_time: 0.39480137825012207, train_time: 3.28s, validation_time: 0.96s, epoch_time: 4.24s, model: /tmp/tmpdckds8_9/work_dir/segresnet_0/model/model.pt, date: 2023-03-27 17:03:28
2023-03-27T17:03:29.5723235Z Epoch 1/2 0/2 loss: 2.0282 acc [ 0.212] time 0.46s
2023-03-27T17:03:29.7207985Z Epoch 1/2 1/2 loss: 1.7398 acc [ 0.356] time 0.15s
2023-03-27T17:03:29.7216048Z Final training  1/1 loss: 1.7398 acc_avg: 0.3565 acc [ 0.356] time 0.61s
2023-03-27T17:03:29.8852356Z Val 1/2 0/2 loss: 1.0131 acc [ 0.493] time 0.16s
2023-03-27T17:03:29.9452818Z Val 1/2 1/2 loss: 1.0195 acc [ 0.441] time 0.06s
2023-03-27T17:03:29.9455974Z Final validation  1/1 loss: 1.0195 acc_avg: 0.4407 acc [ 0.441] time 0.22s
2023-03-27T17:03:29.9458553Z New best metric (0.350615 --> 0.440705). 
2023-03-27T17:03:31.2291580Z Saving checkpoint process: /tmp/tmpdckds8_9/work_dir/segresnet_0/model/model.pt {'epoch': 1, 'best_metric': 0.44070500135421753} save_time 1.28s
2023-03-27T17:03:31.2304615Z Progress:  best_avg_dice_score_epoch: 1, best_avg_dice_score: 0.44070500135421753, save_time: 1.2826282978057861, train_time: 0.61s, validation_time: 0.22s, epoch_time: 0.84s, model: /tmp/tmpdckds8_9/work_dir/segresnet_0/model/model.pt, date: 2023-03-27 17:03:31
2023-03-27T17:03:32.5644741Z train completed, best_metric: 0.4407 at epoch: 1
2023-03-27T17:03:35.4396650Z 2023-03-27 17:03:35,438 - INFO - Launching: torchrun --nnodes=1 --nproc_per_node=2 /tmp/tmpdckds8_9/work_dir/swinunetr_0/scripts/train.py run --config_file='/tmp/tmpdckds8_9/work_dir/swinunetr_0/configs/transforms_infer.yaml','/tmp/tmpdckds8_9/work_dir/swinunetr_0/configs/transforms_validate.yaml','/tmp/tmpdckds8_9/work_dir/swinunetr_0/configs/transforms_train.yaml','/tmp/tmpdckds8_9/work_dir/swinunetr_0/configs/network.yaml','/tmp/tmpdckds8_9/work_dir/swinunetr_0/configs/hyper_parameters.yaml' --num_images_per_batch=2 --num_epochs=2 --num_epochs_per_validation=1 --num_warmup_epochs=1 --use_pretrain=False --pretrained_path=
2023-03-27T17:03:37.4682659Z WARNING:torch.distributed.run:
2023-03-27T17:03:37.4683377Z *****************************************
2023-03-27T17:03:37.4684582Z Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
2023-03-27T17:03:37.4685678Z *****************************************
2023-03-27T17:03:43.5129090Z monai.transforms.io.dictionary LoadImaged.__init__:image_only: Current default value of argument `image_only=False` has been deprecated since version 1.1. It will be changed to `image_only=True` in version 1.3.
2023-03-27T17:03:43.5149635Z monai.transforms.io.dictionary LoadImaged.__init__:image_only: Current default value of argument `image_only=False` has been deprecated since version 1.1. It will be changed to `image_only=True` in version 1.3.
2023-03-27T17:03:43.5497208Z [info] number of GPUs: 2
2023-03-27T17:03:43.5516847Z [info] number of GPUs: 2
2023-03-27T17:03:43.6082763Z INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
2023-03-27T17:03:43.6083928Z INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
2023-03-27T17:03:43.6086345Z INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023-03-27T17:03:43.6087296Z [info] world_size: 2
2023-03-27T17:03:43.6098161Z train_files: 4
2023-03-27T17:03:43.6098650Z val_files: 2
2023-03-27T17:03:43.6187896Z INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023-03-27T17:03:43.6188863Z [info] world_size: 2
2023-03-27T17:03:43.6200798Z train_files: 4
2023-03-27T17:03:43.6201262Z val_files: 2
2023-03-27T17:03:45.8820750Z num_epochs 2
2023-03-27T17:03:45.8821384Z num_epochs_per_validation 1
2023-03-27T17:03:46.4185570Z [info] training from scratch
2023-03-27T17:03:46.4187048Z [info] training from scratch
2023-03-27T17:03:46.4187705Z [info] amp enabled
2023-03-27T17:03:46.4197220Z ----------
2023-03-27T17:03:46.4197700Z epoch 1/2
2023-03-27T17:03:46.4198209Z learning rate is set to 0.0001
2023-03-27T17:03:48.9992355Z [2023-03-27 17:03:48] 1/2, train_loss: 1.0422
2023-03-27T17:03:49.1269482Z INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
2023-03-27T17:03:49.1325574Z INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
2023-03-27T17:03:49.5422206Z [2023-03-27 17:03:49] 2/2, train_loss: 0.9727
2023-03-27T17:03:49.6075368Z epoch 1 average loss: 1.0141, best mean dice: -1.0000 at epoch -1
2023-03-27T17:03:50.0334775Z 1 / 2 metatensor([[0.0667]], device='cuda:0')
2023-03-27T17:03:50.0823518Z 1 / 2 metatensor([[0.0742]], device='cuda:1')
2023-03-27T17:03:50.2180221Z 2 / 2 metatensor([[0.0859]], device='cuda:0')
2023-03-27T17:03:50.2666680Z 2 / 2 metatensor([[0.0485]], device='cuda:1')
2023-03-27T17:03:50.3461768Z evaluation metric - class 1: 0.06882839649915695
2023-03-27T17:03:50.3462435Z avg_metric 0.06882839649915695
2023-03-27T17:03:50.7171271Z saved new best metric model
2023-03-27T17:03:50.7177498Z current epoch: 1 current mean dice: 0.0688 best mean dice: 0.0688 at epoch 1
2023-03-27T17:03:50.7181238Z ----------
2023-03-27T17:03:50.7181672Z epoch 2/2
2023-03-27T17:03:50.7182327Z learning rate is set to 5e-05
2023-03-27T17:03:51.5044249Z [2023-03-27 17:03:51] 1/2, train_loss: 0.9273
2023-03-27T17:03:51.9755998Z [2023-03-27 17:03:51] 2/2, train_loss: 0.8876
2023-03-27T17:03:52.0625683Z epoch 2 average loss: 0.9049, best mean dice: 0.0688 at epoch 1
2023-03-27T17:03:52.4428934Z 1 / 2 metatensor([[0.0358]], device='cuda:0')
2023-03-27T17:03:52.4810911Z 1 / 2 metatensor([[0.0426]], device='cuda:1')
2023-03-27T17:03:52.5743129Z 2 / 2 metatensor([[0.0463]], device='cuda:0')
2023-03-27T17:03:52.6477638Z 2 / 2 metatensor([[0.0360]], device='cuda:1')
2023-03-27T17:03:52.7416845Z evaluation metric - class 1: 0.04017695039510727
2023-03-27T17:03:52.7417744Z avg_metric 0.04017695039510727
2023-03-27T17:03:52.7418457Z current epoch: 2 current mean dice: 0.0402 best mean dice: 0.0688 at epoch 1
2023-03-27T17:03:52.7429372Z train completed, best_metric: 0.0688 at epoch: 1
2023-03-27T17:03:52.7611925Z 0.06882839649915695
2023-03-27T17:03:52.7823157Z -1
2023-03-27T17:03:59.1118142Z 0
2023-03-27T17:03:59.1119152Z [info] checkpoint /tmp/tmpdckds8_9/work_dir/swinunetr_0/model_fold0/best_metric_model.pt loaded
2023-03-27T17:03:59.1120979Z 2023-03-27 17:03:59,111 INFO image_writer.py:197 - writing: /tmp/tmpdckds8_9/work_dir/swinunetr_0/prediction_testing/val_001.fake/val_001.fake_seg.nii.gz
2023-03-27T17:04:00.2021146Z 1
2023-03-27T17:04:00.2022710Z [info] checkpoint /tmp/tmpdckds8_9/work_dir/swinunetr_0/model_fold0/best_metric_model.pt loaded
2023-03-27T17:04:00.2024773Z 2023-03-27 17:04:00,201 INFO image_writer.py:197 - writing: /tmp/tmpdckds8_9/work_dir/swinunetr_0/prediction_testing/val_002.fake/val_002.fake_seg.nii.gz
2023-03-27T17:04:00.2054206Z 2023-03-27 17:04:00,204 - INFO - Auto3Dseg picked the following networks to ensemble:
2023-03-27T17:04:00.2055560Z 2023-03-27 17:04:00,204 - INFO - swinunetr_0
2023-03-27T17:04:00.2079489Z 2023-03-27 17:04:00,207 INFO image_writer.py:197 - writing: /tmp/tmpdckds8_9/work_dir/ensemble_output/val_001.fake/val_001.fake_ensemble.nii.gz
2023-03-27T17:04:00.2121587Z 2023-03-27 17:04:00,211 INFO image_writer.py:197 - writing: /tmp/tmpdckds8_9/work_dir/ensemble_output/val_002.fake/val_002.fake_ensemble.nii.gz
2023-03-27T17:04:00.2139946Z 2023-03-27 17:04:00,213 - INFO - Auto3Dseg ensemble prediction outputs are saved in /tmp/tmpdckds8_9/work_dir/ensemble_output.
2023-03-27T17:04:00.2141293Z 2023-03-27 17:04:00,213 - INFO - Auto3Dseg pipeline is complete successfully.
2023-03-27T17:04:00.4088661Z 
2023-03-27T17:04:00.4090790Z monai.transforms.io.dictionary LoadImaged.__init__:image_only: Current default value of argument `image_only=False` has been deprecated since version 1.1. It will be changed to `image_only=True` in version 1.3.
2023-03-27T17:04:00.4682064Z 2023-03-27 17:04:00,467 - INFO - AutoRunner using work directory /tmp/tmpz0_vc3a7/work_dir
2023-03-27T17:04:00.4703259Z 2023-03-27 17:04:00,469 - INFO - Loading input config /tmp/tmpz0_vc3a7/data_src_cfg.yaml
2023-03-27T17:04:00.4705294Z 2023-03-27 17:04:00,469 - INFO - Datalist was copied to work_dir: /tmp/tmpz0_vc3a7/work_dir/sim_input.json
2023-03-27T17:04:00.4706990Z 2023-03-27 17:04:00,470 - INFO - Setting num_fold 3 based on the input datalist /tmp/tmpz0_vc3a7/work_dir/sim_input.json.
2023-03-27T17:04:00.4738312Z 2023-03-27 17:04:00,473 - INFO - The output_dir is not specified. /tmp/tmpz0_vc3a7/work_dir/ensemble_output will be used to save ensemble predictions
2023-03-27T17:04:00.4740567Z 2023-03-27 17:04:00,473 - INFO - Directory /tmp/tmpz0_vc3a7/work_dir/ensemble_output is created to save ensemble predictions
2023-03-27T17:04:00.4741670Z 2023-03-27 17:04:00,473 - INFO - Running data analysis...
2023-03-27T17:04:00.4808712Z .
2023-03-27T17:04:01.3134206Z   0%|          | 0/12 [00:00<?, ?it/s]
2023-03-27T17:04:01.4904471Z   8%|▊         | 1/12 [00:00<00:09,  1.20it/s]
2023-03-27T17:04:01.5983977Z  25%|██▌       | 3/12 [00:01<00:02,  3.55it/s]
2023-03-27T17:04:01.9313410Z  67%|██████▋   | 8/12 [00:01<00:00, 10.77it/s]
2023-03-27T17:04:02.3207178Z 2023-03-27 17:04:02,319 - INFO - BundleGen from https://github.com/Project-MONAI/research-contributions/releases/download/algo_templates/4af80e1.tar.gz
2023-03-27T17:04:02.3211753Z 100%|██████████| 12/12 [00:01<00:00,  8.28it/s]
2023-03-27T17:04:02.3212377Z 
2023-03-27T17:04:02.6904543Z algo_templates.tar.gz: 0.00B [00:00, ?B/s]
2023-03-27T17:04:02.6999337Z algo_templates.tar.gz:  13%|█▎        | 8.00k/60.1k [00:00<00:02, 22.2kB/s]
2023-03-27T17:04:02.7003697Z 2023-03-27 17:04:02,699 - INFO - Downloaded: /tmp/tmpormfoxv4/algo_templates.tar.gz
2023-03-27T17:04:02.7005506Z 2023-03-27 17:04:02,700 - INFO - Expected md5 is None, skip md5 check for file /tmp/tmpormfoxv4/algo_templates.tar.gz.
2023-03-27T17:04:02.7007715Z 2023-03-27 17:04:02,700 - INFO - Writing into directory: /tmp/tmpz0_vc3a7/work_dir.
2023-03-27T17:04:03.3132842Z algo_templates.tar.gz: 64.0kB [00:00, 173kB/s]                             
2023-03-27T17:04:03.3134954Z �[32m[I 2023-03-27 17:04:03,311]�[0m A new study created in memory with name: no-name-78ffeb62-2609-4639-a2a4-e80ac136100c�[0m
2023-03-27T17:04:09.2195753Z Traceback (most recent call last):
2023-03-27T17:04:09.2197307Z   File "/tmp/tmpz0_vc3a7/work_dir/segresnet2d_0/scripts/dummy_runner.py", line 214, in <module>
2023-03-27T17:04:09.2198657Z     fire.Fire(DummyRunnerSegResNet2D)
2023-03-27T17:04:09.2201046Z   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
2023-03-27T17:04:09.2202731Z     component_trace = _Fire(component, args, parsed_flag_args, context, name)
2023-03-27T17:04:09.2204431Z   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
2023-03-27T17:04:09.2205266Z     component, remaining_args = _CallAndUpdateTrace(
2023-03-27T17:04:09.2206396Z   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
2023-03-27T17:04:09.2207182Z     component = fn(*varargs, **kwargs)
2023-03-27T17:04:09.2208279Z   File "/tmp/tmpz0_vc3a7/work_dir/segresnet2d_0/scripts/dummy_runner.py", line 41, in __init__
2023-03-27T17:04:09.2209772Z     torch.cuda.set_device(self.device)
2023-03-27T17:04:09.2210963Z   File "/opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py", line 350, in set_device
2023-03-27T17:04:09.2211743Z     torch._C._cuda_setDevice(device)
2023-03-27T17:04:09.2212408Z RuntimeError: CUDA error: invalid device ordinal
2023-03-27T17:04:09.2213351Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
2023-03-27T17:04:09.2214309Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2023-03-27T17:04:09.2215560Z Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
2023-03-27T17:04:09.2216044Z 
2023-03-27T17:04:09.9935321Z �[33m[W 2023-03-27 17:04:09,988]�[0m Trial 0 failed with parameters: {'num_images_per_batch': 1, 'num_sw_batch_size': 2, 'validation_data_device': 'gpu'} because of the following error: CalledProcessError(1, ['python', '/tmp/tmpz0_vc3a7/work_dir/segresnet2d_0/scripts/dummy_runner.py', '--output_path', '/tmp/tmpz0_vc3a7/work_dir/segresnet2d_0', '--data_stats_file', '/tmp/tmpz0_vc3a7/work_dir/datastats.yaml', '--device_id', '2', 'run', '--num_images_per_batch', '1', '--num_sw_batch_size', '2', '--validation_data_device', 'gpu']).�[0m
2023-03-27T17:04:09.9937530Z Traceback (most recent call last):
2023-03-27T17:04:09.9938714Z   File "/opt/conda/lib/python3.8/site-packages/optuna/study/_optimize.py", line 200, in _run_trial
2023-03-27T17:04:09.9939504Z     value_or_values = func(trial)
2023-03-27T17:04:09.9940330Z   File "/tmp/tmp5tt7eirs/work_dir/algorithm_templates/segresnet2d/scripts/algo.py", line 283, in objective
2023-03-27T17:04:09.9941265Z   File "/opt/conda/lib/python3.8/subprocess.py", line 516, in run
2023-03-27T17:04:09.9942036Z     raise CalledProcessError(retcode, process.args,
2023-03-27T17:04:09.9944615Z subprocess.CalledProcessError: Command '['python', '/tmp/tmpz0_vc3a7/work_dir/segresnet2d_0/scripts/dummy_runner.py', '--output_path', '/tmp/tmpz0_vc3a7/work_dir/segresnet2d_0', '--data_stats_file', '/tmp/tmpz0_vc3a7/work_dir/datastats.yaml', '--device_id', '2', 'run', '--num_images_per_batch', '1', '--num_sw_batch_size', '2', '--validation_data_device', 'gpu']' returned non-zero exit status 1.
2023-03-27T17:04:09.9946751Z �[33m[W 2023-03-27 17:04:09,992]�[0m Trial 0 failed with value None.�[0m
2023-03-27T17:04:10.0707467Z [info] gpu device 2 with minimum memory
2023-03-27T17:04:10.0709370Z 2023-03-27 17:04:10,070 - INFO - AutoRunner using work directory /tmp/tmpio0f5ogl/work_dir
2023-03-27T17:04:10.0733268Z 2023-03-27 17:04:10,072 - INFO - Loading input config /tmp/tmpio0f5ogl/data_src_cfg.yaml
2023-03-27T17:04:10.0735118Z 2023-03-27 17:04:10,072 - INFO - Datalist was copied to work_dir: /tmp/tmpio0f5ogl/work_dir/sim_input.json
2023-03-27T17:04:10.0736486Z 2023-03-27 17:04:10,073 - INFO - Setting num_fold 3 based on the input datalist /tmp/tmpio0f5ogl/work_dir/sim_input.json.
2023-03-27T17:04:10.0768577Z 2023-03-27 17:04:10,076 - INFO - The output_dir is not specified. /tmp/tmpio0f5ogl/work_dir/ensemble_output will be used to save ensemble predictions
2023-03-27T17:04:10.0770158Z 2023-03-27 17:04:10,076 - INFO - Directory /tmp/tmpio0f5ogl/work_dir/ensemble_output is created to save ensemble predictions
2023-03-27T17:04:10.0772075Z 2023-03-27 17:04:10,076 - INFO - Running data analysis...
2023-03-27T17:04:10.0848450Z E
2023-03-27T17:04:10.9372897Z   0%|          | 0/12 [00:00<?, ?it/s]
2023-03-27T17:04:11.1309654Z   8%|▊         | 1/12 [00:00<00:09,  1.17it/s]
2023-03-27T17:04:11.2336242Z  25%|██▌       | 3/12 [00:01<00:02,  3.42it/s]
2023-03-27T17:04:11.5721455Z  67%|██████▋   | 8/12 [00:01<00:00, 10.51it/s]
2023-03-27T17:04:11.9303508Z 2023-03-27 17:04:11,929 - INFO - BundleGen from https://github.com/Project-MONAI/research-contributions/releases/download/algo_templates/4af80e1.tar.gz
2023-03-27T17:04:11.9308086Z 100%|██████████| 12/12 [00:01<00:00,  8.07it/s]
2023-03-27T17:04:11.9308784Z 
2023-03-27T17:04:12.3163092Z algo_templates.tar.gz: 0.00B [00:00, ?B/s]
2023-03-27T17:04:12.3267844Z algo_templates.tar.gz:  13%|█▎        | 8.00k/60.1k [00:00<00:02, 21.3kB/s]
2023-03-27T17:04:12.3273056Z 2023-03-27 17:04:12,326 - INFO - Downloaded: /tmp/tmpyjza4sup/algo_templates.tar.gz
2023-03-27T17:04:12.3274907Z 2023-03-27 17:04:12,327 - INFO - Expected md5 is None, skip md5 check for file /tmp/tmpyjza4sup/algo_templates.tar.gz.
2023-03-27T17:04:12.3277211Z 2023-03-27 17:04:12,327 - INFO - Writing into directory: /tmp/tmpio0f5ogl/work_dir.
2023-03-27T17:04:12.7101520Z 2023-03-27 17:04:12,709 - INFO - /tmp/tmpio0f5ogl/work_dir/segresnet2d_0
2023-03-27T17:04:12.9306747Z 2023-03-27 17:04:12,930 - INFO - /tmp/tmpio0f5ogl/work_dir/dints_0
2023-03-27T17:04:13.0934667Z 2023-03-27 17:04:13,092 - INFO - /tmp/tmpio0f5ogl/work_dir/swinunetr_0
2023-03-27T17:04:13.1907419Z Loaded self.data_list_file /tmp/tmpio0f5ogl/work_dir/input.yaml
2023-03-27T17:04:13.1909179Z 2023-03-27 17:04:13,190 - INFO - /tmp/tmpio0f5ogl/work_dir/segresnet_0
2023-03-27T17:04:13.5166027Z 2023-03-27 17:04:13,515 - INFO - /tmp/tmpio0f5ogl/work_dir/dints_0_override
2023-03-27T17:04:13.5210100Z 2023-03-27 17:04:13,520 - INFO - AutoRunner HPO is in dry-run mode. Please manually launch: nnictl create --config /tmp/tmpio0f5ogl/work_dir/dints_0_nni_config.yaml --port 8088
2023-03-27T17:04:13.7856797Z 2023-03-27 17:04:13,784 - INFO - /tmp/tmpio0f5ogl/work_dir/segresnet2d_0_override
2023-03-27T17:04:13.7900335Z 2023-03-27 17:04:13,789 - INFO - AutoRunner HPO is in dry-run mode. Please manually launch: nnictl create --config /tmp/tmpio0f5ogl/work_dir/segresnet2d_0_nni_config.yaml --port 8088
2023-03-27T17:04:13.9263550Z 2023-03-27 17:04:13,925 - INFO - /tmp/tmpio0f5ogl/work_dir/segresnet_0_override
2023-03-27T17:04:13.9511651Z 2023-03-27 17:04:13,950 - INFO - AutoRunner HPO is in dry-run mode. Please manually launch: nnictl create --config /tmp/tmpio0f5ogl/work_dir/segresnet_0_nni_config.yaml --port 8088
2023-03-27T17:04:14.1755263Z 2023-03-27 17:04:14,174 - INFO - /tmp/tmpio0f5ogl/work_dir/swinunetr_0_override
2023-03-27T17:04:14.1798989Z 2023-03-27 17:04:14,179 - INFO - AutoRunner HPO is in dry-run mode. Please manually launch: nnictl create --config /tmp/tmpio0f5ogl/work_dir/swinunetr_0_nni_config.yaml --port 8088
2023-03-27T17:04:14.1807012Z 2023-03-27 17:04:14,180 - INFO - Auto3Dseg pipeline is complete successfully.
2023-03-27T17:04:14.1864850Z algo_templates.tar.gz: 64.0kB [00:00, 166kB/s]                             
2023-03-27T17:04:14.2957241Z .
2023-03-27T17:04:14.2958277Z ======================================================================
2023-03-27T17:04:14.2959109Z ERROR: test_autorunner_gpu_customization (__main__.TestAutoRunner)
2023-03-27T17:04:14.2960324Z ----------------------------------------------------------------------
2023-03-27T17:04:14.2961045Z Traceback (most recent call last):
2023-03-27T17:04:14.2961876Z   File "tests/test_integration_autorunner.py", line 149, in test_autorunner_gpu_customization
2023-03-27T17:04:14.2963097Z     runner.run()
2023-03-27T17:04:14.2963936Z   File "/__w/MONAI/MONAI/monai/apps/auto3dseg/auto_runner.py", line 742, in run
2023-03-27T17:04:14.2965330Z     bundle_generator.generate(
2023-03-27T17:04:14.2966671Z   File "/__w/MONAI/MONAI/monai/apps/auto3dseg/bundle_gen.py", line 515, in generate
2023-03-27T17:04:14.2967983Z     gen_algo.export_to_disk(
2023-03-27T17:04:14.2969339Z   File "/__w/MONAI/MONAI/monai/apps/auto3dseg/bundle_gen.py", line 141, in export_to_disk
2023-03-27T17:04:14.2970695Z     self.fill_records = self.fill_template_config(self.data_stats_files, self.output_path, **kwargs)
2023-03-27T17:04:14.2971774Z   File "/tmp/tmp5tt7eirs/work_dir/algorithm_templates/segresnet2d/scripts/algo.py", line 208, in fill_template_config
2023-03-27T17:04:14.2972908Z   File "/tmp/tmp5tt7eirs/work_dir/algorithm_templates/segresnet2d/scripts/algo.py", line 310, in customize_param_for_gpu
2023-03-27T17:04:14.2974295Z   File "/opt/conda/lib/python3.8/site-packages/optuna/study/study.py", line 425, in optimize
2023-03-27T17:04:14.2975007Z     _optimize(
2023-03-27T17:04:14.2976024Z   File "/opt/conda/lib/python3.8/site-packages/optuna/study/_optimize.py", line 66, in _optimize
2023-03-27T17:04:14.2976768Z     _optimize_sequential(
2023-03-27T17:04:14.2978054Z   File "/opt/conda/lib/python3.8/site-packages/optuna/study/_optimize.py", line 163, in _optimize_sequential
2023-03-27T17:04:14.2978929Z     frozen_trial = _run_trial(study, func, catch)
2023-03-27T17:04:14.2980032Z   File "/opt/conda/lib/python3.8/site-packages/optuna/study/_optimize.py", line 251, in _run_trial
2023-03-27T17:04:14.2980760Z     raise func_err
2023-03-27T17:04:14.2981778Z   File "/opt/conda/lib/python3.8/site-packages/optuna/study/_optimize.py", line 200, in _run_trial
2023-03-27T17:04:14.2982696Z     value_or_values = func(trial)
2023-03-27T17:04:14.2983518Z   File "/tmp/tmp5tt7eirs/work_dir/algorithm_templates/segresnet2d/scripts/algo.py", line 283, in objective
2023-03-27T17:04:14.2984421Z   File "/opt/conda/lib/python3.8/subprocess.py", line 516, in run
2023-03-27T17:04:14.2985197Z     raise CalledProcessError(retcode, process.args,
2023-03-27T17:04:14.2987921Z subprocess.CalledProcessError: Command '['python', '/tmp/tmpz0_vc3a7/work_dir/segresnet2d_0/scripts/dummy_runner.py', '--output_path', '/tmp/tmpz0_vc3a7/work_dir/segresnet2d_0', '--data_stats_file', '/tmp/tmpz0_vc3a7/work_dir/datastats.yaml', '--device_id', '2', 'run', '--num_images_per_batch', '1', '--num_sw_batch_size', '2', '--validation_data_device', 'gpu']' returned non-zero exit status 1.
2023-03-27T17:04:14.2989347Z 
2023-03-27T17:04:14.2989936Z ----------------------------------------------------------------------
2023-03-27T17:04:14.2990635Z Ran 4 tests in 331.324s
2023-03-27T17:04:14.2990955Z 
2023-03-27T17:04:14.2991151Z FAILED (errors=1)

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions