reduce testing memory footprint tests.test_auto3dseg_ensemble #5816

@wyli

Description

The CI log below shows `test_ensemble` failing with `RuntimeError: CUDA error: out of memory` while moving the segresnet2d model to the GPU under a two-process torchrun launch; the test's GPU memory footprint should be reduced.
2023-01-05T20:52:15.2725708Z Starting test: test_ensemble (tests.test_auto3dseg_ensemble.TestEnsembleBuilder)...
2023-01-05T20:52:15.2726298Z 
2023-01-05T20:52:15.2726906Z Cannot find dataset directory: /__w/MONAI/MONAI/tests/testing_data/Task04_Hippocampus, please use download=True to download it.
2023-01-05T20:52:16.3553099Z 100%|██████████| 12/12 [00:00<00:00, 13.46it/s]
2023-01-05T20:52:16.3553538Z 
2023-01-05T20:52:16.8067140Z algo_templates.tar.gz: 0.00B [00:00, ?B/s]
2023-01-05T20:52:16.8131425Z algo_templates.tar.gz:  14%|█▍        | 8.00k/57.2k [00:00<00:02, 18.2kB/s]
2023-01-05T20:52:16.8138542Z 2023-01-05 20:52:16,813 - INFO - Downloaded: /tmp/tmp5bwbggqp/algo_templates.tar.gz
2023-01-05T20:52:16.8140126Z 2023-01-05 20:52:16,813 - INFO - Expected md5 is None, skip md5 check for file /tmp/tmp5bwbggqp/algo_templates.tar.gz.
2023-01-05T20:52:16.8142376Z 2023-01-05 20:52:16,813 - INFO - Writing into directory: /tmp/tmpuse1f938/workdir.
2023-01-05T20:52:17.2229508Z 2023-01-05 20:52:17,222 - INFO - /tmp/tmpuse1f938/workdir/segresnet2d_0
2023-01-05T20:52:17.4908638Z 2023-01-05 20:52:17,490 - INFO - /tmp/tmpuse1f938/workdir/dints_0
2023-01-05T20:52:17.7268874Z 2023-01-05 20:52:17,726 - INFO - /tmp/tmpuse1f938/workdir/swinunetr_0
2023-01-05T20:52:17.9300995Z Loaded self.data_list_file /tmp/tmpuse1f938/workdir/data_src_cfg.yaml
2023-01-05T20:52:17.9302285Z 2023-01-05 20:52:17,929 - INFO - /tmp/tmpuse1f938/workdir/segresnet_0
2023-01-05T20:52:17.9437330Z 2023-01-05 20:52:17,942 - INFO - Launching: torchrun --nnodes=1 --nproc_per_node=2 /tmp/tmpuse1f938/workdir/segresnet2d_0/scripts/train.py run --config_file='/tmp/tmpuse1f938/workdir/segresnet2d_0/configs/transforms_infer.yaml','/tmp/tmpuse1f938/workdir/segresnet2d_0/configs/transforms_validate.yaml','/tmp/tmpuse1f938/workdir/segresnet2d_0/configs/transforms_train.yaml','/tmp/tmpuse1f938/workdir/segresnet2d_0/configs/network.yaml','/tmp/tmpuse1f938/workdir/segresnet2d_0/configs/hyper_parameters.yaml' --num_images_per_batch=2 --num_epochs=2 --num_epochs_per_validation=1 --num_warmup_epochs=1 --use_pretrain=False --pretrained_path=
2023-01-05T20:52:19.5234099Z algo_templates.tar.gz: 64.0kB [00:00, 143kB/s]
WARNING:torch.distributed.run:
2023-01-05T20:52:19.5235389Z *****************************************
2023-01-05T20:52:19.5237149Z Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
2023-01-05T20:52:19.5238265Z *****************************************
2023-01-05T20:52:25.0119232Z [info] number of GPUs: 2
2023-01-05T20:52:25.0132020Z [info] number of GPUs: 2
2023-01-05T20:52:25.0333004Z 2023-01-05 20:52:25,032 - Added key: store_based_barrier_key:1 to store for rank: 1
2023-01-05T20:52:25.0554328Z 2023-01-05 20:52:25,054 - Added key: store_based_barrier_key:1 to store for rank: 0
2023-01-05T20:52:25.0555773Z 2023-01-05 20:52:25,055 - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023-01-05T20:52:25.0556586Z [info] world_size: 2
2023-01-05T20:52:25.0559525Z train_files: 4
2023-01-05T20:52:25.0560003Z val_files: 2
2023-01-05T20:52:25.0643106Z 2023-01-05 20:52:25,063 - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023-01-05T20:52:25.0643983Z [info] world_size: 2
2023-01-05T20:52:25.0652833Z train_files: 4
2023-01-05T20:52:25.0653347Z val_files: 2
2023-01-05T20:52:25.3589238Z Traceback (most recent call last):
2023-01-05T20:52:25.3590215Z   File "/tmp/tmpuse1f938/workdir/segresnet2d_0/scripts/train.py", line 513, in <module>
2023-01-05T20:52:25.3591042Z     fire.Fire()
2023-01-05T20:52:25.3592390Z   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
2023-01-05T20:52:25.3593413Z     component_trace = _Fire(component, args, parsed_flag_args, context, name)
2023-01-05T20:52:25.3594646Z   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
2023-01-05T20:52:25.3595572Z     component, remaining_args = _CallAndUpdateTrace(
2023-01-05T20:52:25.3596841Z   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
2023-01-05T20:52:25.3597702Z     component = fn(*varargs, **kwargs)
2023-01-05T20:52:25.3598572Z   File "/tmp/tmpuse1f938/workdir/segresnet2d_0/scripts/train.py", line 175, in run
2023-01-05T20:52:25.3599354Z     model = model.to(device)
2023-01-05T20:52:25.3600971Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 989, in to
2023-01-05T20:52:25.3601851Z     return self._apply(convert)
2023-01-05T20:52:25.3603019Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
2023-01-05T20:52:25.3603843Z     module._apply(fn)
2023-01-05T20:52:25.3604937Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
2023-01-05T20:52:25.3605760Z     module._apply(fn)
2023-01-05T20:52:25.3606873Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 664, in _apply
2023-01-05T20:52:25.3607727Z     param_applied = fn(param)
2023-01-05T20:52:25.3608858Z   File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 987, in convert
2023-01-05T20:52:25.3609979Z     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
2023-01-05T20:52:25.3610861Z RuntimeError: CUDA error: out of memory
2023-01-05T20:52:25.3611856Z CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
2023-01-05T20:52:25.3612919Z For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
2023-01-05T20:52:26.9288777Z num_epochs 2
2023-01-05T20:52:26.9289459Z num_epochs_per_validation 1
2023-01-05T20:52:29.5527327Z WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 6388 closing signal SIGTERM
2023-01-05T20:52:29.7673710Z ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 6389) of binary: /opt/conda/bin/python
2023-01-05T20:52:29.7774402Z Traceback (most recent call last):
2023-01-05T20:52:29.7775188Z   File "/opt/conda/bin/torchrun", line 8, in <module>
2023-01-05T20:52:29.7775837Z     sys.exit(main())
2023-01-05T20:52:29.7777578Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
2023-01-05T20:52:29.7779019Z     return f(*args, **kwargs)
2023-01-05T20:52:29.7780143Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
2023-01-05T20:52:29.7780883Z     run(args)
2023-01-05T20:52:29.7781871Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
2023-01-05T20:52:29.7782641Z     elastic_launch(
2023-01-05T20:52:29.7783759Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
2023-01-05T20:52:29.7784693Z     return launch_agent(self._config, self._entrypoint, list(args))
2023-01-05T20:52:29.7785966Z   File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
2023-01-05T20:52:29.7786793Z     raise ChildFailedError(
2023-01-05T20:52:29.7787740Z torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
2023-01-05T20:52:29.7788730Z ============================================================
2023-01-05T20:52:29.7789459Z /tmp/tmpuse1f938/workdir/segresnet2d_0/scripts/train.py FAILED
2023-01-05T20:52:29.7790497Z ------------------------------------------------------------
2023-01-05T20:52:29.7791098Z Failures:
2023-01-05T20:52:29.7791581Z   <NO_OTHER_FAILURES>
2023-01-05T20:52:29.7792441Z ------------------------------------------------------------
2023-01-05T20:52:29.7793130Z Root Cause (first observed failure):
2023-01-05T20:52:29.7793670Z [0]:
2023-01-05T20:52:29.7794305Z   time      : 2023-01-05_20:52:29
2023-01-05T20:52:29.7794818Z   host      : b9f7c1309e47
2023-01-05T20:52:29.7795356Z   rank      : 1 (local_rank: 1)
2023-01-05T20:52:29.7795905Z   exitcode  : 1 (pid: 6389)
2023-01-05T20:52:29.7796417Z   error_file: <N/A>
2023-01-05T20:52:29.7797254Z   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
2023-01-05T20:52:29.7798090Z ============================================================
2023-01-05T20:52:30.0071039Z 
2023-01-05T20:52:30.0088436Z E
Finished test: test_ensemble (tests.test_auto3dseg_ensemble.TestEnsembleBuilder) (14.8s)
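The torchrun command in the log already passes `--num_images_per_batch=2 --num_epochs=2`; one way to shrink the footprint is to tighten those overrides before the ensemble builder launches training. A minimal sketch, assuming the test passes a dict of training overrides matching the flag names visible in the log (the `train_param` name and the values below are illustrative, not the fix adopted for this issue):

```python
# Hypothetical override dict; the keys mirror the CLI flags in the
# torchrun command from the log above. A smaller batch size reduces the
# per-process GPU memory footprint of each ensemble member.
train_param = {
    "num_images_per_batch": 1,   # was 2 in the failing run
    "num_epochs": 2,
    "num_epochs_per_validation": 1,
    "num_warmup_epochs": 1,
    "use_pretrain": False,
    "pretrained_path": "",
}

# Flattened into CLI-style overrides the way the launcher renders them.
cli_overrides = [f"--{k}={v}" for k, v in train_param.items()]
```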
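Separately, allocator behavior can be tuned at the environment level before the test runs. A sketch using standard PyTorch/CUDA environment knobs (the values are illustrative; note that limiting visible devices would also require dropping `--nproc_per_node` to 1):

```shell
# Cap the CUDA caching allocator's split size so large cached blocks are
# released sooner; optionally restrict the run to a single visible GPU.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
export CUDA_VISIBLE_DEVICES=0
```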
