Auto3dseg does not support PyTorch 2.6+ #411
Description
root@yunliu-MS-7D31:/workspace/Code/MONAI# python -m unittest tests/integration/test_auto3dseg_ensemble.py
2025-03-17 11:37:24,327 - INFO - Found 2 GPUs for data analyzing!
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:05<00:00, 1.04it/s]
2025-03-17 11:37:38,226 - INFO - Writing data stats to /tmp/tmpdxux7zb8/workdir/datastats.yaml.
2025-03-17 11:37:38,230 - INFO - Writing by-case data stats to /tmp/tmpdxux7zb8/workdir/datastats_by_case.yaml, this may take a while.
2025-03-17 11:37:38,256 - INFO - BundleGen from https://github.com/Project-MONAI/research-contributions/releases/download/algo_templates/c970bdf.tar.gz
algo_templates.tar.gz: 104kB [00:01, 66.5kB/s]
2025-03-17 11:37:41,308 - INFO - Downloaded: /tmp/tmpf52e075l/algo_templates.tar.gz
2025-03-17 11:37:41,308 - INFO - Expected md5 is None, skip md5 check for file /tmp/tmpf52e075l/algo_templates.tar.gz.
2025-03-17 11:37:41,308 - INFO - Writing into directory: /tmp/tmpdxux7zb8/workdir.
2025-03-17 11:37:41,415 - INFO - Generated:/tmp/tmpdxux7zb8/workdir/dints_0
2025-03-17 11:37:41,443 - INFO - Generated:/tmp/tmpdxux7zb8/workdir/segresnet_0
2025-03-17 11:37:41,456 - INFO - segresnet2d_0 is skipped! SegresNet2D is skipped due to median spacing of [1.0, 1.0, 1.0], which means the dataset is not highly anisotropic, e.g. spacing[2] < 3*(spacing[0] + spacing[1])/2) .
2025-03-17 11:37:41,573 - INFO - Generated:/tmp/tmpdxux7zb8/workdir/swinunetr_0
2025-03-17 11:37:41,577 - INFO - The keys num_warmup_epochs cannot be found in the /tmp/tmpdxux7zb8/workdir/dints_0/configs/hyper_parameters.yaml for training. Skipped overriding key num_warmup_epochs.
2025-03-17 11:37:41,577 - INFO - The keys use_pretrain cannot be found in the /tmp/tmpdxux7zb8/workdir/dints_0/configs/hyper_parameters.yaml for training. Skipped overriding key use_pretrain.
2025-03-17 11:37:41,577 - INFO - The keys pretrained_path cannot be found in the /tmp/tmpdxux7zb8/workdir/dints_0/configs/hyper_parameters.yaml for training. Skipped overriding key pretrained_path.
2025-03-17 11:37:41,577 - INFO - The keys determ cannot be found in the /tmp/tmpdxux7zb8/workdir/dints_0/configs/hyper_parameters.yaml for training. Skipped overriding key determ.
2025-03-17 11:37:41,578 - INFO - ['torchrun', '--nnodes', '1', '--nproc_per_node', '2', '/tmp/tmpdxux7zb8/workdir/dints_0/scripts/train.py', 'run', "--config_file='/tmp/tmpdxux7zb8/workdir/dints_0/configs/hyper_parameters.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/hyper_parameters_search.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/network.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/network_search.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/transforms_infer.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/transforms_train.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/transforms_validate.yaml'", '--training#num_images_per_batch=2', '--training#num_epochs=2', '--training#num_epochs_per_validation=1']
W0317 11:37:42.338000 5601 torch/distributed/run.py:792]
W0317 11:37:42.338000 5601 torch/distributed/run.py:792] *****************************************
W0317 11:37:42.338000 5601 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0317 11:37:42.338000 5601 torch/distributed/run.py:792] *****************************************
[rank1]:[W317 11:37:46.090068696 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank0]:[W317 11:37:46.240209156 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank0]: Traceback (most recent call last):
[rank0]: File "/opt/monai/monai/bundle/config_item.py", line 374, in evaluate
[rank0]: return eval(value[len(self.prefix) :], globals_, locals)
[rank0]: File "<string>", line 1, in <module>
[rank0]: File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1494, in load
[rank0]: raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
[rank0]: _pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
[rank0]: (1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
[rank0]: (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
[rank0]: WeightsUnpickler error: Unsupported global: GLOBAL numpy.core.multiarray._reconstruct was not an allowed global by default. Please use `torch.serialization.add_safe_globals([_reconstruct])` or the `torch.serialization.safe_globals([_reconstruct])` context manager to allowlist this global if you trust this class/function.
[rank0]: Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]: File "/tmp/tmpdxux7zb8/workdir/dints_0/scripts/train.py", line 1001, in <module>
[rank0]: fire.Fire()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 135, in Fire
[rank0]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 468, in _Fire
[rank0]: component, remaining_args = _CallAndUpdateTrace(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank0]: component = fn(*varargs, **kwargs)
[rank0]: File "/tmp/tmpdxux7zb8/workdir/dints_0/scripts/train.py", line 416, in run
[rank0]: model = parser.get_parsed_content("training_network#network")
[rank0]: File "/opt/monai/monai/bundle/config_parser.py", line 290, in get_parsed_content
[rank0]: return self.ref_resolver.get_resolved_content(id=id, **kwargs)
[rank0]: File "/opt/monai/monai/bundle/reference_resolver.py", line 193, in get_resolved_content
[rank0]: return self._resolve_one_item(id=id, **kwargs)
[rank0]: File "/opt/monai/monai/bundle/reference_resolver.py", line 163, in _resolve_one_item
[rank0]: self._resolve_one_item(id=d, waiting_list=waiting_list, **kwargs)
[rank0]: File "/opt/monai/monai/bundle/reference_resolver.py", line 163, in _resolve_one_item
[rank0]: self._resolve_one_item(id=d, waiting_list=waiting_list, **kwargs)
[rank0]: File "/opt/monai/monai/bundle/reference_resolver.py", line 163, in _resolve_one_item
[rank0]: self._resolve_one_item(id=d, waiting_list=waiting_list, **kwargs)
[rank0]: [Previous line repeated 1 more time]
[rank0]: File "/opt/monai/monai/bundle/reference_resolver.py", line 175, in _resolve_one_item
[rank0]: item.evaluate(globals={f"{self._vars}": self.resolved_content}) if run_eval else item
[rank0]: File "/opt/monai/monai/bundle/config_item.py", line 376, in evaluate
[rank0]: raise RuntimeError(f"Failed to evaluate {self}") from e
[rank0]: RuntimeError: Failed to evaluate ConfigExpression:
[rank0]: ("$torch.load(__local_refs['training_network::arch_ckpt_path'], "
[rank0]: "map_location=torch.device('cuda'))")
[rank1]: Traceback (most recent call last):
[rank1]: File "/opt/monai/monai/bundle/config_item.py", line 374, in evaluate
[rank1]: return eval(value[len(self.prefix) :], globals_, locals)
[rank1]: File "<string>", line 1, in <module>
[rank1]: File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1494, in load
[rank1]: raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
[rank1]: _pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
[rank1]: (1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
[rank1]: (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
[rank1]: WeightsUnpickler error: Unsupported global: GLOBAL numpy.core.multiarray._reconstruct was not an allowed global by default. Please use `torch.serialization.add_safe_globals([_reconstruct])` or the `torch.serialization.safe_globals([_reconstruct])` context manager to allowlist this global if you trust this class/function.
[rank1]: Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
[rank1]: The above exception was the direct cause of the following exception:
[rank1]: Traceback (most recent call last):
[rank1]: File "/tmp/tmpdxux7zb8/workdir/dints_0/scripts/train.py", line 1001, in <module>
[rank1]: fire.Fire()
[rank1]: File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 135, in Fire
[rank1]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank1]: File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 468, in _Fire
[rank1]: component, remaining_args = _CallAndUpdateTrace(
[rank1]: File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank1]: component = fn(*varargs, **kwargs)
[rank1]: File "/tmp/tmpdxux7zb8/workdir/dints_0/scripts/train.py", line 416, in run
[rank1]: model = parser.get_parsed_content("training_network#network")
[rank1]: File "/opt/monai/monai/bundle/config_parser.py", line 290, in get_parsed_content
[rank1]: return self.ref_resolver.get_resolved_content(id=id, **kwargs)
[rank1]: File "/opt/monai/monai/bundle/reference_resolver.py", line 193, in get_resolved_content
[rank1]: return self._resolve_one_item(id=id, **kwargs)
[rank1]: File "/opt/monai/monai/bundle/reference_resolver.py", line 163, in _resolve_one_item
[rank1]: self._resolve_one_item(id=d, waiting_list=waiting_list, **kwargs)
[rank1]: File "/opt/monai/monai/bundle/reference_resolver.py", line 163, in _resolve_one_item
[rank1]: self._resolve_one_item(id=d, waiting_list=waiting_list, **kwargs)
[rank1]: File "/opt/monai/monai/bundle/reference_resolver.py", line 163, in _resolve_one_item
[rank1]: self._resolve_one_item(id=d, waiting_list=waiting_list, **kwargs)
[rank1]: [Previous line repeated 1 more time]
[rank1]: File "/opt/monai/monai/bundle/reference_resolver.py", line 175, in _resolve_one_item
[rank1]: item.evaluate(globals={f"{self._vars}": self.resolved_content}) if run_eval else item
[rank1]: File "/opt/monai/monai/bundle/config_item.py", line 376, in evaluate
[rank1]: raise RuntimeError(f"Failed to evaluate {self}") from e
[rank1]: RuntimeError: Failed to evaluate ConfigExpression:
[rank1]: ("$torch.load(__local_refs['training_network::arch_ckpt_path'], "
[rank1]: "map_location=torch.device('cuda'))")
[rank0]:[W317 11:37:47.905422862 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0317 11:37:48.559000 5601 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 5622) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/tmp/tmpdxux7zb8/workdir/dints_0/scripts/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2025-03-17_11:37:48
host : yunliu-MS-7D31
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 5623)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-03-17_11:37:48
host : yunliu-MS-7D31
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 5622)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
E2025-03-17 11:37:48,770 - INFO - Found 2 GPUs for data analyzing!
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:05<00:00, 1.05it/s]
2025-03-17 11:38:02,620 - INFO - Writing data stats to /tmp/tmp1wouaxqm/workdir/datastats.yaml.
2025-03-17 11:38:02,625 - INFO - Writing by-case data stats to /tmp/tmp1wouaxqm/workdir/datastats_by_case.yaml, this may take a while.
.
======================================================================
ERROR: test_ensemble (tests.integration.test_auto3dseg_ensemble.TestEnsembleBuilder)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/workspace/Code/MONAI/monai/utils/misc.py", line 892, in run_cmd
return subprocess.run(cmd_list, **kwargs)
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['torchrun', '--nnodes', '1', '--nproc_per_node', '2', '/tmp/tmpdxux7zb8/workdir/dints_0/scripts/train.py', 'run', "--config_file='/tmp/tmpdxux7zb8/workdir/dints_0/configs/hyper_parameters.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/hyper_parameters_search.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/network.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/network_search.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/transforms_infer.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/transforms_train.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/transforms_validate.yaml'", '--training#num_images_per_batch=2', '--training#num_epochs=2', '--training#num_epochs_per_validation=1']' returned non-zero exit status 1.
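The root cause is the PyTorch 2.6 change of the `torch.load` default from `weights_only=False` to `weights_only=True`: the DiNTS architecture checkpoint loaded by the `$torch.load(...)` config expression contains numpy objects, which the new weights-only unpickler rejects. Below is a minimal sketch of the failure mode and the two workarounds the error message suggests. The checkpoint here is a hypothetical stand-in for the real `arch_ckpt_path` file, not the actual DiNTS artifact:

```python
import os
import tempfile

import numpy as np
import torch

# Hypothetical stand-in for the DiNTS arch checkpoint: a dict containing
# numpy objects, which trips PyTorch 2.6's weights_only=True default.
ckpt = {"arch_code_a": np.array([[0, 1], [1, 0]])}
path = os.path.join(tempfile.mkdtemp(), "arch.pt")
torch.save(ckpt, path)

# Option 1 (only for checkpoints from a trusted source): restore the
# pre-2.6 behaviour explicitly.
loaded = torch.load(path, map_location="cpu", weights_only=False)

# Option 2 (newer PyTorch releases): keep weights_only=True but allowlist
# the numpy helpers named in the error message. Depending on the torch and
# numpy versions, additional types may need allowlisting, so this path is
# guarded and falls back to the Option 1 result on failure.
try:
    from numpy.core.multiarray import _reconstruct  # numpy 1.x location

    with torch.serialization.safe_globals([_reconstruct, np.ndarray, np.dtype]):
        loaded = torch.load(path, map_location="cpu", weights_only=True)
except Exception as exc:
    print(f"safe_globals path not usable here: {exc}")

print(loaded["arch_code_a"].tolist())
```

For Auto3dseg itself, the equivalent fix would be passing `weights_only=False` (or a `safe_globals` allowlist) where the generated bundle evaluates `$torch.load(__local_refs['training_network::arch_ckpt_path'], ...)`, since that checkpoint is produced locally by the search step and is trusted.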