
Auto3dseg does not support PyTorch 2.6+ #411

@KumoLiu

Description

Running the Auto3dseg ensemble integration test on PyTorch 2.6 fails: the DiNTS template loads its architecture checkpoint via `torch.load`, which now defaults to `weights_only=True` and rejects the pickled `numpy.core.multiarray._reconstruct` global. Full log:
root@yunliu-MS-7D31:/workspace/Code/MONAI# python -m unittest tests/integration/test_auto3dseg_ensemble.py 
2025-03-17 11:37:24,327 - INFO - Found 2 GPUs for data analyzing!
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:05<00:00,  1.04it/s]
2025-03-17 11:37:38,226 - INFO - Writing data stats to /tmp/tmpdxux7zb8/workdir/datastats.yaml.
2025-03-17 11:37:38,230 - INFO - Writing by-case data stats to /tmp/tmpdxux7zb8/workdir/datastats_by_case.yaml, this may take a while.
2025-03-17 11:37:38,256 - INFO - BundleGen from https://github.com/Project-MONAI/research-contributions/releases/download/algo_templates/c970bdf.tar.gz
algo_templates.tar.gz: 104kB [00:01, 66.5kB/s]                                                                                      
2025-03-17 11:37:41,308 - INFO - Downloaded: /tmp/tmpf52e075l/algo_templates.tar.gz
2025-03-17 11:37:41,308 - INFO - Expected md5 is None, skip md5 check for file /tmp/tmpf52e075l/algo_templates.tar.gz.
2025-03-17 11:37:41,308 - INFO - Writing into directory: /tmp/tmpdxux7zb8/workdir.
2025-03-17 11:37:41,415 - INFO - Generated:/tmp/tmpdxux7zb8/workdir/dints_0
2025-03-17 11:37:41,443 - INFO - Generated:/tmp/tmpdxux7zb8/workdir/segresnet_0
2025-03-17 11:37:41,456 - INFO - segresnet2d_0 is skipped! SegresNet2D is skipped due to median spacing of [1.0, 1.0, 1.0],  which means the dataset is not highly anisotropic, e.g. spacing[2] < 3*(spacing[0] + spacing[1])/2) .
2025-03-17 11:37:41,573 - INFO - Generated:/tmp/tmpdxux7zb8/workdir/swinunetr_0
2025-03-17 11:37:41,577 - INFO - The keys num_warmup_epochs cannot be found in the /tmp/tmpdxux7zb8/workdir/dints_0/configs/hyper_parameters.yaml for training. Skipped overriding key num_warmup_epochs.
2025-03-17 11:37:41,577 - INFO - The keys use_pretrain cannot be found in the /tmp/tmpdxux7zb8/workdir/dints_0/configs/hyper_parameters.yaml for training. Skipped overriding key use_pretrain.
2025-03-17 11:37:41,577 - INFO - The keys pretrained_path cannot be found in the /tmp/tmpdxux7zb8/workdir/dints_0/configs/hyper_parameters.yaml for training. Skipped overriding key pretrained_path.
2025-03-17 11:37:41,577 - INFO - The keys determ cannot be found in the /tmp/tmpdxux7zb8/workdir/dints_0/configs/hyper_parameters.yaml for training. Skipped overriding key determ.
2025-03-17 11:37:41,578 - INFO - ['torchrun', '--nnodes', '1', '--nproc_per_node', '2', '/tmp/tmpdxux7zb8/workdir/dints_0/scripts/train.py', 'run', "--config_file='/tmp/tmpdxux7zb8/workdir/dints_0/configs/hyper_parameters.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/hyper_parameters_search.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/network.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/network_search.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/transforms_infer.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/transforms_train.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/transforms_validate.yaml'", '--training#num_images_per_batch=2', '--training#num_epochs=2', '--training#num_epochs_per_validation=1']
W0317 11:37:42.338000 5601 torch/distributed/run.py:792] 
W0317 11:37:42.338000 5601 torch/distributed/run.py:792] *****************************************
W0317 11:37:42.338000 5601 torch/distributed/run.py:792] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0317 11:37:42.338000 5601 torch/distributed/run.py:792] *****************************************
[rank1]:[W317 11:37:46.090068696 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank0]:[W317 11:37:46.240209156 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device, or call init_process_group() with a device_id.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/opt/monai/monai/bundle/config_item.py", line 374, in evaluate
[rank0]:     return eval(value[len(self.prefix) :], globals_, locals)
[rank0]:   File "<string>", line 1, in <module>
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1494, in load
[rank0]:     raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
[rank0]: _pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint. 
[rank0]:        (1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
[rank0]:        (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
[rank0]:        WeightsUnpickler error: Unsupported global: GLOBAL numpy.core.multiarray._reconstruct was not an allowed global by default. Please use `torch.serialization.add_safe_globals([_reconstruct])` or the `torch.serialization.safe_globals([_reconstruct])` context manager to allowlist this global if you trust this class/function.

[rank0]: Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/tmp/tmpdxux7zb8/workdir/dints_0/scripts/train.py", line 1001, in <module>
[rank0]:     fire.Fire()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 135, in Fire
[rank0]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 468, in _Fire
[rank0]:     component, remaining_args = _CallAndUpdateTrace(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank0]:     component = fn(*varargs, **kwargs)
[rank0]:   File "/tmp/tmpdxux7zb8/workdir/dints_0/scripts/train.py", line 416, in run
[rank0]:     model = parser.get_parsed_content("training_network#network")
[rank0]:   File "/opt/monai/monai/bundle/config_parser.py", line 290, in get_parsed_content
[rank0]:     return self.ref_resolver.get_resolved_content(id=id, **kwargs)
[rank0]:   File "/opt/monai/monai/bundle/reference_resolver.py", line 193, in get_resolved_content
[rank0]:     return self._resolve_one_item(id=id, **kwargs)
[rank0]:   File "/opt/monai/monai/bundle/reference_resolver.py", line 163, in _resolve_one_item
[rank0]:     self._resolve_one_item(id=d, waiting_list=waiting_list, **kwargs)
[rank0]:   File "/opt/monai/monai/bundle/reference_resolver.py", line 163, in _resolve_one_item
[rank0]:     self._resolve_one_item(id=d, waiting_list=waiting_list, **kwargs)
[rank0]:   File "/opt/monai/monai/bundle/reference_resolver.py", line 163, in _resolve_one_item
[rank0]:     self._resolve_one_item(id=d, waiting_list=waiting_list, **kwargs)
[rank0]:   [Previous line repeated 1 more time]
[rank0]:   File "/opt/monai/monai/bundle/reference_resolver.py", line 175, in _resolve_one_item
[rank0]:     item.evaluate(globals={f"{self._vars}": self.resolved_content}) if run_eval else item
[rank0]:   File "/opt/monai/monai/bundle/config_item.py", line 376, in evaluate
[rank0]:     raise RuntimeError(f"Failed to evaluate {self}") from e
[rank0]: RuntimeError: Failed to evaluate ConfigExpression: 
[rank0]: ("$torch.load(__local_refs['training_network::arch_ckpt_path'], "
[rank0]:  "map_location=torch.device('cuda'))")
[rank1]: Traceback (most recent call last):
[rank1]:   File "/opt/monai/monai/bundle/config_item.py", line 374, in evaluate
[rank1]:     return eval(value[len(self.prefix) :], globals_, locals)
[rank1]:   File "<string>", line 1, in <module>
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1494, in load
[rank1]:     raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
[rank1]: _pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint. 
[rank1]:        (1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
[rank1]:        (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
[rank1]:        WeightsUnpickler error: Unsupported global: GLOBAL numpy.core.multiarray._reconstruct was not an allowed global by default. Please use `torch.serialization.add_safe_globals([_reconstruct])` or the `torch.serialization.safe_globals([_reconstruct])` context manager to allowlist this global if you trust this class/function.

[rank1]: Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.

[rank1]: The above exception was the direct cause of the following exception:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/tmp/tmpdxux7zb8/workdir/dints_0/scripts/train.py", line 1001, in <module>
[rank1]:     fire.Fire()
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 135, in Fire
[rank1]:     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 468, in _Fire
[rank1]:     component, remaining_args = _CallAndUpdateTrace(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank1]:     component = fn(*varargs, **kwargs)
[rank1]:   File "/tmp/tmpdxux7zb8/workdir/dints_0/scripts/train.py", line 416, in run
[rank1]:     model = parser.get_parsed_content("training_network#network")
[rank1]:   File "/opt/monai/monai/bundle/config_parser.py", line 290, in get_parsed_content
[rank1]:     return self.ref_resolver.get_resolved_content(id=id, **kwargs)
[rank1]:   File "/opt/monai/monai/bundle/reference_resolver.py", line 193, in get_resolved_content
[rank1]:     return self._resolve_one_item(id=id, **kwargs)
[rank1]:   File "/opt/monai/monai/bundle/reference_resolver.py", line 163, in _resolve_one_item
[rank1]:     self._resolve_one_item(id=d, waiting_list=waiting_list, **kwargs)
[rank1]:   File "/opt/monai/monai/bundle/reference_resolver.py", line 163, in _resolve_one_item
[rank1]:     self._resolve_one_item(id=d, waiting_list=waiting_list, **kwargs)
[rank1]:   File "/opt/monai/monai/bundle/reference_resolver.py", line 163, in _resolve_one_item
[rank1]:     self._resolve_one_item(id=d, waiting_list=waiting_list, **kwargs)
[rank1]:   [Previous line repeated 1 more time]
[rank1]:   File "/opt/monai/monai/bundle/reference_resolver.py", line 175, in _resolve_one_item
[rank1]:     item.evaluate(globals={f"{self._vars}": self.resolved_content}) if run_eval else item
[rank1]:   File "/opt/monai/monai/bundle/config_item.py", line 376, in evaluate
[rank1]:     raise RuntimeError(f"Failed to evaluate {self}") from e
[rank1]: RuntimeError: Failed to evaluate ConfigExpression: 
[rank1]: ("$torch.load(__local_refs['training_network::arch_ckpt_path'], "
[rank1]:  "map_location=torch.device('cuda'))")
[rank0]:[W317 11:37:47.905422862 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0317 11:37:48.559000 5601 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 5622) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/tmp/tmpdxux7zb8/workdir/dints_0/scripts/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-03-17_11:37:48
  host      : yunliu-MS-7D31
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 5623)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-03-17_11:37:48
  host      : yunliu-MS-7D31
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 5622)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
E2025-03-17 11:37:48,770 - INFO - Found 2 GPUs for data analyzing!
100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:05<00:00,  1.05it/s]
2025-03-17 11:38:02,620 - INFO - Writing data stats to /tmp/tmp1wouaxqm/workdir/datastats.yaml.
2025-03-17 11:38:02,625 - INFO - Writing by-case data stats to /tmp/tmp1wouaxqm/workdir/datastats_by_case.yaml, this may take a while.
.
======================================================================
ERROR: test_ensemble (tests.integration.test_auto3dseg_ensemble.TestEnsembleBuilder)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/workspace/Code/MONAI/monai/utils/misc.py", line 892, in run_cmd
    return subprocess.run(cmd_list, **kwargs)
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['torchrun', '--nnodes', '1', '--nproc_per_node', '2', '/tmp/tmpdxux7zb8/workdir/dints_0/scripts/train.py', 'run', "--config_file='/tmp/tmpdxux7zb8/workdir/dints_0/configs/hyper_parameters.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/hyper_parameters_search.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/network.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/network_search.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/transforms_infer.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/transforms_train.yaml,/tmp/tmpdxux7zb8/workdir/dints_0/configs/transforms_validate.yaml'", '--training#num_images_per_batch=2', '--training#num_epochs=2', '--training#num_epochs_per_validation=1']' returned non-zero exit status 1.
