
Initial version for multinode auto_runner and ensembler #6272

Merged: wyli merged 25 commits into Project-MONAI:dev from heyufan1995:multinode on Apr 14, 2023
Conversation

@heyufan1995
Member

@heyufan1995 heyufan1995 commented Apr 3, 2023

Fixes #6191.
Fixes #6259.

Description

Major changes to AutoRunner to enable multi-node training and a multi-node, multi-GPU ensembler:

  1. Added set_device_info() to create a self.device_dict that defines device information (CUDA_VISIBLE_DEVICES, NUM_NODE, etc.) for all parts of AutoRunner, including the data analyzer, trainer, and ensembler. No global environment variables are set; all device info comes from self.device_dict. Corresponding changes were made to BundleGen.
  2. To enable multi-GPU/multi-node training for the ensembler (called from a subprocess), the ensembler needs to be separated from AutoRunner so that a subprocess launched by AutoRunner can run it. Created a new EnsembleRunner class (similar to BundleGen) and moved all ensemble-related functions from AutoRunner into this class. Local multi-GPU ensembling passed.
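The device_dict idea in item 1 can be sketched as follows. This is an illustrative mock, not the actual MONAI API: the function name set_device_info and the dict keys are taken from the description above, but the signature and defaults are assumptions.

```python
import os

def set_device_info(cuda_visible_devices=None, num_nodes=1):
    """Collect device settings into one dict instead of mutating global env vars.

    Illustrative sketch only; the real AutoRunner method may differ.
    """
    if cuda_visible_devices is None:
        # fall back to the inherited environment if the caller gave nothing
        cuda_visible_devices = os.environ.get("CUDA_VISIBLE_DEVICES", "0")
    return {
        "CUDA_VISIBLE_DEVICES": str(cuda_visible_devices),
        "NUM_NODE": int(num_nodes),
    }

# every component (data analyzer, trainer, ensembler) reads from this dict,
# so no global CUDA_VISIBLE_DEVICES is set by AutoRunner itself
device_dict = set_device_info(cuda_visible_devices="0,1", num_nodes=2)
print(device_dict["NUM_NODE"])  # 2
```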

Passed some quick local testing. Details still need fixing and tests still need to be added. Created this PR for an initial design-pattern discussion. Slack me if there is any major concern about the change.
@mingxin-zheng @wyli

@heyufan1995 heyufan1995 self-assigned this Apr 3, 2023
@mingxin-zheng
Contributor

mingxin-zheng commented Apr 4, 2023

Hi @heyufan1995 , I have a question that confuses me quite a bit. What benefit do we expect from using multiple nodes to run the ensemble, given that the ensemble execution is basically a pair of for-loops as below:

for file in files:
    for algo in algos:
        pred = algo[some_key].predict(...)

@heyufan1995
Member Author

Hi @heyufan1995 , I have a question that confuses me quite a bit. What benefit do we expect from using multiple nodes to run the ensemble, given that the ensemble execution is basically a pair of for-loops as below:

for file in files:
    for algo in algos:
        pred = algo[some_key].predict(...)

@mingxin-zheng The files are partitioned across all nodes and GPUs. So if you have 16 files and 2 nodes with 16 GPUs in total, each GPU will sequentially run the 5-fold models on 1 file. I also think the for loop should be changed to

for algo in algos:
    for file in files:
        pred = algo[some_key].predict(...)

Since loading model weights can be slow.
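The two points above (partitioning files across ranks, and moving algos to the outer loop so each set of model weights is loaded only once per rank) can be sketched like this. The function names and the strided partitioning scheme are illustrative assumptions, not MONAI code.

```python
def partition(files, world_size, rank):
    """Give each rank a strided share of the files, e.g. 16 files / 16 ranks -> 1 each."""
    return files[rank::world_size]

def run_ensemble(files, algos, world_size, rank, predict):
    """Algo-outer loop: each algo's weights are loaded once, then applied to
    every file owned by this rank. `predict` stands in for algo inference."""
    my_files = partition(files, world_size, rank)
    preds = {}
    for algo in algos:          # outer: load weights once per algo
        for f in my_files:      # inner: this rank's files only
            preds.setdefault(f, []).append(predict(algo, f))
    return preds
```

With 16 files and a world size of 16, `partition(files, 16, rank)` hands each GPU exactly one file, matching the example above.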

@mingxin-zheng
Contributor

mingxin-zheng commented Apr 4, 2023

Hi @heyufan1995 , thanks for the explanation. Your proposal also makes sense. One possible concern I had was that memory would need to hold (n_algos x n_files) predictions before the ensemble takes place, which is why I went one file at a time. We may need to be careful and do something when there are lots of files to infer.

By the way, infer_instance.predict supports multiple files in one batch in the predict_files argument.

@heyufan1995
Member Author

Hi @heyufan1995 , thanks for the explanation. Your proposal also makes sense. One possible concern I had was that memory would need to hold (n_algos x n_files) predictions before the ensemble takes place, which is why I went one file at a time. We may need to be careful and do something when there are lots of files to infer.

By the way, infer_instance.predict supports multiple files in one batch in the predict_files argument.

Thanks, we can keep the current for loop. It's safer.
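The memory trade-off that motivates keeping the file-outer loop can be sketched as below. This is illustrative, not MONAI code: `predict` and `combine` are hypothetical stand-ins for model inference and the ensemble reduction.

```python
def ensemble_per_file(files, algos, predict, combine):
    """File-outer loop: at most n_algos predictions live in memory at once,
    instead of n_algos x n_files with the algo-outer ordering."""
    results = []
    for f in files:                                 # one file at a time
        preds = [predict(algo, f) for algo in algos]  # n_algos preds held here
        results.append(combine(preds))              # reduce, then preds is freed
    return results
```

The cost, as discussed above, is that each algo's model weights may be reloaded per file; the benefit is a bounded prediction footprint when there are many files.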

Signed-off-by: heyufan1995 <[email protected]>
@wyli
Contributor

wyli commented Apr 11, 2023

/black

@mingxin-zheng
Contributor

Thanks @heyufan1995 for the PR. The design looks good to me, as it minimizes the number of breaking changes. There are some format issues and test failures to fix. CC @wyli for visibility.

@wyli
Contributor

wyli commented Apr 14, 2023

/integration-test

wyli added 3 commits April 14, 2023 05:55
Signed-off-by: Wenqi Li <[email protected]>
Signed-off-by: Wenqi Li <[email protected]>
Signed-off-by: Wenqi Li <[email protected]>
@wyli
Contributor

wyli commented Apr 14, 2023

/build

Contributor

@wyli wyli left a comment


@wyli wyli enabled auto-merge (squash) April 14, 2023 11:31
@wyli wyli merged commit 825b8db into Project-MONAI:dev Apr 14, 2023
Successfully merging this pull request may close these issues: "autorunner support multinode multigpu devices"; "Enable Multi-node training".