
Initial version for multinode auto_runner and ensembler #6272

Merged: wyli merged 25 commits into Project-MONAI:dev from heyufan1995:multinode on Apr 14, 2023
Conversation

@heyufan1995
Member

@heyufan1995 heyufan1995 commented Apr 3, 2023

Fixes #6191.
Fixes #6259.

Description

Major changes to AutoRunner to enable multi-node training and a multi-node, multi-GPU ensembler:

  1. Added set_device_info() to create a self.device_dict that defines device information (CUDA_VISIBLE_DEVICES, NUM_NODE, etc.) for all parts of AutoRunner, including the data analyzer, trainer, and ensembler. No global environment variables are set; all device info comes from self.device_dict. Corresponding changes were made to BundleGen.
  2. To enable multi-GPU/multi-node training for the ensembler (called from a subprocess), the ensembler needs to be separated from AutoRunner so that a subprocess launched by AutoRunner can run it. Created a new EnsembleRunner class (similar to BundleGen) and moved all ensemble-related functions from AutoRunner into this class. Local multi-GPU ensembling passed.
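The device_dict idea in item 1 can be sketched as follows. This is an illustrative mock, not the actual MONAI API: the function name set_device_info and the dict keys are taken from the description above, but the signature and defaults are assumptions.

```python
import os

def set_device_info(cuda_visible_devices=None, num_nodes=1):
    """Collect device settings into one dict instead of mutating global env vars.

    Illustrative sketch only; the real AutoRunner method may differ.
    """
    if cuda_visible_devices is None:
        # fall back to the inherited environment if the caller gave nothing
        cuda_visible_devices = os.environ.get("CUDA_VISIBLE_DEVICES", "0")
    return {
        "CUDA_VISIBLE_DEVICES": str(cuda_visible_devices),
        "NUM_NODE": int(num_nodes),
    }

# every component (data analyzer, trainer, ensembler) reads from this dict,
# so no global CUDA_VISIBLE_DEVICES is set by AutoRunner itself
device_dict = set_device_info(cuda_visible_devices="0,1", num_nodes=2)
print(device_dict["NUM_NODE"])  # 2
```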

Passed some quick local testing. Details still need fixing and tests still need to be added. Created this PR for an initial design-pattern discussion. Slack me if there is any major concern about the change.
@mingxin-zheng @wyli

@heyufan1995 heyufan1995 self-assigned this Apr 3, 2023
@mingxin-zheng
Contributor

mingxin-zheng commented Apr 4, 2023

Hi @heyufan1995 , I have a question that confuses me quite a bit. What benefit do we expect from using multiple nodes to run the ensemble, given that the ensemble execution is basically a pair of for-loops as below:

for file in files:
    for algo in algos:
        pred = algo[some_key].predict(...)

@heyufan1995
Member Author

Hi @heyufan1995 , I have a question that confuses me quite a bit. What benefit do we expect from using multiple nodes to run the ensemble, given that the ensemble execution is basically a pair of for-loops as below:

for file in files:
    for algo in algos:
        pred = algo[some_key].predict(...)

@mingxin-zheng The files are partitioned across all nodes and GPUs. So if you have 16 files and 2 nodes with 16 GPUs in total, each GPU will sequentially run the 5-fold models on 1 file. I also think the for loop should be changed to

for algo in algos:
    for file in files:
        pred = algo[some_key].predict(...)

Since loading model weights can be slow.
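The two points above (partitioning files across ranks, and moving algos to the outer loop so each set of model weights is loaded only once per rank) can be sketched like this. The function names and the strided partitioning scheme are illustrative assumptions, not MONAI code.

```python
def partition(files, world_size, rank):
    """Give each rank a strided share of the files, e.g. 16 files / 16 ranks -> 1 each."""
    return files[rank::world_size]

def run_ensemble(files, algos, world_size, rank, predict):
    """Algo-outer loop: each algo's weights are loaded once, then applied to
    every file owned by this rank. `predict` stands in for algo inference."""
    my_files = partition(files, world_size, rank)
    preds = {}
    for algo in algos:          # outer: load weights once per algo
        for f in my_files:      # inner: this rank's files only
            preds.setdefault(f, []).append(predict(algo, f))
    return preds
```

With 16 files and a world size of 16, `partition(files, 16, rank)` hands each GPU exactly one file, matching the example above.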

@mingxin-zheng
Contributor

mingxin-zheng commented Apr 4, 2023

Hi @heyufan1995 , thanks for the explanation. Your proposal also makes sense. One possible concern I had was that memory would need to hold (n_algos x n_files) predictions before the ensemble takes place, which is why I went one file at a time. We may need to be careful and do something when there are lots of files to infer.

By the way, infer_instance.predict supports multiple files in one batch in the predict_files argument.

@heyufan1995
Member Author

Hi @heyufan1995 , thanks for the explanation. Your proposal also makes sense. One possible concern I had was that memory would need to hold (n_algos x n_files) predictions before the ensemble takes place, which is why I went one file at a time. We may need to be careful and do something when there are lots of files to infer.

By the way, infer_instance.predict supports multiple files in one batch in the predict_files argument.

Thanks, we can keep the current for loop. It's safer.
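The memory trade-off that motivates keeping the file-outer loop can be sketched as below. This is illustrative, not MONAI code: `predict` and `combine` are hypothetical stand-ins for model inference and the ensemble reduction.

```python
def ensemble_per_file(files, algos, predict, combine):
    """File-outer loop: at most n_algos predictions live in memory at once,
    instead of n_algos x n_files with the algo-outer ordering."""
    results = []
    for f in files:                                 # one file at a time
        preds = [predict(algo, f) for algo in algos]  # n_algos preds held here
        results.append(combine(preds))              # reduce, then preds is freed
    return results
```

The cost, as discussed above, is that each algo's model weights may be reloaded per file; the benefit is a bounded prediction footprint when there are many files.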

Signed-off-by: heyufan1995 <[email protected]>
@wyli
Contributor

wyli commented Apr 11, 2023

/black

@mingxin-zheng
Contributor

Thanks @heyufan1995 for the PR. The design looks good to me, as it minimizes the number of breaking changes. There are some format issues and test failures to fix. CC @wyli for visibility.

@wyli
Contributor

wyli commented Apr 14, 2023

/integration-test

wyli added 3 commits April 14, 2023 05:55
Signed-off-by: Wenqi Li <[email protected]>
Signed-off-by: Wenqi Li <[email protected]>
Signed-off-by: Wenqi Li <[email protected]>
@wyli
Contributor

wyli commented Apr 14, 2023

/build

Contributor

@wyli wyli left a comment


@wyli wyli enabled auto-merge (squash) April 14, 2023 11:31
@wyli wyli merged commit 825b8db into Project-MONAI:dev Apr 14, 2023
Successfully merging this pull request may close these issues: "autorunner support multinode multigpu devices"; "Enable Multi-node training".