Initial version for multinode auto_runner and ensembler #6272
wyli merged 25 commits into Project-MONAI:dev
Conversation
Hi @heyufan1995, I have a question that confuses me quite a bit. What is the expected benefit of using multiple nodes to run the ensemble, given that the ensemble execution is basically for-loops as below:
@mingxin-zheng The files are partitioned across all nodes and GPUs. So if you have 16 files and 2 nodes with 16 GPUs, each GPU will sequentially run the 5 fold models on 1 file. I also think the for loop should be changed, since loading model weights can be slow.
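The partitioning described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: `partition_files` and the file names are hypothetical, and the 16 GPUs are assumed to be the total across both nodes (for example 2 nodes with 8 GPUs each).

```python
def partition_files(files, world_size):
    """Round-robin split: rank r receives files[r], files[r + world_size], ..."""
    return [files[rank::world_size] for rank in range(world_size)]


# Hypothetical inputs: 16 files, 16 ranks in total (assumed 2 nodes x 8 GPUs).
files = [f"case_{i:02d}.nii.gz" for i in range(16)]
shards = partition_files(files, world_size=16)

# Each rank then runs every fold model on its own shard only.
for rank, shard in enumerate(shards):
    print(rank, shard)
```

With these numbers every rank receives exactly one file, which matches the "5 fold models on 1 file" case in the comment.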
Hi @heyufan1995, thanks for the explanation. Your proposal also makes sense. One concern I had was that memory needs to hold (n_algos x n_files) predictions before the ensemble takes place, which is why I processed one file after another. We may need to be careful and handle the case where there are lots of files to infer. By the way,
Thanks, we can keep the current for loop. It's safer.
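The trade-off discussed in this thread can be made concrete with a small counting sketch. It is an illustration under assumptions, not code from the PR: `file_outer` and `algo_outer` are hypothetical names, the file-outer variant is assumed to reload weights for every (file, model) pair, and predictions are stood in for by placeholder tuples.

```python
def file_outer(files, algos):
    """One file at a time: ensemble after each file, then drop its predictions."""
    loads, peak = 0, 0
    for f in files:
        preds = []
        for a in algos:
            loads += 1               # assumed: weights are loaded for each pair
            preds.append((a, f))     # placeholder for a real prediction
            peak = max(peak, len(preds))
        # the ensemble for file f would run here; preds is discarded afterwards
    return loads, peak


def algo_outer(files, algos):
    """One model at a time: each model is loaded once, predictions accumulate."""
    loads = 0
    preds = {}
    for a in algos:
        loads += 1                   # weights loaded once per model
        for f in files:
            preds.setdefault(f, []).append((a, f))
    peak = sum(len(v) for v in preds.values())  # all held until the ensemble
    return loads, peak


files, algos = list(range(16)), list(range(5))
print(file_outer(files, algos))   # (weight loads, peak predictions held)
print(algo_outer(files, algos))
```

Under these assumptions the file-outer loop performs n_algos x n_files weight loads but holds at most n_algos predictions, while the model-outer loop performs n_algos loads but holds n_algos x n_files predictions, which is the memory concern raised above.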
Signed-off-by: heyufan1995 <[email protected]>
/black
Signed-off-by: heyufan1995 <[email protected]>
Thanks @heyufan1995 for the PR. The design looks good to me as we minimize the number of breaking changes. There are some formatting issues and test failures to fix. CC @wyli for visibility.
Signed-off-by: Wenqi Li <[email protected]>
/integration-test
Signed-off-by: Wenqi Li <[email protected]>
/build
wyli left a comment
Integration verified: https://github.com/Project-MONAI/MONAI/actions/runs/4699028053/jobs/8332016913. I'm merging this to unblock benchmarking tasks.
Fixes #6191
Fixes #6259
Description
Major changes to the autorunner to enable multi-node training and a multi-node, multi-GPU ensembler.
Multiple changes:
Passed some quick local testing. The details still need fixing and proper tests still need to be run. I created this PR to start an initial discussion of the design pattern. Slack me if there are any major concerns about the change.
@mingxin-zheng @wyli