Fix nvidia-smi query wrong gpus which fails MONAI integration test by mingxin-zheng · Pull Request #218 · Project-MONAI/research-contributions

mingxin-zheng · 2023-04-11T17:35:32Z

This PR aims to fix the integration issue.

The issue occurs when gpu:0 is not used, for example when the environment variable CUDA_VISIBLE_DEVICES=1,2 is set. In that case, CUDA in PyTorch finds only gpu:1 and gpu:2 and indexes them as cuda:0 and cuda:1 in the subprocess call. nvidia-smi, however, still sees all GPUs. As a result, when a GPU is selected, the one that nvidia-smi reports may not match the one PyTorch can see.
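To make the index mismatch concrete, here is a minimal sketch of the renumbering described above. The helper name `logical_to_physical` is hypothetical (it is not part of MONAI or PyTorch), and it assumes the variable contains only comma-separated integer indices, not GPU UUIDs.

```python
from typing import List, Optional


def logical_to_physical(cuda_visible_devices: Optional[str], total_gpus: int) -> List[int]:
    """Return the physical GPU index behind each logical cuda:N index.

    Hypothetical helper for illustration only. `cuda_visible_devices` is the
    value of CUDA_VISIBLE_DEVICES (None means the variable is unset), and
    `total_gpus` is the number of GPUs that nvidia-smi would list.
    """
    if cuda_visible_devices is None:
        # Without the variable, logical and physical indices coincide.
        return list(range(total_gpus))

    visible: List[int] = []
    for token in cuda_visible_devices.split(","):
        try:
            index = int(token.strip())
        except ValueError:
            break  # an invalid entry hides itself and everything after it
        if not 0 <= index < total_gpus:
            break
        visible.append(index)
    return visible


# With CUDA_VISIBLE_DEVICES=1,2 on a 4-GPU machine, cuda:0 is physical gpu:1
# and cuda:1 is physical gpu:2, while nvidia-smi still lists gpu:0 to gpu:3.
mapping = logical_to_physical("1,2", total_gpus=4)
print(mapping)
```

A physical index taken from nvidia-smi output is therefore only safe to pass to PyTorch after translating it through a mapping like this one.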

This PR uses torch.cuda.mem_get_info instead of nvidia-smi to query the available memory of the GPUs visible to CUDA and PyTorch. The limitation is that this API is available only after PyTorch 1.11, so the integration test has to be skipped on earlier versions.
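A rough sketch of how such a query might look is shown below. It is not the code from this PR: the function names are hypothetical, and picking the GPU with the most free memory is an assumed selection rule. `torch.cuda.mem_get_info` returns a `(free, total)` tuple in bytes for a device index that follows PyTorch's own numbering.

```python
from typing import List, Sequence


def query_free_memory() -> List[int]:
    """Free bytes for every GPU visible to PyTorch, indexed like cuda:N."""
    import torch  # imported lazily so the selection helper works without PyTorch

    if not hasattr(torch.cuda, "mem_get_info"):
        # Older PyTorch versions lack this API; callers may skip instead.
        raise RuntimeError("torch.cuda.mem_get_info is not available in this PyTorch version")
    return [torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())]


def pick_gpu_with_most_free_memory(free_bytes: Sequence[int]) -> int:
    """Return the logical index with the most free memory (assumed rule)."""
    if not free_bytes:
        raise ValueError("no visible GPUs")
    # max() keeps the first index on ties, so the lowest index wins.
    return max(range(len(free_bytes)), key=lambda i: free_bytes[i])
```

On a machine with GPUs, `pick_gpu_with_most_free_memory(query_free_memory())` yields an index that can be passed directly to `torch.device(f"cuda:{index}")`, because both sides use the same logical numbering.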


Signed-off-by: Mingxin Zheng <[email protected]>

wyli · 2023-04-11T19:49:14Z

Thanks @mingxin-zheng, the training seems to be slow (and eventually times out) using this PR:
https://github.com/Project-MONAI/MONAI/actions/runs/4671264330/jobs/8272023655

Also, I can see there are other issues that may not be covered by the tests, such as

research-contributions/auto3dseg/algorithm_templates/swinunetr/scripts/infer.py

Lines 77 to 78 in de5a8d1

    
    self.device = torch.device("cuda:0")
    torch.cuda.set_device(self.device)

auto3dseg/algorithm_templates/dints/scripts/algo.py

mingxin-zheng · 2023-04-12T02:27:28Z

Hi @wyli, thank you for showing the test results in https://github.com/Project-MONAI/MONAI/actions/runs/4671264330/jobs/8272023655.

Could this be the first time the test has run on an 8-GPU node? In the test, there are 8 training images and the batch_size is 2. Maybe that caused the error posted?

The code changes in this PR do not apply to segresnet, and I think they should only affect the test test_autorunner_gpu_customization.

Finally, it may be appropriate to use cuda:0 there. The infer call is managed in the same environment as the Auto3DSeg components such as AutoRunner. So when a command is issued with CUDA_VISIBLE_DEVICES=1, for example when using ensemble, PyTorch will use cuda:0 to find the first available GPU, which is gpu:1. Can you point me to any use case where this may go wrong?
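The environment inheritance described here can be checked without a GPU. The sketch below uses a hypothetical helper, `run_with_visible_devices`, that launches a child Python process with CUDA_VISIBLE_DEVICES set and returns what the child sees; it does not reproduce how AutoRunner actually issues its commands.

```python
import os
import subprocess
import sys

# Code run in the child: report the variable it inherited.
CHILD_CODE = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"


def run_with_visible_devices(devices: str, code: str = CHILD_CODE) -> str:
    """Run `code` in a child Python process restricted to `devices`."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=devices)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # The child sees only the listed device, so a PyTorch script running in
    # it would address that device (physical gpu:1) as cuda:0.
    print(run_with_visible_devices("1"))
```

Because the restriction is applied before the child starts, a hard-coded `torch.device("cuda:0")` inside the child refers to the first device in the list rather than to physical gpu:0.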

auto3dseg/algorithm_templates/dints/scripts/algo.py

wyli · 2023-04-12T05:53:18Z

> Hi @wyli, thank you for showing the test results in https://github.com/Project-MONAI/MONAI/actions/runs/4671264330/jobs/8272023655.
>
> Could this be the first time the test has run on an 8-GPU node? In the test, there are 8 training images and the batch_size is 2. Maybe that caused the error posted?

Thanks @mingxin-zheng, I forgot to add the environment variable; I'm retesting it now. I agree with your analysis and Project-MONAI/MONAI#6342

wyli · 2023-04-12T06:35:07Z

Thanks, the fix works fine now: https://github.com/Project-MONAI/MONAI/actions/runs/4675051127/jobs/8279777563 . For Project-MONAI/MONAI#6342, is it just an integration test issue or a general auto3dseg usability issue? @mingxin-zheng

mingxin-zheng · 2023-04-12T07:16:46Z

Thank you @wyli. It's good to know it works. 6342 was an integration test issue I planned to fix, but I got interrupted by other things. I will submit a PR for it, plus a PR to update the ALGO_HASH.

Fixes #6247 .

### Description

It includes two changes:
- Project-MONAI/research-contributions#218
- Project-MONAI/research-contributions#213

### Types of changes

- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [x] Integration tests passed locally by running `./runtests.sh -f -u --net --coverage`.

Signed-off-by: Mingxin Zheng <[email protected]>

Fix nvidia-smi query wrong gpus which fails test_integration_customiz…

bc00b4c

…ation Signed-off-by: Mingxin Zheng <[email protected]>

mingxin-zheng requested review from dongyang0122 and wyli April 11, 2023 17:35

improve log

17ea2e1

Signed-off-by: Mingxin Zheng <[email protected]>

myron reviewed Apr 11, 2023

auto3dseg/algorithm_templates/dints/scripts/algo.py

dongyang0122 reviewed Apr 12, 2023

auto3dseg/algorithm_templates/dints/scripts/algo.py

mingxin-zheng merged commit b37ed82 into Project-MONAI:main Apr 12, 2023

mingxin-zheng mentioned this pull request Apr 12, 2023

Update AlGO_HASH Project-MONAI/MONAI#6346

Merged

2 tasks

mingxin-zheng deleted the hot-fix branch August 4, 2023 02:47

mingxin-zheng merged 2 commits into Project-MONAI:main from mingxin-zheng:hot-fix


4 participants
