Fix nvidia-smi query wrong gpus which fails MONAI integration test #218

Merged
mingxin-zheng merged 2 commits into Project-MONAI:main from mingxin-zheng:hot-fix
Apr 12, 2023

@mingxin-zheng
Contributor

@mingxin-zheng mingxin-zheng commented Apr 11, 2023

This PR aims to fix the integration issue.

The issue occurs when gpu:0 is not used, for example when CUDA_VISIBLE_DEVICES=1,2 is set. In that case, CUDA in PyTorch only finds gpu:1 and gpu:2 and indexes them as cuda:0 and cuda:1 in the subprocess call. However, nvidia-smi still sees all GPUs, so when a GPU is selected, the one nvidia-smi reports does not match the one PyTorch can see.
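To make the index mismatch concrete, here is a minimal sketch of how a logical CUDA index relates to a physical GPU id. The helper names `visible_physical_ids` and `logical_to_physical` are hypothetical and not part of this PR, and the sketch assumes CUDA_VISIBLE_DEVICES holds plain integer indices (not GPU UUIDs).

```python
import os


def visible_physical_ids(env=None):
    """Return the physical GPU ids listed in CUDA_VISIBLE_DEVICES, or None if unset.

    Hypothetical helper: assumes the variable holds comma-separated integers.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [int(tok) for tok in raw.split(",") if tok.strip()]


def logical_to_physical(logical_index, env=None):
    """Map a logical index (the N in cuda:N) to the physical GPU id."""
    ids = visible_physical_ids(env)
    if ids is None:
        # Nothing is masked, so logical and physical indices coincide.
        return logical_index
    return ids[logical_index]


env = {"CUDA_VISIBLE_DEVICES": "1,2"}
print(logical_to_physical(0, env))  # cuda:0 refers to physical gpu:1
print(logical_to_physical(1, env))  # cuda:1 refers to physical gpu:2
```

A tool that enumerates all physical GPUs would report index 0 for a device that the process above cannot address at all, which is the mismatch described here.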

This PR uses torch.cuda.mem_get_info instead of nvidia-smi to find the available memory on the GPUs visible to CUDA and PyTorch. The limitation is that this API is available only after PyTorch 1.11, which means we need to skip the integration test in prior versions.
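As an illustration only (not the code in this PR), a sketch of choosing the visible GPU with the most free memory could look like the following. `freest_visible_gpu` and `select_device` are hypothetical names; the only PyTorch calls used are `torch.cuda.is_available`, `torch.cuda.device_count`, and `torch.cuda.mem_get_info`, which returns a `(free, total)` tuple in bytes.

```python
def freest_visible_gpu(device_count, mem_get_info):
    """Return the logical index of the GPU with the most free memory.

    mem_get_info is any callable that maps a logical index to a
    (free_bytes, total_bytes) tuple, e.g. torch.cuda.mem_get_info.
    """
    if device_count < 1:
        raise ValueError("no visible GPU")
    free_bytes = [mem_get_info(i)[0] for i in range(device_count)]
    return max(range(device_count), key=lambda i: free_bytes[i])


def select_device():
    """Return a torch.device for the freest visible GPU, or None if unavailable."""
    import torch  # imported lazily so the helper above works without PyTorch

    # Older PyTorch builds may lack mem_get_info; callers should skip in that case.
    if not torch.cuda.is_available() or not hasattr(torch.cuda, "mem_get_info"):
        return None
    index = freest_visible_gpu(torch.cuda.device_count(), torch.cuda.mem_get_info)
    return torch.device(f"cuda:{index}")
```

Because the query goes through PyTorch, the returned index is already a logical index within CUDA_VISIBLE_DEVICES, so no translation from physical ids is needed.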

Signed-off-by: Mingxin Zheng <[email protected]>
@wyli
Contributor

wyli commented Apr 11, 2023

Thanks @mingxin-zheng, the training seems to be slow (and eventually times out) using this PR:
https://github.com/Project-MONAI/MONAI/actions/runs/4671264330/jobs/8272023655

Also, I can see there are other issues that may not be covered by the tests, such as:

self.device = torch.device("cuda:0")
torch.cuda.set_device(self.device)

@mingxin-zheng
Contributor Author

Hi @wyli , thank you for showing the test results in https://github.com/Project-MONAI/MONAI/actions/runs/4671264330/jobs/8272023655.

Could this be the first time the test has run on an 8-GPU node? In the test, there are 8 training images and the batch_size is 2. Maybe that caused the error posted?

The code changes in this PR do not apply to segresnet. I also think they should only affect the test test_autorunner_gpu_customization.

Finally, it may be appropriate to use cuda:0 there. The infer call is managed in the same environment as the Auto3DSeg components such as AutoRunner. So when a command is issued with CUDA_VISIBLE_DEVICES=1, such as when using ensemble, PyTorch will use cuda:0 to find the first available GPU, which is gpu:1. Can you point me to any use case where this may go wrong?
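A small sketch of why cuda:0 stays valid in that setup: a child process launched with a restricted CUDA_VISIBLE_DEVICES inherits that mask, so its first logical device is whichever physical GPU the parent chose. `run_with_gpus` is a hypothetical helper for illustration, and the child here only prints the variable instead of calling PyTorch, so it runs without a GPU.

```python
import os
import subprocess
import sys


def run_with_gpus(physical_ids, code):
    """Run a Python snippet in a child process that only sees the given GPUs."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in physical_ids)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# The child sees only the masked list; inside it, cuda:0 would be physical gpu:1.
child_code = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"
print(run_with_gpus([1], child_code))
```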

@wyli
Contributor

wyli commented Apr 12, 2023

> Hi @wyli , thank you for showing the test results in https://github.com/Project-MONAI/MONAI/actions/runs/4671264330/jobs/8272023655.
>
> Could this be the first time running the test on a 8-GPU node? In the test, there are 8 training images and the batch_size is 2. Maybe that cause the error posted?

Thanks @mingxin-zheng, I forgot to add the environment variable; I'm retesting it now. I agree with your analysis and Project-MONAI/MONAI#6342.

@wyli
Contributor

wyli commented Apr 12, 2023

Thanks, the fix works fine now: https://github.com/Project-MONAI/MONAI/actions/runs/4675051127/jobs/8279777563. For Project-MONAI/MONAI#6342, is it just an integration test issue or a general auto3dseg usability issue? @mingxin-zheng

@mingxin-zheng
Contributor Author

Thank you @wyli. It's good to know it works. 6342 was an integration test issue I planned to fix, but I got interrupted by other things. I will submit a PR for it, plus a PR to update the ALGO_HASH.

@mingxin-zheng mingxin-zheng merged commit b37ed82 into Project-MONAI:main Apr 12, 2023
wyli pushed a commit to Project-MONAI/MONAI that referenced this pull request Apr 12, 2023
Fixes #6247.

### Description

It includes two changes:
- Project-MONAI/research-contributions#218
- Project-MONAI/research-contributions#213

### Types of changes
- [x] Non-breaking change (fix or new feature that would not break
existing functionality).
- [x] Integration tests passed locally by running `./runtests.sh -f -u
--net --coverage`.

Signed-off-by: Mingxin Zheng <[email protected]>
@mingxin-zheng mingxin-zheng deleted the hot-fix branch August 4, 2023 02:47