Fix nvidia-smi query wrong gpus which fails MONAI integration test #218

mingxin-zheng merged 2 commits into Project-MONAI:main
Conversation
thanks @mingxin-zheng, the training seems to be slow (and eventually times out) using this PR: also I can see there are other issues that may not be covered by the tests, such as
Hi @wyli, thank you for showing the test results in https://github.com/Project-MONAI/MONAI/actions/runs/4671264330/jobs/8272023655. Could this be the first time running the test on an 8-GPU node? In the test, there are 8 training images and the batch_size is 2. Maybe that causes the error posted? The code changes in the PR do not apply to Finally, it may be appropriate to use
thanks @mingxin-zheng, I forgot to add the environment variable; I'm retesting it now. I agree with your analysis and Project-MONAI/MONAI#6342
thanks, the fix works fine now: https://github.com/Project-MONAI/MONAI/actions/runs/4675051127/jobs/8279777563. For Project-MONAI/MONAI#6342, is it just an integration test issue or a general auto3dseg usability issue? @mingxin-zheng
Thank you @wyli, that's good to know it works. #6342 was an integration test issue I planned to fix but got interrupted by other things. I will submit a PR on this, plus a PR to update the ALGO_HASH.
Fixes #6247.

### Description

It includes two changes:
- Project-MONAI/research-contributions#218
- Project-MONAI/research-contributions#213

### Types of changes

- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [x] Integration tests passed locally by running `./runtests.sh -f -u --net --coverage`.

Signed-off-by: Mingxin Zheng <[email protected]>
This PR aims to fix the integration issue.

The issue occurs when `gpu:0` is not used, for example under `CUDA_VISIBLE_DEVICES=1,2`. In such a case, CUDA in PyTorch only finds `gpu:1` and `gpu:2`, and indexes them as `cuda:0` and `cuda:1` in the subprocess call. However, `nvidia-smi` can see all GPUs, so when a GPU is selected there is a mismatch between the index `nvidia-smi` reports and the one PyTorch sees.
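The index mismatch can be illustrated with a minimal pure-Python sketch (the function name is illustrative, not from the PR): `nvidia-smi` always numbers the physical GPUs from 0, while PyTorch renumbers only the devices listed in `CUDA_VISIBLE_DEVICES`.

```python
import os

def torch_index_to_physical(torch_idx: int) -> int:
    """Map PyTorch's cuda:<idx> to the physical GPU index nvidia-smi reports."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # All GPUs are visible, so the two numbering schemes agree.
        return torch_idx
    physical = [int(i) for i in visible.split(",") if i.strip()]
    return physical[torch_idx]

# With CUDA_VISIBLE_DEVICES=1,2, PyTorch's cuda:0 is physical GPU 1,
# so memory figures looked up by nvidia-smi index can refer to the wrong device.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"
print(torch_index_to_physical(0))  # → 1
print(torch_index_to_physical(1))  # → 2
```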
This PR uses `torch.cuda.mem_get_info` instead of `nvidia-smi` to find the available memory of the GPUs visible to CUDA and PyTorch. The limitation is that this API is only available from PyTorch 1.11 onward, which means we need to skip the integration test on earlier versions.
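The version guard this implies could be sketched as follows (a hedged illustration, not the PR's actual code; the helper name is hypothetical):

```python
def has_mem_get_info(torch_version: str) -> bool:
    """True if the given PyTorch version string is >= 1.11,
    i.e. torch.cuda.mem_get_info is available."""
    # Drop local build tags like "+cu117" before parsing major.minor.
    parts = torch_version.split("+")[0].split(".")
    major, minor = int(parts[0]), int(parts[1])
    return (major, minor) >= (1, 11)

print(has_mem_get_info("1.10.2"))      # → False: skip the integration test
print(has_mem_get_info("1.13.0"))      # → True
print(has_mem_get_info("2.0.1+cu117")) # → True
```

In practice one would pass `torch.__version__` to such a check and skip the test when it returns False.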