
Workaround OOM issue in classification 2D integration tests#3949

Merged
wyli merged 8 commits into Project-MONAI:dev from Nic-Ma:fix-integration-oom
Mar 16, 2022

Conversation

@Nic-Ma
Contributor

@Nic-Ma Nic-Ma commented Mar 16, 2022

Description

During debugging #3934, we found that the test_integration_classification_2d integration test uses more than 14GB of GPU memory, which can sometimes cause OOM on the CI server.
I then ran many experiments to identify the root cause with this command:
export CUDA_VISIBLE_DEVICES=$(python -m tests.utils); python -m tests.test_integration_classification_2d

  1. Ran the latest MONAI dev branch code in the latest PyTorch dockers (21.12, 22.01, 22.02): 14GB memory.
  2. Ran the latest MONAI dev branch code in the 21.02, 21.08, and 21.10 PyTorch dockers: 4GB memory.
  3. pip installed the latest PyTorch 1.11+cu102 in the very old 21.02 PyTorch docker, then ran the latest MONAI dev code: 4GB memory.

So it seems the issue is not related to the MONAI or PyTorch code; maybe it's some CUDA caching logic issue?

Then I analyzed the integration test and found that in Case 1, when running the training, every DataLoader worker process occupies 839MB of GPU memory; since we set num_workers=10 in the test, the total GPU memory usage reaches 14GB:
[screenshot: nvidia-smi output showing each DataLoader worker process holding ~839MB of GPU memory]

But in Case 2 (21.02 docker), only the main process occupies GPU memory, so it's only 4GB:
[screenshot: nvidia-smi output showing only the main process holding GPU memory]

I also tried changing other DataLoader args, such as pin_memory and persistent_workers, but it didn't change anything.

So this PR changes num_workers from 10 to 1, which uses much less memory and should fix the OOM issue on the CI server.
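To make the numbers concrete, here is a back-of-the-envelope memory model. It is an illustrative sketch only: the per-worker figure (~839MB per CUDA context) and the main-process figure (~4GB) are assumptions taken from the observations above, not exact measurements.

```python
# Back-of-the-envelope GPU memory model for the DataLoader worker issue.
# ASSUMED figures from the observations above: ~839 MB per worker CUDA
# context, ~4 GB (4096 MB) for the main training process.

def estimated_gpu_memory_mb(num_workers: int,
                            per_worker_mb: int = 839,
                            main_process_mb: int = 4096) -> int:
    """Estimate total GPU memory when every worker holds its own CUDA context."""
    return main_process_mb + num_workers * per_worker_mb

print(estimated_gpu_memory_mb(10))  # 12486 MB -> in the ~14GB ballpark of Case 1
print(estimated_gpu_memory_mb(1))   # 4935 MB  -> close to the ~4GB baseline
```

This is why dropping num_workers from 10 to 1 removes most of the pressure: the per-worker CUDA contexts, not the training itself, dominate the total.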

Status

Ready

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2022

/black

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2022

/build

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2022

/integration-test

@Nic-Ma Nic-Ma requested review from ericspod, rijobro and wyli March 16, 2022 03:58
@wyli
Contributor

wyli commented Mar 16, 2022

Installed latest PyTorch 1.11 in the very old 21.02 PyTorch docker, then run the latest MONAI dev code, 4GB memory.

docker 21.02 comes with CUDA 11.2; which version of PyTorch 1.11 did you install?

(the ci still doesn't pass https://github.com/Project-MONAI/MONAI/actions/runs/1990631879)

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2022

Installed latest PyTorch 1.11 in the very old 21.02 PyTorch docker, then run the latest MONAI dev code, 4GB memory.

docker 21.02 comes with CUDA 11.2; which version of PyTorch 1.11 did you install?

(the ci still doesn't pass https://github.com/Project-MONAI/MONAI/actions/runs/1990631879)

Hi @wyli ,

I uninstalled the torch that ships in 21.02 and used pip install to install 1.11:

root@apt-sh-ai:/workspace/data/medical/MONAI# nvidia-smi
Wed Mar 16 08:37:22 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   33C    P0    35W / 250W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:03:00.0 Off |                    0 |
| N/A   33C    P0    36W / 250W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@apt-sh-ai:/workspace/data/medical/MONAI# python
Python 3.8.5 (default, Sep  4 2020, 07:30:14) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__version__)
1.11.0+cu102

The current CI error is not OOM anymore; it's because the expected answers changed. I will update them soon, thanks for pointing it out:
ValueError: no matched results for integration_classification_2d, losses. [0.7778518290086917, 0.16298819936005174, 0.0758052768695886, 0.04581607829565835].
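For context, the integration test presumably compares the recorded losses against stored expected values within some tolerance. A minimal sketch of such a check (a hypothetical helper for illustration, not MONAI's actual test code):

```python
import math

def check_expected(name: str, key: str, observed: list, expected: list,
                   rel_tol: float = 1e-2) -> None:
    # Hypothetical regression check: raise an error like the CI failure above
    # when the observed values drift from the stored expected ones.
    matched = len(observed) == len(expected) and all(
        math.isclose(o, e, rel_tol=rel_tol) for o, e in zip(observed, expected)
    )
    if not matched:
        raise ValueError(f"no matched results for {name}, {key}. {observed}.")

# An unchanged loss curve passes silently; a drifted one raises ValueError.
check_expected("integration_classification_2d", "losses",
               [0.778, 0.163], [0.778, 0.163])
```

Under such a scheme, changing num_workers alters the data ordering seen during training, so the recorded losses shift and the stored expected values have to be regenerated.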

Thanks.

@Nic-Ma Nic-Ma changed the title Workaround OOM issue in classification 2D integration tests [WIP] Workaround OOM issue in classification 2D integration tests Mar 16, 2022
@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2022

I am testing slightly older dockers; maybe we can temporarily roll back to an older docker to avoid changing the test cases.

Thanks.

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2022

I tested the PyTorch dockers below with the MONAI dev code and each docker's built-in PyTorch:

  1. 22.02: 14GB
  2. 22.01: 14GB
  3. 21.12: 14GB
  4. 21.10: 4GB, no DataLoader process occupied GPU memory
  5. 21.08: 4GB
  6. 21.02: 4GB

I plan to temporarily roll back our MONAI docker to PyTorch 21.10; do you guys have any concerns?
@yiheng-wang-nv also helped confirm 21.10 locally.

Thanks in advance.
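The rollback itself amounts to a one-line base-image pin in the Dockerfile; roughly the following (the exact ARG name and NGC tag here are assumptions for illustration; check the repository's Dockerfile for the real lines):

```dockerfile
# Pin the base image to the 21.10 PyTorch container instead of the latest tag.
ARG PYTORCH_IMAGE=nvcr.io/nvidia/pytorch:21.10-py3
FROM ${PYTORCH_IMAGE}
```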

@Nic-Ma Nic-Ma force-pushed the fix-integration-oom branch from e50ac48 to d061868 Compare March 16, 2022 12:40
@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2022

Hi @wyli ,

I think we can still run the regular GPU tests with the latest dockers, so I only changed the integration and cron workflows and the Dockerfile.
What do you think?

Thanks.

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2022

/black

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2022

/build

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2022

/integration-test

@Nic-Ma Nic-Ma changed the title [WIP] Workaround OOM issue in classification 2D integration tests Workaround OOM issue in classification 2D integration tests Mar 16, 2022
@wyli
Contributor

wyli commented Mar 16, 2022

/build

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2022

/integration-test

@wyli wyli enabled auto-merge (squash) March 16, 2022 16:52
@wyli wyli disabled auto-merge March 16, 2022 17:33
@wyli wyli enabled auto-merge (squash) March 16, 2022 17:33
@wyli wyli disabled auto-merge March 16, 2022 18:52
@wyli wyli enabled auto-merge (squash) March 16, 2022 18:52
@wyli wyli disabled auto-merge March 16, 2022 20:05
@wyli wyli enabled auto-merge (squash) March 16, 2022 20:05
@wyli
Contributor

wyli commented Mar 16, 2022

/build

@wyli wyli merged commit 6ea9742 into Project-MONAI:dev Mar 16, 2022
@wyli
Contributor

wyli commented Mar 16, 2022

confirming this works fine -- https://github.com/Project-MONAI/MONAI/runs/5576994468?check_suite_focus=true

wyli added a commit to wyli/MONAI that referenced this pull request Mar 22, 2022
wyli added a commit that referenced this pull request Mar 22, 2022
* fixes multiprocessing memory issue

Signed-off-by: Wenqi Li <[email protected]>

* Revert "Workaround OOM issue in classification 2D integration tests (#3949)"

This reverts commit 6ea9742.

Signed-off-by: Wenqi Li <[email protected]>

* update tests

Signed-off-by: Wenqi Li <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixes typo

Signed-off-by: Wenqi Li <[email protected]>

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>