
Workaround OOM issue in classification 2D integration tests#3949

Merged
wyli merged 8 commits into Project-MONAI:dev from Nic-Ma:fix-integration-oom
Mar 16, 2022

Conversation

@Nic-Ma
Contributor

@Nic-Ma Nic-Ma commented Mar 16, 2022

Description

During debugging #3934, we found that the test_integration_classification_2d integration test uses more than 14GB of GPU memory, which can sometimes cause OOM on the CI server.
I then ran many experiments to identify the root cause with this command:
export CUDA_VISIBLE_DEVICES=$(python -m tests.utils); python -m tests.test_integration_classification_2d

  1. Ran the latest MONAI dev branch code in the latest PyTorch dockers (21.12, 22.01, 22.02): 14GB memory.
  2. Ran the latest MONAI dev branch code in the 21.02, 21.08, and 21.10 PyTorch dockers: 4GB memory.
  3. pip installed the latest PyTorch 1.11+cu102 in the very old 21.02 PyTorch docker, then ran the latest MONAI dev code: 4GB memory.

So it seems the issue is not related to the MONAI or PyTorch code; maybe it's some CUDA caching logic issue?

Then I analyzed the integration test and found that in Case 1, when running the training, every DataLoader worker process occupies 839MB of GPU memory; since we set num_workers=10 in the test, the total GPU memory usage reaches 14GB:
[screenshot: nvidia-smi output showing each DataLoader worker process holding ~839MB of GPU memory]

But in Case 2 (21.02 docker), only the main process occupies GPU memory, so it's only 4GB:
[screenshot: nvidia-smi output showing only the main process holding GPU memory]

I also tried changing other DataLoader args, such as pin_memory and persistent_workers, but it didn't change anything.

So this PR changes num_workers from 10 to 1, which uses much less memory and should fix the OOM issue on the CI server.
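To make the numbers concrete, here is a back-of-the-envelope memory model. It is an illustrative sketch only: the per-worker figure (~839MB per CUDA context) and the main-process figure (~4GB) are assumptions taken from the observations above, not exact measurements.

```python
# Back-of-the-envelope GPU memory model for the DataLoader worker issue.
# ASSUMED figures from the observations above: ~839 MB per worker CUDA
# context, ~4 GB (4096 MB) for the main training process.

def estimated_gpu_memory_mb(num_workers: int,
                            per_worker_mb: int = 839,
                            main_process_mb: int = 4096) -> int:
    """Estimate total GPU memory when every worker holds its own CUDA context."""
    return main_process_mb + num_workers * per_worker_mb

print(estimated_gpu_memory_mb(10))  # 12486 MB -> in the ~14GB ballpark of Case 1
print(estimated_gpu_memory_mb(1))   # 4935 MB  -> close to the ~4GB baseline
```

This is why dropping num_workers from 10 to 1 removes most of the pressure: the per-worker CUDA contexts, not the training itself, dominate the total.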

Status

Ready

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
  • Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
  • In-line docstrings updated.
  • Documentation updated, tested make html command in the docs/ folder.

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2022

/black

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2022

/build

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2022

/integration-test

@Nic-Ma Nic-Ma requested review from ericspod, rijobro and wyli March 16, 2022 03:58
@wyli
Contributor

wyli commented Mar 16, 2022

Installed latest PyTorch 1.11 in the very old 21.02 PyTorch docker, then run the latest MONAI dev code, 4GB memory.

docker 21.02 comes with CUDA 11.2; which version of PyTorch 1.11 did you install?

(the ci still doesn't pass https://github.com/Project-MONAI/MONAI/actions/runs/1990631879)

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2022

Installed latest PyTorch 1.11 in the very old 21.02 PyTorch docker, then run the latest MONAI dev code, 4GB memory.

docker 21.02 comes with CUDA 11.2; which version of PyTorch 1.11 did you install?

(the ci still doesn't pass https://github.com/Project-MONAI/MONAI/actions/runs/1990631879)

Hi @wyli ,

I uninstalled the torch that ships in 21.02 and used pip install to install 1.11:

root@apt-sh-ai:/workspace/data/medical/MONAI# nvidia-smi
Wed Mar 16 08:37:22 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:02:00.0 Off |                    0 |
| N/A   33C    P0    35W / 250W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:03:00.0 Off |                    0 |
| N/A   33C    P0    36W / 250W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@apt-sh-ai:/workspace/data/medical/MONAI# python
Python 3.8.5 (default, Sep  4 2020, 07:30:14) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__version__)
1.11.0+cu102

The current CI error is not OOM anymore; it's because the expected answers changed. I will update them soon, thanks for pointing it out:
ValueError: no matched results for integration_classification_2d, losses. [0.7778518290086917, 0.16298819936005174, 0.0758052768695886, 0.04581607829565835].
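For context, the integration test presumably compares the recorded losses against stored expected values within some tolerance. A minimal sketch of such a check (a hypothetical helper for illustration, not MONAI's actual test code):

```python
import math

def check_expected(name: str, key: str, observed: list, expected: list,
                   rel_tol: float = 1e-2) -> None:
    # Hypothetical regression check: raise an error like the CI failure above
    # when the observed values drift from the stored expected ones.
    matched = len(observed) == len(expected) and all(
        math.isclose(o, e, rel_tol=rel_tol) for o, e in zip(observed, expected)
    )
    if not matched:
        raise ValueError(f"no matched results for {name}, {key}. {observed}.")

# An unchanged loss curve passes silently; a drifted one raises ValueError.
check_expected("integration_classification_2d", "losses",
               [0.778, 0.163], [0.778, 0.163])
```

Under such a scheme, changing num_workers alters the data ordering seen during training, so the recorded losses shift and the stored expected values have to be regenerated.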

Thanks.

@Nic-Ma Nic-Ma changed the title Workaround OOM issue in classification 2D integration tests [WIP] Workaround OOM issue in classification 2D integration tests Mar 16, 2022
@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2022

I am testing slightly older dockers; maybe we can temporarily roll back to an older docker to avoid changing the test cases.

Thanks.

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2022

I tested the PyTorch dockers below with the MONAI dev code and each docker's built-in PyTorch:

  1. 22.02: 14GB
  2. 22.01: 14GB
  3. 21.12: 14GB
  4. 21.10: 4GB, no DataLoader process occupied GPU memory
  5. 21.08: 4GB
  6. 21.02: 4GB

I plan to temporarily roll back our MONAI docker to PyTorch 21.10; do you guys have any concerns?
@yiheng-wang-nv also helped confirm 21.10 locally.

Thanks in advance.
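The rollback itself amounts to a one-line base-image pin in the Dockerfile; roughly the following (the exact ARG name and NGC tag here are assumptions for illustration; check the repository's Dockerfile for the real lines):

```dockerfile
# Pin the base image to the 21.10 PyTorch container instead of the latest tag.
ARG PYTORCH_IMAGE=nvcr.io/nvidia/pytorch:21.10-py3
FROM ${PYTORCH_IMAGE}
```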

@Nic-Ma Nic-Ma force-pushed the fix-integration-oom branch from e50ac48 to d061868 Compare March 16, 2022 12:40
@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2022

Hi @wyli ,

I think we can still run the regular GPU tests with the latest dockers, so I only changed the integration and cron workflows and the Dockerfile.
What do you think?

Thanks.

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2022

/black

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2022

/build

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2022

/integration-test

@Nic-Ma Nic-Ma changed the title [WIP] Workaround OOM issue in classification 2D integration tests Workaround OOM issue in classification 2D integration tests Mar 16, 2022
@wyli
Contributor

wyli commented Mar 16, 2022

/build

@Nic-Ma
Contributor Author

Nic-Ma commented Mar 16, 2022

/integration-test

@wyli wyli enabled auto-merge (squash) March 16, 2022 16:52
@wyli wyli disabled auto-merge March 16, 2022 17:33
@wyli wyli enabled auto-merge (squash) March 16, 2022 17:33
@wyli wyli disabled auto-merge March 16, 2022 18:52
@wyli wyli enabled auto-merge (squash) March 16, 2022 18:52
@wyli wyli disabled auto-merge March 16, 2022 20:05
@wyli wyli enabled auto-merge (squash) March 16, 2022 20:05
@wyli
Contributor

wyli commented Mar 16, 2022

/build

@wyli wyli merged commit 6ea9742 into Project-MONAI:dev Mar 16, 2022
@wyli
Contributor

wyli commented Mar 16, 2022

confirming this works fine -- https://github.com/Project-MONAI/MONAI/runs/5576994468?check_suite_focus=true

wyli added a commit to wyli/MONAI that referenced this pull request Mar 22, 2022
wyli added a commit that referenced this pull request Mar 22, 2022
* fixes multiprocessing memory issue

Signed-off-by: Wenqi Li <[email protected]>

* Revert "Workaround OOM issue in classification 2D integration tests (#3949)"

This reverts commit 6ea9742.

Signed-off-by: Wenqi Li <[email protected]>

* update tests

Signed-off-by: Wenqi Li <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixes typo

Signed-off-by: Wenqi Li <[email protected]>

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>