Add instructions for GPU-enabled PyTorch to installation.md by dzenanz · Pull Request #5600 · Project-MONAI/MONAI

dzenanz · 2022-11-28T20:27:26Z

Fixes #5333.

Types of changes

Documentation updated; tested with the make html command in the docs/ folder.

Signed-off-by: Dženan Zukić <[email protected]> Co-authored-by: Mingxin Zheng <[email protected]>
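
Since this PR documents installing a GPU-enabled PyTorch build, a quick way to confirm which build ended up in an environment might look like the sketch below. The function name `describe_torch_install` and the report keys are hypothetical and chosen here for illustration; the text added to installation.md is not reproduced in this thread, so this is not taken from it.

```python
import importlib.util


def describe_torch_install():
    """Report whether PyTorch is importable and whether it can use CUDA.

    Hypothetical helper for illustration; not part of MONAI or PyTorch.
    """
    report = {
        "installed": False,
        "version": None,
        "cuda_build": None,
        "cuda_available": False,
    }
    # Avoid an ImportError when PyTorch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return report
    try:
        import torch
    except ImportError:
        # The package is present but cannot be loaded (broken install).
        return report

    report["installed"] = True
    report["version"] = torch.__version__
    # torch.version.cuda is None for CPU-only builds.
    report["cuda_build"] = torch.version.cuda
    # True only if a CUDA build is installed and a usable device is visible.
    report["cuda_available"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    print(describe_torch_install())
```

If `cuda_build` is `None`, the installed wheel is CPU-only. If `cuda_build` is set but `cuda_available` is `False`, the build supports CUDA but no usable GPU or driver was found.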

dzenanz · 2022-11-28T23:13:32Z

The failing test is probably unrelated:

tests/test_cumulative_average_dist.py
[W socket.cpp:601] [c10d] The client socket has failed to connect to [localhost]:18964 (errno: 99 - Cannot assign requested address).
2022-11-28 22:40:36,683 - Added key: store_based_barrier_key:1 to store for rank: 1
2022-11-28 22:40:36,686 - Added key: store_based_barrier_key:1 to store for rank: 0
2022-11-28 22:40:36,686 - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2022-11-28 22:40:36,694 - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/__w/MONAI/MONAI/tests/utils.py", line 485, in run_process
    raise e
  File "/__w/MONAI/MONAI/tests/utils.py", line 476, in run_process
    func(*args, **kwargs)
  File "/__w/MONAI/MONAI/tests/utils.py", line 644, in _call_original_func
    return f(*args, **kwargs)
  File "/__w/MONAI/MONAI/tests/test_cumulative_average_dist.py", line 38, in test_value
    val = torch.as_tensor(rank + i, device=device)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
F
======================================================================
FAIL: test_value (__main__.DistributedCumulativeAverage)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/MONAI/MONAI/tests/utils.py", line 521, in _wrapper
    assert results.get(), "Distributed call failed."
AssertionError: Distributed call failed.

----------------------------------------------------------------------
Ran 1 test in 6.986s

FAILED (failures=1)
Process SpawnProcess-2:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/__w/MONAI/MONAI/tests/utils.py", line 485, in run_process
    raise e
  File "/__w/MONAI/MONAI/tests/utils.py", line 476, in run_process
    func(*args, **kwargs)
  File "/__w/MONAI/MONAI/tests/utils.py", line 644, in _call_original_func
    return f(*args, **kwargs)
  File "/__w/MONAI/MONAI/tests/test_cumulative_average_dist.py", line 41, in test_value
    avg_val = avg_meter.aggregate()  # average across all processes
  File "/__w/MONAI/MONAI/monai/metrics/cumulative_average.py", line 94, in aggregate
    dist.all_reduce(sum)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1381, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Broken pipe
Error: Process completed with exit code 1.
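
Both tracebacks above end inside CUDA/NCCL setup (allocating a tensor on the device, then `dist.all_reduce` inside `aggregate()`), not in anything related to documentation. For context, here is a pure-Python sketch of what "average across all processes" amounts to. It is a simulation under assumptions: the name `simulate_distributed_average`, the world size, and the step count are hypothetical, no `torch` or NCCL is involved, and it is not MONAI's implementation.

```python
def simulate_distributed_average(world_size, steps):
    """Simulate averaging values held by several ranks.

    Each simulated rank records the values rank + i for i in range(steps),
    mirroring the `rank + i` expression visible in the traceback. The loop
    length is an assumption; the real test's loop is not shown in the log.
    """
    if world_size <= 0 or steps <= 0:
        raise ValueError("world_size and steps must be positive")

    local_sums = []
    local_counts = []
    for rank in range(world_size):
        values = [rank + i for i in range(steps)]
        local_sums.append(float(sum(values)))
        local_counts.append(len(values))

    # Stand-in for an all-reduce with a sum operation: every rank would end
    # up holding the global sum and the global count.
    total_sum = sum(local_sums)
    total_count = sum(local_counts)
    return total_sum / total_count


if __name__ == "__main__":
    # Two ranks, three steps: rank 0 holds 0, 1, 2 and rank 1 holds 1, 2, 3.
    print(simulate_distributed_average(world_size=2, steps=3))  # 1.5
```

In the real test the reduction step needs a working NCCL communicator between the ranks, which is where the second process failed after the first one ran out of GPU memory.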

wyli · 2022-11-28T23:31:34Z

/build

wyli · 2022-11-29T07:42:09Z

/build

Add instructions for GPU-enabled PyTorch to installation.md

e07995f

Signed-off-by: Dženan Zukić <[email protected]> Co-authored-by: Mingxin Zheng <[email protected]>

dzenanz force-pushed the dev branch from 272b443 to e07995f Compare November 28, 2022 20:55

wyli enabled auto-merge (squash) November 28, 2022 23:30

wyli approved these changes Nov 28, 2022

wyli merged commit 070c97a into Project-MONAI:dev Nov 29, 2022
