Add instructions for GPU-enabled PyTorch to installation.md #5600

Merged
wyli merged 1 commit into Project-MONAI:dev from dzenanz:dev on Nov 29, 2022

dzenanz (Contributor) commented Nov 28, 2022

Fixes #5333.

Types of changes

  • Documentation updated, tested make html command in the docs/ folder.
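
For readers following the installation instructions this PR touches, a quick way to confirm that the installed PyTorch is GPU-enabled can be sketched as follows. This is a minimal sketch and not part of the PR itself; the helper name `describe_torch_gpu_support` is hypothetical, and the check relies only on `torch.version.cuda`, `torch.cuda.is_available()`, and `torch.cuda.device_count()`.

```python
import importlib.util


def describe_torch_gpu_support():
    """Return a short description of the local PyTorch build (hypothetical helper)."""
    # Avoid an ImportError on machines where PyTorch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"

    import torch

    # torch.version.cuda is None for CPU-only builds.
    if torch.version.cuda is None:
        return "CPU-only PyTorch build"

    # A CUDA build can still fail to see a GPU (missing driver, no device).
    if not torch.cuda.is_available():
        return "CUDA build of PyTorch, but no usable GPU was detected"

    return f"CUDA {torch.version.cuda} build, {torch.cuda.device_count()} device(s) visible"


print(describe_torch_gpu_support())
```

If the output reports a CPU-only build, PyTorch was probably installed without CUDA support and may need to be reinstalled with a GPU-enabled package.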

Signed-off-by: Dženan Zukić <[email protected]>
Co-authored-by: Mingxin Zheng <[email protected]>
dzenanz (Contributor, Author) commented Nov 28, 2022

The failing test is probably unrelated:

tests/test_cumulative_average_dist.py
[W socket.cpp:601] [c10d] The client socket has failed to connect to [localhost]:18964 (errno: 99 - Cannot assign requested address).
2022-11-28 22:40:36,683 - Added key: store_based_barrier_key:1 to store for rank: 1
2022-11-28 22:40:36,686 - Added key: store_based_barrier_key:1 to store for rank: 0
2022-11-28 22:40:36,686 - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2022-11-28 22:40:36,694 - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/__w/MONAI/MONAI/tests/utils.py", line 485, in run_process
    raise e
  File "/__w/MONAI/MONAI/tests/utils.py", line 476, in run_process
    func(*args, **kwargs)
  File "/__w/MONAI/MONAI/tests/utils.py", line 644, in _call_original_func
    return f(*args, **kwargs)
  File "/__w/MONAI/MONAI/tests/test_cumulative_average_dist.py", line 38, in test_value
    val = torch.as_tensor(rank + i, device=device)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
F
======================================================================
FAIL: test_value (__main__.DistributedCumulativeAverage)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/MONAI/MONAI/tests/utils.py", line 521, in _wrapper
    assert results.get(), "Distributed call failed."
AssertionError: Distributed call failed.

----------------------------------------------------------------------
Ran 1 test in 6.986s

FAILED (failures=1)
Process SpawnProcess-2:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/__w/MONAI/MONAI/tests/utils.py", line 485, in run_process
    raise e
  File "/__w/MONAI/MONAI/tests/utils.py", line 476, in run_process
    func(*args, **kwargs)
  File "/__w/MONAI/MONAI/tests/utils.py", line 644, in _call_original_func
    return f(*args, **kwargs)
  File "/__w/MONAI/MONAI/tests/test_cumulative_average_dist.py", line 41, in test_value
    avg_val = avg_meter.aggregate()  # average across all processes
  File "/__w/MONAI/MONAI/monai/metrics/cumulative_average.py", line 94, in aggregate
    dist.all_reduce(sum)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1381, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Broken pipe
Error: Process completed with exit code 1.
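
In the log above, the first error is a CUDA out-of-memory failure while rank 0 creates a scalar tensor. The Broken pipe error on rank 1 comes afterwards, and is probably a consequence of that first failure. An allocation that small failing suggests the GPU on the CI runner was already nearly full, which is consistent with the failure being unrelated to a documentation change. One way to check this on a machine would be to query free device memory. The following is a minimal sketch under that assumption; the helper name `free_gpu_memory_mib` is hypothetical, and it uses `torch.cuda.mem_get_info`, which is available in recent PyTorch releases.

```python
import importlib.util


def free_gpu_memory_mib(device_index=0):
    """Return free memory on a CUDA device in MiB, or None if it cannot be queried."""
    # Without PyTorch there is nothing to query.
    if importlib.util.find_spec("torch") is None:
        return None

    import torch

    # No usable GPU, or the requested index does not exist.
    if not torch.cuda.is_available() or device_index >= torch.cuda.device_count():
        return None

    # mem_get_info returns a (free_bytes, total_bytes) tuple for the device.
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device_index)
    return free_bytes / (1024 ** 2)


free_mib = free_gpu_memory_mib()
if free_mib is None:
    print("No CUDA device could be queried")
else:
    print(f"Free GPU memory: {free_mib:.0f} MiB")
```

A very low value before the test starts would point to other jobs occupying the device rather than to the code under test.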

wyli enabled auto-merge (squash) on November 28, 2022 23:30
wyli (Contributor) commented Nov 28, 2022

/build

wyli (Contributor) commented Nov 29, 2022

/build

wyli merged commit 070c97a into Project-MONAI:dev on Nov 29, 2022

Development

Successfully merging this pull request may close these issues.

Improve installation guide for Windows GPU users
