Closed
The following snippet runs out of memory. Commenting out the all_reduce call resolves the issue.
```python
#!/usr/bin/env python
import os
import torch as th
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size):
    """ Distributed function to be implemented later. """
    t = th.rand(100, 100)
    for _ in range(10000000):
        c = t.clone()
        dist.all_reduce(c, dist.reduce_op.SUM)
        t.set_(c)

def init_processes(rank, size, fn, backend='tcp'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)

if __name__ == "__main__":
    size = 4
    processes = []
    for rank in range(size):
        p = Process(target=init_processes, args=(rank, size, run))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```

Possible issue: the retain statements in https://github.com/pytorch/pytorch/blob/master/torch/lib/THD/base/TensorDescriptor.cpp#L8