
th.distributed.all_reduce() memory leak #1827

@seba-1511

Description


The following snippet runs out of memory. Commenting out the all_reduce call eliminates the leak.

#!/usr/bin/env python
import os
import torch as th
import torch.distributed as dist
from torch.multiprocessing import Process

def run(rank, size):
    """ Clone, all_reduce, and swap a tensor forever; memory grows each pass. """
    t = th.rand(100, 100)
    for _ in range(10000000):
        c = t.clone()
        dist.all_reduce(c, dist.reduce_op.SUM)  # sum c in place across all ranks
        t.set_(c)  # point t at the reduced storage for the next iteration

def init_processes(rank, size, fn, backend='tcp'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)


if __name__ == "__main__":
    size = 4
    processes = []
    for rank in range(size):
        p = Process(target=init_processes, args=(rank, size, run))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()

Possible issue: the retain statements in https://github.com/pytorch/pytorch/blob/master/torch/lib/THD/base/TensorDescriptor.cpp#L8. If the descriptor retains the input tensor and nothing ever releases that reference, every all_reduce call pins one clone in memory for good.
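
If that is what is happening, the failure mode reduces to the toy model below (purely illustrative; Buffer, retain, free, and new_descriptor are stand-ins, not THD's actual names):

class Buffer(object):
    """ Stands in for a THTensor's storage plus its reference count. """
    def __init__(self, nbytes):
        self.nbytes = nbytes
        self.refcount = 1

    def retain(self):
        self.refcount += 1

    def free(self):
        self.refcount -= 1
        return self.refcount == 0  # storage is reclaimed only at zero

def new_descriptor(buf):
    buf.retain()  # what the linked TensorDescriptor code appears to do
    return buf    # ...with no matching free() anywhere on the call path

buf = Buffer(100 * 100 * 4)
for _ in range(3):
    new_descriptor(buf)
print(buf.refcount)  # 4: one extra reference per call, so the clone
                     # made on each iteration can never be deallocated

Under this model, the fix would be a matching release when the descriptor is destroyed, so the refcount returns to where it started.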
