DataLoader num_workers > 0 causes CPU memory from parent process to be replicated in all worker processes #13246
Open
Labels
high priority, module: dataloader, module: dependency bug, module: memory usage, module: molly-guard, module: multiprocessing, triaged
Editor note: There is a known workaround further down in this issue, which is to NOT use Python lists, but something else instead, e.g., a torch.tensor directly. See #13246 (comment). You can also use a numpy array, but that only fixes the issue for the fork start method; see #13246 (comment) for more details.

🐛 Bug
CPU memory will leak if the DataLoader has num_workers > 0.

To Reproduce
Run the following snippet:
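A minimal sketch of a repro along these lines (the dataset class name, path format, and loader parameters below are illustrative assumptions, not the original snippet): the dataset keeps millions of small Python objects (string paths) in a plain list, and iterating with num_workers > 0 makes each worker's resident memory grow.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PathDataset(Dataset):
    # Hypothetical dataset: holds millions of small Python objects (strings)
    # in a plain list. Refcount updates made by each worker process when it
    # touches these objects dirty the parent's copy-on-write pages, so each
    # worker's RSS grows over time.
    def __init__(self, n=24_000_000):  # reduce n to fit your machine
        self.paths = ["/data/images/img_%08d.jpg" % i for i in range(n)]

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Merely reading self.paths[idx] is enough to dirty the page it lives on.
        _ = self.paths[idx]
        return torch.zeros(1)

loader = DataLoader(PathDataset(), batch_size=256, shuffle=True, num_workers=8)

for i, batch in enumerate(loader):
    if i % 1000 == 0:
        print(i)  # watch the RSS of the worker processes grow while iterating
```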
Expected behavior
CPU memory will gradually start increasing, eventually filling up the whole RAM. E.g., the process starts with around 15GB and fills up the whole 128GB available on the system.
When num_workers=0, RAM usage is constant.

Environment
Additional info
There are around 24 million images in the dataset and all image paths are loaded into a single list as presented in the above code snippet.
I have also tried multiple PyTorch versions (0.4.0 and 0.4.1), and the effect is the same.
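For reference, a sketch of the workaround mentioned in the editor note above: store the paths in a single numpy byte array instead of a Python list, so there is one large buffer with no per-item Python objects for the workers to touch. Class and variable names and the image-loading step are placeholders; per the note, the numpy variant only helps with the fork start method.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class PathDatasetNoCopy(Dataset):
    # Hypothetical workaround: convert the list of str paths into one numpy
    # byte array. The strings then live in a single contiguous buffer with no
    # per-item refcounts, so reads from worker processes do not dirty the
    # parent's copy-on-write pages.
    def __init__(self, paths):
        self.paths = np.array(paths).astype(np.bytes_)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        path = self.paths[idx].decode("utf-8")
        # ... load and transform the image at `path` here ...
        return torch.zeros(1)
```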
cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @anjali411 @ssnl @VitalyFedyunin @ejguan