DataLoader num_workers > 0 causes CPU memory from parent process to be replicated in all worker processes #13246
Open
Labels
high priority, module: dataloader, module: dependency bug, module: memory usage, module: molly-guard, module: multiprocessing, triaged
Editor note: There is a known workaround further down in this issue, which is to NOT use Python lists, but something else instead, e.g., a torch.tensor directly. See #13246 (comment). You can also use a numpy array, but that only fixes the issue for the fork start method; see #13246 (comment) for more details.

🐛 Bug
CPU memory will leak if the DataLoader has num_workers > 0.

To Reproduce
Run the following snippet:
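A minimal sketch of a repro along these lines (the dataset class name, path format, and loader parameters below are illustrative assumptions, not the original snippet): the dataset keeps millions of small Python objects (string paths) in a plain list, and iterating with num_workers > 0 makes each worker's resident memory grow.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PathDataset(Dataset):
    # Hypothetical dataset: holds millions of small Python objects (strings)
    # in a plain list. Refcount updates made by each worker process when it
    # touches these objects dirty the parent's copy-on-write pages, so each
    # worker's RSS grows over time.
    def __init__(self, n=24_000_000):  # reduce n to fit your machine
        self.paths = ["/data/images/img_%08d.jpg" % i for i in range(n)]

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Merely reading self.paths[idx] is enough to dirty the page it lives on.
        _ = self.paths[idx]
        return torch.zeros(1)

loader = DataLoader(PathDataset(), batch_size=256, shuffle=True, num_workers=8)

for i, batch in enumerate(loader):
    if i % 1000 == 0:
        print(i)  # watch the RSS of the worker processes grow while iterating
```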
Expected behavior
CPU memory will gradually start increasing, eventually filling up the whole RAM. E.g., the process starts with around 15GB and fills up the whole 128GB available on the system.
When num_workers=0, RAM usage is constant.

Environment
Additional info
There are around 24 million images in the dataset and all image paths are loaded into a single list as presented in the above code snippet.
I have also tried multiple PyTorch versions (0.4.0 and 0.4.1), and the effect is the same.
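For reference, a sketch of the workaround mentioned in the editor note above: store the paths in a single numpy byte array instead of a Python list, so there is one large buffer with no per-item Python objects for the workers to touch. Class and variable names and the image-loading step are placeholders; per the note, the numpy variant only helps with the fork start method.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class PathDatasetNoCopy(Dataset):
    # Hypothetical workaround: convert the list of str paths into one numpy
    # byte array. The strings then live in a single contiguous buffer with no
    # per-item refcounts, so reads from worker processes do not dirty the
    # parent's copy-on-write pages.
    def __init__(self, paths):
        self.paths = np.array(paths).astype(np.bytes_)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        path = self.paths[idx].decode("utf-8")
        # ... load and transform the image at `path` here ...
        return torch.zeros(1)
```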
cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @anjali411 @ssnl @VitalyFedyunin @ejguan