
Pytorch dataloader not loading first-available data with multiple workers #105203

@TomEversdijk


🐛 Describe the bug

When using a DataLoader with num_workers > 1, batches are constructed in parallel to speed up data loading. I would expect the dataloader to return the first-available data (a FIFO queue) so that it runs as fast as possible. However, it seems that the worker processes take turns returning data, which slows down data loading quite significantly.

Below is a minimal code example:

import torch
import math
import time

class MyIterableDataset(torch.utils.data.IterableDataset):
    def __init__(self, start, end):
        super().__init__()
        assert end > start, "this example code only works with end > start"
        self.start = start
        self.end = end

    def give_data(self, start, end):
        for i in range(start, end):
            if i > 10:
                time.sleep(2)
            yield i

    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is None:  # single-process data loading, return the full iterator
            iter_start = self.start
            iter_end = self.end
        else:  # in a worker process
            # split workload
            per_worker = int(math.ceil((self.end - self.start) / float(worker_info.num_workers)))
            worker_id = worker_info.id
            iter_start = self.start + worker_id * per_worker
            iter_end = min(iter_start + per_worker, self.end)
        return self.give_data(iter_start, iter_end)
    
if __name__ == "__main__":
    ds = MyIterableDataset(start=0, end=20)

    # Multi-process loading with two worker processes
    for item in torch.utils.data.DataLoader(ds, num_workers=2, batch_size=2):
        print(item)

The result of this script is:

tensor([0, 1]) # Loaded fast
tensor([10, 11])  # Loaded slowly
tensor([2, 3])  # Loaded fast
tensor([12, 13])  # Loaded slowly
tensor([4, 5])  # Loaded fast
tensor([14, 15])  # Loaded slowly
tensor([6, 7])  # Loaded fast
tensor([16, 17])  # Loaded slowly
tensor([8, 9])  # Loaded fast
tensor([18, 19])  # Loaded slowly

However, I would expect the result to be something like:

tensor([0, 1]) # Loaded fast
tensor([2, 3])  # Loaded fast
tensor([4, 5])  # Loaded fast
tensor([6, 7])  # Loaded fast
tensor([10, 11])  # Loaded slowly
tensor([8, 9])  # Loaded fast
tensor([12, 13])  # Loaded slowly
tensor([14, 15])  # Loaded slowly
tensor([16, 17])  # Loaded slowly
tensor([18, 19])  # Loaded slowly
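For reference, the first-available ordering I am expecting can be sketched with a single shared queue that all workers push into, so the consumer receives whichever batch finishes first. This is only a minimal thread-based illustration of the desired FIFO behavior, not how DataLoader is actually implemented (it round-robins across per-worker result queues); fifo_loader and slow_range are hypothetical names, not PyTorch APIs:

```python
import queue
import threading
import time

def fifo_loader(iterables):
    """Yield items from several iterables in first-available order.

    Each worker thread pushes into one shared queue, so a slow worker
    never blocks items that a fast worker has already produced.
    """
    q = queue.Queue()
    done = object()  # sentinel marking one worker's completion

    def worker(it):
        for item in it:
            q.put(item)
        q.put(done)

    threads = [threading.Thread(target=worker, args=(it,)) for it in iterables]
    for t in threads:
        t.start()

    finished = 0
    while finished < len(threads):
        item = q.get()
        if item is done:
            finished += 1
        else:
            yield item
    for t in threads:
        t.join()

def slow_range(start, end, delay):
    """Mimic a data source that takes `delay` seconds per item."""
    for i in range(start, end):
        time.sleep(delay)
        yield i

if __name__ == "__main__":
    # The fast worker's items all arrive before the slow worker's first item.
    for item in fifo_loader([slow_range(0, 5, 0.0), slow_range(10, 15, 0.3)]):
        print(item)
```

With this scheme the output ordering follows completion time rather than worker index, which is the behavior requested above.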

Versions

Collecting environment information...
PyTorch version: 2.0.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 13.3.1 (arm64)
GCC version: Could not collect
Clang version: 14.0.3 (clang-1403.0.22.14.1)
CMake version: version 3.24.1
Libc version: N/A

Python version: 3.11.0 (main, Nov 30 2022, 13:48:51) [Clang 13.1.6 (clang-1316.0.21.2.5)] (64-bit runtime)
Python platform: macOS-13.3.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M1 Pro

Versions of relevant libraries:
[pip3] flake8==6.0.0
[pip3] mypy==1.3.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.25.0
[pip3] pytorch-lightning==2.0.3
[pip3] torch==2.0.1
[pip3] torchmetrics==0.11.4
[conda] Could not collect

cc @ssnl @VitalyFedyunin @ejguan @dzhulgakov

Labels: module: dataloader (Related to torch.utils.data.DataLoader and Sampler), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
