[MXNET-342] Fix the multi worker Dataloader #10628
Conversation
@piiswrong if you want to review
def worker_loop(dataset, key_queue, data_queue, batchify_fn):
    """Worker loop for multiprocessing DataLoader."""
    if isinstance(dataset, RecordFileDataset):
don't do this kind of special casing.
How do you suggest doing it then?
In the same file you used special casing several times, e.g.
if isinstance(data[0], nd.NDArray):
    return nd.stack(*data)
elif isinstance(data[0], tuple):
    data = zip(*data)
@piiswrong we could also have a reload method on Dataset that does nothing for all datasets except the RecordFileDataset ones; that way there is no special casing.
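As a sketch of that suggestion (the method name _reload is illustrative, not necessarily what the PR would use; the RecordFileDataset override would be the code shown in the diff just below):

def _reload_sketch():
    pass  # placeholder so the snippet below reads as a module

class Dataset(object):
    """Abstract dataset base class (sketch)."""

    def _reload(self):
        # No-op by default: most datasets keep nothing open across a fork,
        # so the worker loop can call this unconditionally instead of
        # special-casing RecordFileDataset with isinstance().
        pass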
        Reload the record file.
        """
        idx_file = os.path.splitext(self._filename)[0] + '.idx'
        self._record = recordio.MXIndexedRecordIO(idx_file, self._filename, 'r')
We should fix this in the backend. Why would forking cause a problem? File descriptors should be duplicated when forking.
That would be the ideal solution indeed. https://groups.google.com/forum/#!topic/comp.lang.python/x-C31fCSZso
Contrary to what I stated earlier, it looks like the actual problem could be that the file descriptors get closed rather than shared?
I don't see an easy way to set close_fds=False (https://docs.python.org/3/library/subprocess.html#popen-constructor) since we are using the multiprocessing package rather than subprocess.
Isn't close_fds False by default?
For the subprocess package, it is True by default.
OK, digging a bit more, it seems that the multiprocessing package does not close file descriptors, since it simply calls os.fork(). I have updated the PR description to reflect the issue. tl;dr: a file description keeps track of the current byte offset in the file. When forking, all child processes get a duplicate of the original file descriptor, but those duplicates all refer to the same file description; when the processes try to move the file description's offset at the same time, they cause a crash.
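A minimal Unix-only sketch of the shared-offset behaviour described above (standalone, not MXNet code; 'example.rec' is a placeholder for any file at least 16 bytes long):

import os

fd = os.open('example.rec', os.O_RDONLY)   # placeholder file, >= 16 bytes
pid = os.fork()
if pid == 0:
    # Child: its duplicated descriptor refers to the *same* file description.
    os.read(fd, 16)                        # advances the shared offset
    os._exit(0)
os.waitpid(pid, 0)
# Parent: the offset moved even though the parent never read anything.
print(os.lseek(fd, 0, os.SEEK_CUR))        # prints 16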
This is a generic issue, not specific to recordiodataset. Any dataset that opens a file could be affected. Does it behave the same way if the Dataset is written in Python and opens the file with
It would behave the same way if the dataset relies on reading the file at run-time. To make it clearer, instead of checking the
Do you know if PyTorch has a solution to this? PyTorch's DataLoader and dataset work pretty similarly to ours.
Just had a read through their code; I didn't see anything that would mitigate that issue. However, they also don't provide a dataset that would read through a single file and use
Closing for now after @piiswrong's design concerns. The bug is still present though.
I think PyTorch is also suffering from similar problems: pytorch/pytorch#973. I think we can use ForkingPickler.register(recordio.MXRecordIO, reopen_recordio) to force record files to be reloaded when workers are forked.
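A sketch of that idea, under stated assumptions: reopen_recordio is a hypothetical helper, and the uri/flag attribute names (and their being plain strings) are assumptions about mxnet.recordio. Objects that multiprocessing pickles (e.g. anything sent through its queues) go through ForkingPickler, so a registered reducer can make each worker rebuild its own reader instead of reusing the parent's open file description.

from multiprocessing.reduction import ForkingPickler
from mxnet import recordio

def reopen_recordio(uri, flag):
    # Hypothetical rebuild helper: runs in the worker process and opens a
    # fresh file description, so each worker seeks independently.
    return recordio.MXRecordIO(uri, flag)

def _reduce_recordio(rec):
    # Ship only the path and open mode, never the parent's descriptor.
    # Assumes rec.uri / rec.flag hold plain values usable by the constructor.
    return reopen_recordio, (rec.uri, rec.flag)

ForkingPickler.register(recordio.MXRecordIO, _reduce_recordio)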
Description
MXNET-342
Fix #9974
DataLoader was not compatible with ImageRecordDataset. An earlier version of this description guessed that the file descriptors were (probably) closed on forking (close_fds=True by default), but that is not what happens: the code in DataLoader uses the multiprocessing package, which does not close file descriptors, as it simply calls os.fork(). The actual problem is that the record iterator calls lseek on the file descriptor in each fork, and all of those descriptors point to the same open file description. Since all the forked processes share that single open file description, they all try to move its offset with lseek at the same time, which causes a crash: they cannot read the same file at different positions through a single open file description.
See https://stackoverflow.com/questions/4277289/are-file-descriptors-shared-when-forking and https://stackoverflow.com/questions/11733481/can-anyone-explain-a-simple-description-regarding-file-descriptor-after-fork for more information on the distinction between a file descriptor and a file description.
Now the record file is reloaded in each worker so that each individual process gets its own open file description.
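A minimal sketch of what the worker-side reload can look like (the _reload hook name mirrors the idea discussed in the conversation and is an assumption, as is the exact queue protocol shown; this is not the PR's verbatim diff):

def worker_loop(dataset, key_queue, data_queue, batchify_fn):
    """Worker loop for multiprocessing DataLoader (sketch)."""
    # Re-open any backing files in this process so the worker gets its own
    # open file description instead of sharing the parent's seek offset.
    dataset._reload()
    while True:
        idx, samples = key_queue.get()
        if idx is None:          # sentinel: shut the worker down
            break
        batch = batchify_fn([dataset[i] for i in samples])
        data_queue.put((idx, batch))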
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.