[MXNET-342] Fix the multi worker Dataloader #10628
Conversation
@piiswrong if you want to review
def worker_loop(dataset, key_queue, data_queue, batchify_fn):
    """Worker loop for multiprocessing DataLoader."""
    if isinstance(dataset, RecordFileDataset):
don't do this kind of special casing.
How do you suggest doing it then?
In the same file you used special casing several times, e.g.
if isinstance(data[0], nd.NDArray):
    return nd.stack(*data)
elif isinstance(data[0], tuple):
    data = zip(*data)
@piiswrong we could also have a reload method on Dataset that does nothing for all datasets except the RecordFileDataset ones; that way there is no special casing.
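As a sketch of that suggestion (the method name _reload is illustrative, not necessarily what the PR would use; the RecordFileDataset override would be the code shown in the diff just below):

def _reload_sketch():
    pass  # placeholder so the snippet below reads as a module

class Dataset(object):
    """Abstract dataset base class (sketch)."""

    def _reload(self):
        # No-op by default: most datasets keep nothing open across a fork,
        # so the worker loop can call this unconditionally instead of
        # special-casing RecordFileDataset with isinstance().
        pass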
        Reload the record file.
        """
        idx_file = os.path.splitext(self._filename)[0] + '.idx'
        self._record = recordio.MXIndexedRecordIO(idx_file, self._filename, 'r')
We should fix this in the backend. Why would forking cause a problem? File descriptors should be duplicated when forking.
That would be the ideal solution indeed. https://groups.google.com/forum/#!topic/comp.lang.python/x-C31fCSZso
Contrary to what I stated earlier, it looks like the actual problem could be that the file descriptors get closed rather than shared?
I don't see an easy way to set close_fds=False (https://docs.python.org/3/library/subprocess.html#popen-constructor) since we are using the multiprocessing package rather than subprocess.
Isn't close_fds False by default?
For the subprocess package, it is True by default.
OK, digging a bit more, it seems that the multiprocessing package does not close file descriptors, since it simply calls os.fork(). I have updated the PR description to reflect the issue. tl;dr: a file description keeps track of the current byte offset in the file. When forking, all child processes get a duplicate of the original file descriptor, but those duplicates all refer to the same file description; when the processes try to move the file description's offset at the same time, they cause a crash.
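A minimal Unix-only sketch of the shared-offset behaviour described above (standalone, not MXNet code; 'example.rec' is a placeholder for any file at least 16 bytes long):

import os

fd = os.open('example.rec', os.O_RDONLY)   # placeholder file, >= 16 bytes
pid = os.fork()
if pid == 0:
    # Child: its duplicated descriptor refers to the *same* file description.
    os.read(fd, 16)                        # advances the shared offset
    os._exit(0)
os.waitpid(pid, 0)
# Parent: the offset moved even though the parent never read anything.
print(os.lseek(fd, 0, os.SEEK_CUR))        # prints 16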
This is a generic issue, not specific to recordiodataset. Any dataset that opens a file could be affected. Does it behave the same way if the Dataset is written in Python and opens the file with
It would behave the same way if the dataset relies on reading the file at run-time. To make it clearer, instead of checking the
Do you know if PyTorch has a solution to this? PyTorch's DataLoader and dataset work pretty similarly to ours.
Just had a read through their code; I didn't see anything that would mitigate that issue. However, they also don't provide a dataset that would read through a single file and use
Closing for now after @piiswrong's design concerns. The bug is still present though.
I think PyTorch is also suffering from similar problems: pytorch/pytorch#973. I think we can use ForkingPickler.register(recordio.MXRecordIO, reopen_recordio) to force record files to be reloaded when workers are forked.
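A sketch of that idea, under stated assumptions: reopen_recordio is a hypothetical helper, and the uri/flag attribute names (and their being plain strings) are assumptions about mxnet.recordio. Objects that multiprocessing pickles (e.g. anything sent through its queues) go through ForkingPickler, so a registered reducer can make each worker rebuild its own reader instead of reusing the parent's open file description.

from multiprocessing.reduction import ForkingPickler
from mxnet import recordio

def reopen_recordio(uri, flag):
    # Hypothetical rebuild helper: runs in the worker process and opens a
    # fresh file description, so each worker seeks independently.
    return recordio.MXRecordIO(uri, flag)

def _reduce_recordio(rec):
    # Ship only the path and open mode, never the parent's descriptor.
    # Assumes rec.uri / rec.flag hold plain values usable by the constructor.
    return reopen_recordio, (rec.uri, rec.flag)

ForkingPickler.register(recordio.MXRecordIO, _reduce_recordio)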
Description
MXNET-342
Fix #9974
DataLoader was not compatible with ImageRecordDataset. An earlier version of this description guessed that the file descriptors were (probably) closed on forking (close_fds=True by default), but that is not what happens: the code in DataLoader uses the multiprocessing package, which does not close file descriptors, as it simply calls os.fork(). The actual problem is that the record iterator calls lseek on the file descriptor in each fork, and all of those descriptors point to the same open file description. Since all the forked processes share that single open file description, they all try to move its offset with lseek at the same time, which causes a crash: they cannot read the same file at different positions through a single open file description.
See https://stackoverflow.com/questions/4277289/are-file-descriptors-shared-when-forking and https://stackoverflow.com/questions/11733481/can-anyone-explain-a-simple-description-regarding-file-descriptor-after-fork for more information on the distinction between a file descriptor and a file description.
Now the record file is reloaded in each worker so that each individual process gets its own open file description.
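A minimal sketch of what the worker-side reload can look like (the _reload hook name mirrors the idea discussed in the conversation and is an assumption, as is the exact queue protocol shown; this is not the PR's verbatim diff):

def worker_loop(dataset, key_queue, data_queue, batchify_fn):
    """Worker loop for multiprocessing DataLoader (sketch)."""
    # Re-open any backing files in this process so the worker gets its own
    # open file description instead of sharing the parent's seek offset.
    dataset._reload()
    while True:
        idx, samples = key_queue.get()
        if idx is None:          # sentinel: shut the worker down
            break
        batch = batchify_fn([dataset[i] for i in samples])
        data_queue.put((idx, batch))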
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.