Fix dataloader hang when it is not completely iterated #9655
facebook-github-bot
@ssnl has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pytorchbot retest this please
This is something I was thinking of too. I don't know why, but it still (very occasionally) produces a hang. I totally agree it should work, and I will try to figure out why it still hangs on occasion.
@pytorchbot retest this please
I've been running a script over and over today to check. The hang happened only twice out of many hundred runs, so who knows. I think it should be fine.
Did it hang at one of the joins? If so, which one was it?
Thanks for checking! :)
apaszke
Generally looks good, but I'd rather get rid of the done_event unless it's necessary.
    torch.manual_seed(seed)

    # Do not wait for putting thread to join when this worker exits. Otherwise,
    # this worker may always be waiting to put and doesn't check index_queue
-   if r is None:
+   # use done_event so that we can get faster exiting signal even if there
+   # are still indices in index_queue
+   if r is None or done_event.is_set():
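The check in this hunk lets a worker stop as soon as shutdown is requested, rather than first draining every pending index. A minimal sketch of that pattern, assuming thread-based workers and `queue.Queue` in place of the multiprocessing primitives the real loader uses. `worker_loop` and the doubling step are hypothetical stand-ins, not the code from this PR:

```python
import queue
import threading


def worker_loop(index_queue, result_queue, done_event):
    """Consume indices until a None sentinel arrives or done_event is set."""
    while True:
        r = index_queue.get()  # blocks until an index or the sentinel arrives
        # Exit on the sentinel, or early if shutdown was requested even
        # though unprocessed indices remain in index_queue.
        if r is None or done_event.is_set():
            return
        result_queue.put(r * 2)  # stand-in for loading and collating a batch


def run_worker(indices, stop_early):
    """Run one worker over `indices`; optionally set done_event up front."""
    index_queue, result_queue = queue.Queue(), queue.Queue()
    done_event = threading.Event()
    for i in indices:
        index_queue.put(i)
    index_queue.put(None)  # sentinel for the normal exit path
    if stop_early:
        done_event.set()
    t = threading.Thread(target=worker_loop,
                         args=(index_queue, result_queue, done_event))
    t.start()
    t.join(timeout=5)
    results = []
    while not result_queue.empty():
        results.append(result_queue.get())
    return t.is_alive(), results, index_queue.qsize()
```

With `stop_early=True` the worker returns on its first `get()` and leaves the remaining items in `index_queue`. Without it, the worker runs until it reads the `None` sentinel.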
    self.index_queues = [multiprocessing.Queue() for _ in range(self.num_workers)]
    self.worker_queue_idx = 0
-   self.worker_result_queue = multiprocessing.SimpleQueue()
+   self.worker_result_queue = multiprocessing.Queue()
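One practical difference between the two queue types, which may be part of the motivation for this change (the hidden review comments are not available here): `multiprocessing.Queue.get` accepts a `timeout` and raises `queue.Empty` when it expires, whereas `multiprocessing.SimpleQueue.get` takes no timeout and always blocks. A small sketch, where `get_or_none` is a hypothetical helper and not part of the PR:

```python
import multiprocessing
import queue


def get_or_none(result_queue, timeout):
    """Return the next item, or None if nothing arrives within `timeout` seconds."""
    try:
        return result_queue.get(timeout=timeout)
    except queue.Empty:
        # multiprocessing.Queue reuses the queue.Empty exception on timeout.
        return None


def demo():
    q = multiprocessing.Queue()
    first = get_or_none(q, timeout=0.05)  # nothing has been put yet
    q.put("batch")
    second = get_or_none(q, timeout=5)    # the item put above
    q.close()
    q.join_thread()
    return first, second
```

A bounded wait like this lets the consumer periodically check whether it should stop instead of blocking forever on a queue that no worker will ever fill.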
torch/utils/data/dataloader.py
Outdated
    if self.pin_memory or self.timeout > 0:
    if self.pin_memory:
    self.data_queue = queue.Queue()
    if self.pin_memory:
torch/utils/data/dataloader.py
Outdated
    # removes pids no matter what
    if not self.shutdown:
        self.shutdown = True
        self.done_event.set()
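The guard in this hunk makes shutdown idempotent: the flag and `done_event` are set only on the first call. A rough sketch of how such a guard might sit in an iterator's teardown, assuming thread-based workers for brevity. `ShutdownSketch` and its method bodies are illustrative, not the PR's implementation:

```python
import queue
import threading


class ShutdownSketch:
    """Toy owner of worker threads with an idempotent shutdown."""

    def __init__(self, num_workers=2):
        self.shutdown = False
        self.done_event = threading.Event()
        self.index_queues = [queue.Queue() for _ in range(num_workers)]
        self.workers = [
            threading.Thread(target=self._worker, args=(q,), daemon=True)
            for q in self.index_queues
        ]
        for w in self.workers:
            w.start()

    def _worker(self, index_queue):
        while True:
            r = index_queue.get()
            if r is None or self.done_event.is_set():
                return

    def shutdown_workers(self):
        # Only the first call signals and joins; later calls are no-ops.
        if not self.shutdown:
            self.shutdown = True
            self.done_event.set()
            for q in self.index_queues:
                q.put(None)  # wake workers blocked in get()
            for w in self.workers:
                w.join(timeout=5)
```

Because of the guard, teardown can be invoked from more than one place (for example an explicit close and a destructor) without sending duplicate sentinels.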
-   time.sleep(self.sleep_sec)
+   if not self.sleeped:
+       time.sleep(self.sleep_sec)
+       self.sleeped = True
Summary: second trial of pytorch#7140. cc csarofeen. Let's see if this works. It passes everything locally.
Pull Request resolved: pytorch#9655
Differential Revision: D8940177
Pulled By: SsnL
fbshipit-source-id: 8d6340fc9f7355c71e1e26b262da166402faa158

…ch#9655)" (pytorch#9804)
Summary: This reverts commit 9ee5133.
Pull Request resolved: pytorch#9804
Reviewed By: ezyang
Differential Revision: D8987780
Pulled By: SsnL
fbshipit-source-id: 75ad70b0b8d672d0b35235fa248b187be64b68e5

…orch#10366)
Summary: pytorch#9655
Pull Request resolved: pytorch#10366
Differential Revision: D9237393
Pulled By: SsnL
fbshipit-source-id: fabfad7f371ba33300098f6b885c0e3f26c3e14a
second trial of #7140
cc @csarofeen Let's see if this works. It passes everything locally.