-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Fix race condition in ThreadPool::workOnTasksUntilCompleted #14833
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@highker, if I have some general questions about the thread pool API, where should I put them? :) |
aten/src/ATen/core/thread_pool.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: won't
|
Non-atomic |
|
So, if I understand correctly, our ivalue implementation isn't able to report error states, which is why the exception is swallowed here. Should it report errors? Seems like a good idea to me... |
aten/src/ATen/core/thread_pool.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need some better documentation on ivalue, documenting the relationship between the locks and when callbacks are called. (This appears to be correct but I had to check the implementation of addCallback to see if callbacks are run with or without the lock.)
aten/src/ATen/core/thread_pool.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't understand this comment upon reading it.
What I think you are trying to say, is that there is a race between when the thread that calls markCompleted on the future releases the lock, and when the callback is run (because the callback is run without synchronization and has to reacquire the lock). So the order of events is something like:
- We add callback to the future.
- In another thread, it is marked completed. We mark the future as completed,
- Main thread grabs the lock and does... something?
But what I still don't understand what the point of forcing the loop to run once is. For example, I think you could have fixed the bug simply by adding a lock acquire inside the callback. Could you tell me where I have reasoned incorrectly?
ezyang
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like it would solve the problem, but I am wondering if it is the simplest solution to solve the problem. In any case, approving.
|
|
@zdevito just realized we don't need |
facebook-github-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@highker has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
|
lgtm on atomic bool for running_ |
facebook-github-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@highker has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
facebook-github-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@highker has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Resolves #14704 Pull Request resolved: pytorch/pytorch#14833 Differential Revision: D13405211 Pulled By: highker fbshipit-source-id: 8552d51eeb5d3af0ed66c461e5ddfeb9ae2926bd
Resolves #14704