
Conversation


@highker highker commented Dec 6, 2018

Resolves #14704

@highker highker added the oncall: jit Add this issue/PR to JIT oncall triage queue label Dec 6, 2018

ezyang commented Dec 10, 2018

@highker, if I have some general questions about the thread pool API, where should I put them? :)


nit: won't


ezyang commented Dec 10, 2018

https://github.com/pytorch/pytorch/blob/36363f2538358eedc17f45f6c24d52a4c313ab86/aten/src/ATen/core/thread_pool.h#L56

Non-atomic running_ variable looks fishy here. You're writing to it from the main ThreadPool owning thread, and reading it from sub-threads. The test for running_ in the loop is not protected by the lock. So it seems you have a data race here.


ezyang commented Dec 10, 2018

https://github.com/pytorch/pytorch/blob/36363f2538358eedc17f45f6c24d52a4c313ab86/aten/src/ATen/core/thread_pool.cpp#L130

So, if I understand correctly, our ivalue implementation isn't able to report error states, which is why the exception is swallowed here. Should it report errors? Seems like a good idea to me...


I think we need some better documentation on ivalue, documenting the relationship between the locks and when callbacks are called. (This appears to be correct but I had to check the implementation of addCallback to see if callbacks are run with or without the lock.)


I didn't understand this comment upon reading it.

What I think you are trying to say is that there is a race between when the thread that calls markCompleted on the future releases the lock and when the callback is run (because the callback is run without synchronization and has to reacquire the lock). So the order of events is something like:

  1. We add the callback to the future.
  2. In another thread, the future is marked completed.
  3. Main thread grabs the lock and does... something?

But I still don't understand the point of forcing the loop to run once. For example, I think you could have fixed the bug simply by adding a lock acquire inside the callback. Could you tell me where I have reasoned incorrectly?


@ezyang ezyang left a comment


This seems like it would solve the problem, but I wonder whether it is the simplest way to solve it. In any case, approving.


highker commented Dec 10, 2018

@ezyang

  • I think you are right, running_ looks a bit unsafe. cc: @ilia-cher
  • For exceptions, we handle them in "Support error handling in forked threads" (#14523). That should address your concern.
  • For the documentation and comments in the interfaces of the thread pool and future, I will try to make them more thorough in a coming amendment to this commit before I merge.
  • For general questions on threadpool/executors, ping me, @zdevito, or @ilia-cher maybe... we don't have a group or doc at the moment unfortunately.


highker commented Dec 10, 2018

@zdevito just realized we don't need workOnTasksUntilCompleted anymore; let me know if you agree on that.

@facebook-github-bot facebook-github-bot left a comment


@highker has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ilia-cher

lgtm on atomic bool for running_



zdevito pushed a commit to zdevito/ATen that referenced this pull request Dec 11, 2018
Summary:
Resolves #14704
Pull Request resolved: pytorch/pytorch#14833

Differential Revision: D13405211

Pulled By: highker

fbshipit-source-id: 8552d51eeb5d3af0ed66c461e5ddfeb9ae2926bd
@ezyang ezyang added the merged label Jun 25, 2019


Development

Successfully merging this pull request may close these issues.

Flaky test after TestJit test_async_script times out

4 participants