
Conversation


@highker highker commented Dec 6, 2018

Resolves #14704

@highker highker added the oncall: jit Add this issue/PR to JIT oncall triage queue label Dec 6, 2018

ezyang commented Dec 10, 2018

@highker, if I have some general questions about the thread pool API, where should I put them? :)


nit: won't


ezyang commented Dec 10, 2018

https://github.com/pytorch/pytorch/blob/36363f2538358eedc17f45f6c24d52a4c313ab86/aten/src/ATen/core/thread_pool.h#L56

Non-atomic running_ variable looks fishy here. You're writing to it from the main ThreadPool owning thread, and reading it from sub-threads. The test for running_ in the loop is not protected by the lock. So it seems you have a data race here.


ezyang commented Dec 10, 2018

https://github.com/pytorch/pytorch/blob/36363f2538358eedc17f45f6c24d52a4c313ab86/aten/src/ATen/core/thread_pool.cpp#L130

So, if I understand correctly, our ivalue implementation isn't able to report error states, which is why the exception is swallowed here. Should it report errors? Seems like a good idea to me...


I think we need some better documentation on ivalue, documenting the relationship between the locks and when callbacks are called. (This appears to be correct but I had to check the implementation of addCallback to see if callbacks are run with or without the lock.)


I didn't understand this comment upon reading it.

What I think you are trying to say is that there is a race between when the thread that calls markCompleted on the future releases the lock and when the callback is run (because the callback is run without synchronization and has to reacquire the lock). So the order of events is something like:

  1. We add the callback to the future.
  2. In another thread, the future is marked completed.
  3. Main thread grabs the lock and does... something?

But I still don't understand the point of forcing the loop to run once. For example, I think you could have fixed the bug simply by adding a lock acquire inside the callback. Could you tell me where I have reasoned incorrectly?


@ezyang ezyang left a comment


This seems like it would solve the problem, but I wonder whether it is the simplest way to solve it. In any case, approving.


highker commented Dec 10, 2018

@ezyang

  • I think you are right, running_ looks a bit unsafe. cc: @ilia-cher
  • For exceptions, we handle them in "Support error handling in forked threads" (#14523). That should address your concern.
  • For the documentation and comments in the interfaces of the thread pool and future, I will try to make them more thorough in a coming amendment to this commit before I merge.
  • For general questions on threadpool/executors, ping me, @zdevito, or @ilia-cher maybe... we don't have a group or doc at the moment unfortunately.


highker commented Dec 10, 2018

@zdevito just realized we don't need workOnTasksUntilCompleted anymore; let me know if you agree on that.

@facebook-github-bot facebook-github-bot left a comment


@highker has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ilia-cher

lgtm on atomic bool for running_



zdevito pushed a commit to zdevito/ATen that referenced this pull request Dec 11, 2018
Summary:
Resolves #14704
Pull Request resolved: pytorch/pytorch#14833

Differential Revision: D13405211

Pulled By: highker

fbshipit-source-id: 8552d51eeb5d3af0ed66c461e5ddfeb9ae2926bd
@ezyang ezyang added the merged label Jun 25, 2019


Development

Successfully merging this pull request may close these issues.

Flaky test after TestJit test_async_script times out

4 participants