Improve error handling for distributed autograd engine. #27940

Conversation
1) If we receive an error for outstanding rpcs, we enqueue an appropriate error on the local autograd engine.
2) Add an `exit_on_error` mode for the local autograd engine, where the computation stops if we see an error.

Differential Revision: [D17916844](https://our.internmc.facebook.com/intern/diff/D17916844/)
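The `exit_on_error` behavior described here can be sketched as a minimal, self-contained model (hypothetical names and a simplified worker loop, not the actual PyTorch engine types): work for a graph task stops being consumed as soon as an error is recorded.

```cpp
#include <atomic>
#include <deque>

// Hypothetical, simplified stand-ins for the engine types; only the
// exit_on_error control flow from the PR description is modeled here.
struct GraphTask {
  std::atomic<bool> has_error_{false};
  bool exit_on_error_ = false;
};

// Runs queued work items (true = success, false = error) and returns how
// many were executed before the loop stopped.
int run(GraphTask& task, std::deque<bool>& work) {
  int executed = 0;
  while (!work.empty()) {
    // In exit_on_error mode, stop as soon as any error has been recorded.
    if (task.exit_on_error_ && task.has_error_.load()) {
      break;
    }
    bool ok = work.front();
    work.pop_front();
    ++executed;
    if (!ok) {
      task.has_error_.store(true);  // analogue of enqueueing the error
    }
  }
  return executed;
}
```

With `exit_on_error_` unset, the loop keeps draining work past errors, matching the pre-existing engine behavior; with it set, the loop stops at the first recorded error.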
torch/csrc/distributed/autograd/context/dist_autograd_context.cpp: 3 resolved review comments (outdated).
@ezyang I think your proposal of just setting an error on the graph task would be much better. Although, in general there is an issue with this PR when …

Yes, this makes sense. We need to implement some sort of orderly shutdown. One thing I observe is that we should drain the ready queues no matter what. However, as you point out in (2), there may be jobs from multiple graph tasks in the ready queue, so we can't conveniently drain them. This suggests to me that tombstoning the GraphTask in some way is the right thing to do (1) (and (3) seems annoying to me). This could be done either by making everything a `shared_ptr`, or by holding a `weak_ptr` to the GraphTask.

I am wondering now how the autograd engine handles this sort of situation today, even in the absence of distributed autograd. It would probably be worth seeing what happens here. Maybe @albanD knows.
I think the shared_ptr + weak_ptr approach sounds best; I'll implement that in this PR.

This works in the autograd engine today since it doesn't return until all outstanding tasks are done, and hence there is a guarantee that no NodeTasks are in the queue for that GraphTask. We introduced …

Makes sense!
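The shared_ptr + weak_ptr idea agreed on above can be sketched as follows (a simplified illustration, not the PyTorch implementation; the names `NodeTask` and `base_` mirror the discussion): the engine owns each `GraphTask` through a `shared_ptr`, queued `NodeTask`s hold only `weak_ptr`s, and resetting the owning pointer effectively tombstones the task.

```cpp
#include <memory>
#include <vector>

struct GraphTask {};

struct NodeTask {
  std::weak_ptr<GraphTask> base_;  // non-owning reference to its graph task
};

// Counts how many queued tasks still belong to a live GraphTask; expired
// ones would simply be skipped (drained) by the worker.
int live_tasks(const std::vector<NodeTask>& queue) {
  int live = 0;
  for (const auto& task : queue) {
    if (auto graph_task = task.base_.lock()) {
      ++live;  // safe to execute: the GraphTask is still alive
    }
  }
  return live;
}
```

Because `lock()` either yields a valid `shared_ptr` or an empty one, a worker never touches a destroyed GraphTask even when tasks from multiple graph tasks share one ready queue.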
1) If we receive an error for outstanding rpcs in distributed autograd, we enqueue an appropriate error on the local autograd engine.
2) Add an `exit_on_error` mode for the local autograd engine, where the computation stops if we see an error.
3) Use std::weak_ptr for GraphTasks within NodeTasks. This helps avoid executing NodeTasks whose associated GraphTask has expired.

Differential Revision: [D17916844](https://our.internmc.facebook.com/intern/diff/D17916844/)
@ezyang I've implemented the weak_ptr approach in this PR, could you take another look? Thanks!
I feel we have hit some sort of GitHub bug. I can see engine.cpp in the list of files changed, but GitHub claims there are no changes. And when I look at the base commit and the head commit, I do see the change. Maybe you should just close this PR and export it to a new one.
```cpp
bool graph_task_completed(const GraphTask& graph_task) {
  return graph_task.outstanding_tasks_.load() == 0 ||
      (graph_task.exit_on_error_ && graph_task.has_error_.load());
}
```
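For illustration, here is a self-contained model of this predicate, with `GraphTask` reduced to just the fields the predicate reads (a hypothetical stand-in, not the real class): a graph task counts as done either when all of its tasks have finished, or, in `exit_on_error` mode, as soon as an error has been recorded.

```cpp
#include <atomic>
#include <cstdint>

// Minimal stand-in for GraphTask with only the fields the predicate uses.
struct GraphTask {
  std::atomic<uint64_t> outstanding_tasks_{0};
  std::atomic<bool> has_error_{false};
  bool exit_on_error_ = false;
};

bool graph_task_completed(const GraphTask& graph_task) {
  // Done when all tasks have finished, or (exit_on_error mode) on first error.
  return graph_task.outstanding_tasks_.load() == 0 ||
      (graph_task.exit_on_error_ && graph_task.has_error_.load());
}
```

Note that without `exit_on_error_`, an error never short-circuits completion, which is the pre-existing behavior the review comment below asks about.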
Now I am wondering... when should we ever NOT exit on error?
Well, I didn't want to change the behavior of the current autograd engine, and hence I added this logic. I assumed there was a good reason we continue executing other tasks even though we hit an error for one.
This looks good. There are some minor blocking things. There is a more major question of whether or not the asserted weak-pointer lock accesses are sound; in the comments above I've suggested API changes that could make them provably sound. I am also curious why testing whether the graph task is not nullptr is no longer sufficient to tell if you are a reentrant thread; maybe you just made this change for clarity? Would be good to know.
Yes, I just felt adding a …
1) If we receive an error for outstanding rpcs in distributed autograd, we enqueue an appropriate error on the local autograd engine.
2) Add an `exit_on_error` mode for the local autograd engine, where the computation stops if we see an error.
3) Use std::weak_ptr for GraphTasks within NodeTasks. This helps avoid executing NodeTasks whose associated GraphTask has expired.
4) Added a `clean_shutdown` parameter to `dist_init` to support test cases which can't shut down cleanly across all nodes (ex: simulating node failures).

Differential Revision: [D17916844](https://our.internmc.facebook.com/intern/diff/D17916844/)
This pull request has been merged in 1322daa.
Summary: Pull Request resolved: pytorch#27940

1) If we receive an error for outstanding rpcs, we enqueue an appropriate error on the local autograd engine.
2) Add an `exit_on_error` mode for the local autograd engine, where the computation stops if we see an error.

ghstack-source-id: 92603377
Test Plan: Added unit tests to test failures.
Differential Revision: D17916844
fbshipit-source-id: 199a7832f1033c36a9bbcc1e80d86576c04965d0
albanD
left a comment
@pritamdamania87 just reviving this PR, as it was the simplest way to check how `TORCH_INTERNAL_ASSERT(!reentrant_thread);` could be triggered. Other internal asserts might need to be removed and properly handled as well.
```cpp
// better.
set_device(device);
thread_main(nullptr);
std::shared_ptr<GraphTask> graph_task = nullptr;
```
Why do you explicitly create a graph_task here? Why not just construct it on the fly, like you do in pushShutdownTask to create the NodeTask?
I think we can avoid doing this. I guess what happened was that in a previous version of the PR the GraphTask was passed in as a non-const ref, so we couldn't just pass nullptr here. But with a const ref, it should be fine to pass in nullptr.
```cpp
if (!(local_graph_task = task.base_.lock())) {
  // Reentrant thread's graph task should not expire since we hold a
  // reference to it in this method.
  TORCH_INTERNAL_ASSERT(!reentrant_thread);
```
This could be triggered when two graph tasks are being executed at the same time. One is reentrant, and this worker is currently the reentrant one; but the other graph task got killed, so local_graph_task, which points to a completely different graph task than graph_task, can actually be invalid.
This is correct. The assumption we had here was that the reentrant thread was only executing tasks from the GraphTask passed in, but looking at the code more closely, that is not true. I'm guessing just removing this check would suffice?
I think we just want to ignore such tasks, yes; that would correspond to "draining" the dead graph task's work from that queue.
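The ignore-and-drain behavior discussed here can be sketched like this (a simplified model with hypothetical names, following the weak_ptr design above, not the actual engine code): a worker draining a shared ready queue executes only the tasks whose GraphTask still locks, so tearing down one graph task leaves the other's work untouched instead of tripping an assert.

```cpp
#include <memory>
#include <queue>

struct GraphTask {
  int completed = 0;  // how many of this task's NodeTasks actually ran
};

struct NodeTask {
  std::weak_ptr<GraphTask> base_;
};

// Drains the shared ready queue: expired tasks are silently discarded,
// live ones are "executed".
void drain(std::queue<NodeTask>& ready) {
  while (!ready.empty()) {
    NodeTask task = ready.front();
    ready.pop();
    if (auto local_graph_task = task.base_.lock()) {
      local_graph_task->completed++;  // execute the task
    }
    // else: the owning GraphTask was killed; just skip the stale work
  }
}
```

This is exactly the situation from the assert above: the reentrant worker may pull a NodeTask belonging to a different, already-destroyed graph task, and skipping it is the safe response.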
Stack from ghstack:

1) If we receive an error for outstanding rpcs, we enqueue an appropriate error on the local autograd engine.
2) Add an `exit_on_error` mode for the local autograd engine, where the computation stops if we see an error.
3) Add a `clean_shutdown` parameter to `dist_init` to support test cases which can't shut down cleanly across all nodes (ex: simulating node failures).

Differential Revision: D17916844