Improve error handling for distributed autograd engine. #27940

Conversation
1) If we receive an error for outstanding rpcs, we enqueue an appropriate error on the local autograd engine.
2) Add an `exit_on_error` mode for the local autograd engine, where the computation stops if we see an error.

Differential Revision: [D17916844](https://our.internmc.facebook.com/intern/diff/D17916844/)
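The `exit_on_error` behavior described here can be sketched as a minimal, self-contained model (hypothetical names and a simplified worker loop, not the actual PyTorch engine types): work for a graph task stops being consumed as soon as an error is recorded.

```cpp
#include <atomic>
#include <deque>

// Hypothetical, simplified stand-ins for the engine types; only the
// exit_on_error control flow from the PR description is modeled here.
struct GraphTask {
  std::atomic<bool> has_error_{false};
  bool exit_on_error_ = false;
};

// Runs queued work items (true = success, false = error) and returns how
// many were executed before the loop stopped.
int run(GraphTask& task, std::deque<bool>& work) {
  int executed = 0;
  while (!work.empty()) {
    // In exit_on_error mode, stop as soon as any error has been recorded.
    if (task.exit_on_error_ && task.has_error_.load()) {
      break;
    }
    bool ok = work.front();
    work.pop_front();
    ++executed;
    if (!ok) {
      task.has_error_.store(true);  // analogue of enqueueing the error
    }
  }
  return executed;
}
```

With `exit_on_error_` unset, the loop keeps draining work past errors, matching the pre-existing engine behavior; with it set, the loop stops at the first recorded error.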
torch/csrc/distributed/autograd/context/dist_autograd_context.cpp: 3 resolved review comments (outdated).
@ezyang I think your proposal of just setting an error on the graph task would be much better. Although, in general there is an issue with this PR when …

Yes, this makes sense. We need to implement some sort of orderly shutdown. One thing I observe is that we should drain the ready queues no matter what. However, as you point out in (2), there may be jobs from multiple graph tasks in the ready queue, so we can't conveniently drain them. This suggests to me that tombstoning the GraphTask in some way is the right thing to do (1) (and (3) seems annoying to me). This could be done either by making everything a `shared_ptr`, or by holding a `weak_ptr` to the GraphTask.

I am wondering now how the autograd engine handles this sort of situation today, even in the absence of distributed autograd. It would probably be worth seeing what happens here. Maybe @albanD knows.
I think the shared_ptr + weak_ptr approach sounds best; I'll implement that in this PR.

This works in the autograd engine today since it doesn't return until all outstanding tasks are done, and hence there is a guarantee that no NodeTasks are in the queue for that GraphTask. We introduced …

Makes sense!
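The shared_ptr + weak_ptr idea agreed on above can be sketched as follows (a simplified illustration, not the PyTorch implementation; the names `NodeTask` and `base_` mirror the discussion): the engine owns each `GraphTask` through a `shared_ptr`, queued `NodeTask`s hold only `weak_ptr`s, and resetting the owning pointer effectively tombstones the task.

```cpp
#include <memory>
#include <vector>

struct GraphTask {};

struct NodeTask {
  std::weak_ptr<GraphTask> base_;  // non-owning reference to its graph task
};

// Counts how many queued tasks still belong to a live GraphTask; expired
// ones would simply be skipped (drained) by the worker.
int live_tasks(const std::vector<NodeTask>& queue) {
  int live = 0;
  for (const auto& task : queue) {
    if (auto graph_task = task.base_.lock()) {
      ++live;  // safe to execute: the GraphTask is still alive
    }
  }
  return live;
}
```

Because `lock()` either yields a valid `shared_ptr` or an empty one, a worker never touches a destroyed GraphTask even when tasks from multiple graph tasks share one ready queue.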
1) If we receive an error for outstanding rpcs in distributed autograd, we enqueue an appropriate error on the local autograd engine.
2) Add an `exit_on_error` mode for the local autograd engine, where the computation stops if we see an error.
3) Use std::weak_ptr for GraphTasks within NodeTasks. This helps avoid executing NodeTasks whose associated GraphTask has expired.

Differential Revision: [D17916844](https://our.internmc.facebook.com/intern/diff/D17916844/)
@ezyang I've implemented the weak_ptr approach in this PR, could you take another look? Thanks!
I feel we have hit some sort of GitHub bug. I can see engine.cpp in the list of files changed, but GitHub claims there are no changes. And when I look at the base commit and the head commit, I do see the change. Maybe you should just close this PR and export it to a new one.
```cpp
bool graph_task_completed(const GraphTask& graph_task) {
  return graph_task.outstanding_tasks_.load() == 0 ||
      (graph_task.exit_on_error_ && graph_task.has_error_.load());
}
```
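For illustration, here is a self-contained model of this predicate, with `GraphTask` reduced to just the fields the predicate reads (a hypothetical stand-in, not the real class): a graph task counts as done either when all of its tasks have finished, or, in `exit_on_error` mode, as soon as an error has been recorded.

```cpp
#include <atomic>
#include <cstdint>

// Minimal stand-in for GraphTask with only the fields the predicate uses.
struct GraphTask {
  std::atomic<uint64_t> outstanding_tasks_{0};
  std::atomic<bool> has_error_{false};
  bool exit_on_error_ = false;
};

bool graph_task_completed(const GraphTask& graph_task) {
  // Done when all tasks have finished, or (exit_on_error mode) on first error.
  return graph_task.outstanding_tasks_.load() == 0 ||
      (graph_task.exit_on_error_ && graph_task.has_error_.load());
}
```

Note that without `exit_on_error_`, an error never short-circuits completion, which is the pre-existing behavior the review comment below asks about.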
Now I am wondering... when should we ever NOT exit on error?
Well, I didn't want to change the behavior of the current autograd engine, and hence I added this logic. I assumed there was a good reason we continue executing other tasks even though we hit an error for one.
This looks good. There are some minor blocking things. There is a more major question of whether or not the asserted weak-pointer lock accesses are sound; in the comments above I've suggested API changes that could make them provably sound. I am also curious why testing whether the graph task is not nullptr is no longer sufficient to tell if you are a reentrant thread; maybe you just made this change for clarity? Would be good to know.
Yes, I just felt adding a …
1) If we receive an error for outstanding rpcs in distributed autograd, we enqueue an appropriate error on the local autograd engine.
2) Add an `exit_on_error` mode for the local autograd engine, where the computation stops if we see an error.
3) Use std::weak_ptr for GraphTasks within NodeTasks. This helps avoid executing NodeTasks whose associated GraphTask has expired.
4) Added a `clean_shutdown` parameter to `dist_init` to support test cases which can't shut down cleanly across all nodes (ex: simulating node failures).

Differential Revision: [D17916844](https://our.internmc.facebook.com/intern/diff/D17916844/)
This pull request has been merged in 1322daa.
Summary: Pull Request resolved: pytorch#27940

1) If we receive an error for outstanding rpcs, we enqueue an appropriate error on the local autograd engine.
2) Add an `exit_on_error` mode for the local autograd engine, where the computation stops if we see an error.

ghstack-source-id: 92603377
Test Plan: Added unit tests to test failures.
Differential Revision: D17916844
fbshipit-source-id: 199a7832f1033c36a9bbcc1e80d86576c04965d0
albanD
left a comment
@pritamdamania87 just reviving this PR, as it was the simplest way to check how `TORCH_INTERNAL_ASSERT(!reentrant_thread);` could be triggered. Other internal asserts might need to be removed and properly handled as well.
```cpp
// better.
set_device(device);
thread_main(nullptr);
std::shared_ptr<GraphTask> graph_task = nullptr;
```
Why do you explicitly create a graph_task here? Why not just construct it on the fly, like you do in pushShutdownTask to create the NodeTask?
I think we can avoid doing this. I guess what happened was that in a previous version of the PR the GraphTask was passed in as a non-const ref, so we couldn't just pass nullptr here. But with a const ref, it should be fine to pass in nullptr.
```cpp
if (!(local_graph_task = task.base_.lock())) {
  // Reentrant thread's graph task should not expire since we hold a
  // reference to it in this method.
  TORCH_INTERNAL_ASSERT(!reentrant_thread);
```
This could be triggered when two graph tasks are being executed at the same time. One is reentrant, and this worker is currently the reentrant one; but the other graph task got killed, so local_graph_task, which points to a completely different graph task than graph_task, can actually be invalid.
This is correct. The assumption we had here was that the reentrant thread was only executing tasks from the GraphTask passed in, but looking at the code more closely, that is not true. I'm guessing just removing this check would suffice?
I think we just want to ignore such tasks, yes; that would correspond to "draining" the dead graph task's work from that queue.
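The ignore-and-drain behavior discussed here can be sketched like this (a simplified model with hypothetical names, following the weak_ptr design above, not the actual engine code): a worker draining a shared ready queue executes only the tasks whose GraphTask still locks, so tearing down one graph task leaves the other's work untouched instead of tripping an assert.

```cpp
#include <memory>
#include <queue>

struct GraphTask {
  int completed = 0;  // how many of this task's NodeTasks actually ran
};

struct NodeTask {
  std::weak_ptr<GraphTask> base_;
};

// Drains the shared ready queue: expired tasks are silently discarded,
// live ones are "executed".
void drain(std::queue<NodeTask>& ready) {
  while (!ready.empty()) {
    NodeTask task = ready.front();
    ready.pop();
    if (auto local_graph_task = task.base_.lock()) {
      local_graph_task->completed++;  // execute the task
    }
    // else: the owning GraphTask was killed; just skip the stale work
  }
}
```

This is exactly the situation from the assert above: the reentrant worker may pull a NodeTask belonging to a different, already-destroyed graph task, and skipping it is the safe response.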
Stack from ghstack:

1) If we receive an error for outstanding rpcs, we enqueue an appropriate error on the local autograd engine.
2) Add an `exit_on_error` mode for the local autograd engine, where the computation stops if we see an error.
3) Add a `clean_shutdown` parameter to `dist_init` to support test cases which can't shut down cleanly across all nodes (ex: simulating node failures).

Differential Revision: D17916844