
@pritamdamania87 (Contributor) commented Oct 14, 2019

Stack from ghstack:

  1. If we receive an error for outstanding RPCs in distributed autograd, we enqueue an appropriate error on the local autograd engine.
  2. Add an `exit_on_error` mode for the local autograd engine, where the computation stops if we see an error (a rough sketch of points 1 and 2 follows this list).
  3. Use `std::weak_ptr` for GraphTasks within NodeTasks. This helps avoid executing NodeTasks whose associated GraphTask has expired.
  4. Added a `clean_shutdown` parameter to `dist_init` to support test cases which can't shut down cleanly across all nodes (e.g. simulating node failures).
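
Conceptually, points 1 and 2 amount to recording an error on the per-backward state and treating "error seen" as a completion condition. A minimal standalone sketch of that idea, with purely illustrative names rather than the engine's real GraphTask API (see also the actual `graph_task_completed` snippet reviewed further down):

```cpp
#include <atomic>
#include <mutex>
#include <string>

// Toy stand-in for the per-backward-call state; names are illustrative and
// do not match the real engine's GraphTask.
struct ToyGraphTask {
  std::atomic<int> outstanding_tasks{0};
  bool exit_on_error{false};
  std::atomic<bool> has_error{false};
  std::string error_message;
  std::mutex mutex;
};

// Called when an outstanding RPC fails: record the error so that workers in
// exit_on_error mode stop pulling further work for this task.
void enqueue_error(ToyGraphTask& task, const std::string& msg) {
  std::lock_guard<std::mutex> lock(task.mutex);
  task.error_message = msg;
  task.has_error = true;
}

// With exit_on_error, "completed" means no work left OR an error was seen.
bool completed(const ToyGraphTask& task) {
  return task.outstanding_tasks.load() == 0 ||
      (task.exit_on_error && task.has_error.load());
}
```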

Differential Revision: D17916844

pritamdamania87 pushed a commit that referenced this pull request Oct 14, 2019
ghstack-source-id: 91886719
Pull Request resolved: #27940
pritamdamania87 pushed a commit that referenced this pull request Oct 15, 2019
Pull Request resolved: #27940
ghstack-source-id: 91913393
@pritamdamania87 (Contributor, Author)

@ezyang I think your proposal of just setting an error on the graph task would be much better. That said, there is a more general issue with this PR when `exit_on_error` is true: `engine::Execute()` returns to the user, which in turn destroys the GraphTask object. However, there could still be NodeTask objects in the ready_queue holding a reference to that GraphTask (a raw `GraphTask*`). When those tasks eventually execute, we would segfault. There are a few ways to fix this:

  1. NodeTask holds a shared_ptr to the GraphTask, keeping it alive as a result (a variant of this is sketched after this list). This would work, but I'm not sure about the performance overhead, since the autograd engine might create a lot of NodeTasks.
  2. We drain all the NodeTasks from the ready_queue before we return from engine::Execute. I'm not sure how to do this currently, since the ready_queue is not indexed in any way by GraphTask and contains NodeTasks from different GraphTasks.
  3. The GraphTask holds a list of its NodeTasks. When the GraphTask object is destroyed, it marks all the NodeTasks in its list with a flag indicating they should not be executed.
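
For illustration, the ownership change behind option (1), in the weak-reference variant adopted later in this thread, boils down to something like the following sketch (toy names only, not the engine's actual types):

```cpp
#include <memory>

// Toy stand-ins; the real GraphTask/NodeTask carry much more state.
struct ToyGraphTask {};

// The queued unit of work holds only a weak reference, so the ready_queue
// never extends the GraphTask's lifetime after Execute() has returned.
struct ToyNodeTask {
  std::weak_ptr<ToyGraphTask> base;
};

// Worker-side check: returns the owning GraphTask if it is still alive,
// or an empty shared_ptr if it has already been destroyed.
std::shared_ptr<ToyGraphTask> try_acquire(const ToyNodeTask& task) {
  return task.base.lock();
}
```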

@ezyang (Contributor) commented Oct 16, 2019

Yes, this makes sense. We need to implement some sort of orderly shutdown. One thing I observe is that we should drain the ready queues no matter what. However, as you point out in (2), there may be jobs from multiple graph tasks in the ready queue, so we can't conveniently drain them. This suggests to me that tombstoning the GraphTask in some way, as in (1), is the right thing to do (and (3) seems annoying to me). This could be done either by making everything a shared_ptr, as you suggest, or you could have just the main thread hold a shared_ptr and the other threads hold weak_ptrs (an expired graph task means you don't have to run the node task anymore).

I am wondering now how the autograd engine handles this sort of situation today, even in the absence of distributed autograd. It would probably be worth seeing what happens here. Maybe @albanD knows.

@pritamdamania87 (Contributor, Author)

I think the shared_ptr + weak_ptr approach sounds best; I'll implement that in this PR.

> I am wondering now how the autograd engine handles this sort of situation today, even in the absence of distributed autograd. It would probably be worth seeing what happens here. Maybe @albanD knows.

This works in the autograd engine today since it doesn't return until all outstanding tasks are done, so there is a guarantee that no NodeTasks for that GraphTask remain in the queue. The `exit_on_error` mode introduced in this PR breaks that guarantee.

@ezyang (Contributor) commented Oct 17, 2019

> This works in the autograd engine today since it doesn't return until all outstanding tasks are done, so there is a guarantee that no NodeTasks for that GraphTask remain in the queue. The `exit_on_error` mode introduced in this PR breaks that guarantee.

Makes sense!

1) If we receive an error for outstanding rpcs in distributed autograd, we enqueue an appropriate error
on the local autograd engine.
2) Add an `exit_on_error` mode for the local autograd engine, where the
computation stops if we see an error.
3) Use std::weak_ptr for GraphTasks within NodeTasks. This would help in avoiding executing NodeTasks whose associated GraphTask has expired.

Differential Revision: [D17916844](https://our.internmc.facebook.com/intern/diff/D17916844/)

[ghstack-poisoned]
pritamdamania87 pushed a commit that referenced this pull request Oct 19, 2019
Pull Request resolved: #27940
ghstack-source-id: 92245816
pritamdamania87 pushed a commit that referenced this pull request Oct 19, 2019
Pull Request resolved: #27940
ghstack-source-id: 92246771
@rohan-varma self-requested a review October 22, 2019 00:25
@pritamdamania87 (Contributor, Author) commented Oct 22, 2019

@ezyang I've implemented the weak_ptr approach in this PR, could you take another look? Thanks!

pritamdamania87 pushed a commit that referenced this pull request Oct 23, 2019
Pull Request resolved: #27940
ghstack-source-id: 92428140
@ezyang (Contributor) commented Oct 23, 2019

I feel we have hit some sort of GitHub bug. I can see engine.cpp in the list of files changed but GitHub claims there are no changes. And when I look at the base commit and the head commit, I do see the change. Maybe you should just close this PR and export to a new one.

pritamdamania87 pushed a commit that referenced this pull request Oct 23, 2019
Pull Request resolved: #27940
ghstack-source-id: 92470121

bool graph_task_completed(const GraphTask& graph_task) {
  return graph_task.outstanding_tasks_.load() == 0 ||
      (graph_task.exit_on_error_ && graph_task.has_error_.load());
}
Contributor:

Now I am wondering... when should we ever NOT exit on error?

Contributor Author:

Well, I didn't want to change the behavior of the current autograd engine, hence I added this logic. I assumed there was a good reason why we continue executing other tasks even after hitting an error in one.

Contributor:

TBH, I think it's an oversight in the current engine. @albanD, @apaszke any ideas?

@ezyang (Contributor) commented Oct 23, 2019

This looks good. There are some minor blocking things. There is a more major question of whether or not the asserted weak pointer lock accesses are sound; in the comments above I've suggested api changes that could make them provably sound. I am also curious why testing if graph task is not nullptr is no longer sufficient to tell if you are a reentrant thread; maybe you just made this change for clarity purposes? Would be good to know.

@pritamdamania87 (Contributor, Author)

> I am also curious why testing if graph task is not nullptr is no longer sufficient to tell if you are a reentrant thread; maybe you just made this change for clarity purposes?

Yes, I just felt that adding a `reentrant_thread` bool made the code easier to read.

1) If we receive an error for outstanding rpcs in distributed autograd, we enqueue an appropriate error
on the local autograd engine.
2) Add an `exit_on_error` mode for the local autograd engine, where the
computation stops if we see an error.
3) Use std::weak_ptr for GraphTasks within NodeTasks. This would help in avoiding executing NodeTasks whose associated GraphTask has expired.
4) Added a `clean_shutdown` parameter to `dist_init` to support test cases which can't shutdown cleanly across all nodes (ex: simulating node failures).

Differential Revision: [D17916844](https://our.internmc.facebook.com/intern/diff/D17916844/)

[ghstack-poisoned]
pritamdamania87 pushed a commit that referenced this pull request Oct 25, 2019
Pull Request resolved: #27940
ghstack-source-id: 92603377
@facebook-github-bot (Contributor)

This pull request has been merged in 1322daa.

@facebook-github-bot deleted the gh/pritamdamania87/14/head branch October 28, 2019 22:18
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020
Summary:
Pull Request resolved: pytorch#27940

1) If we receive an error for outstanding rpcs, we enqueue an appropriate error
on the local autograd engine.
2) Add an `exit_on_error` mode for the local autograd engine, where the
computation stops if we see an error.
ghstack-source-id: 92603377

Test Plan: Added unit tests to test failures.

Differential Revision: D17916844

fbshipit-source-id: 199a7832f1033c36a9bbcc1e80d86576c04965d0
@albanD (Collaborator) left a comment

@pritamdamania87 just reviving this PR, as it was simpler to check here how `TORCH_INTERNAL_ASSERT(!reentrant_thread);` could be triggered. Other internal asserts might need to be removed and properly handled as well.

// better.
set_device(device);
thread_main(nullptr);
std::shared_ptr<GraphTask> graph_task = nullptr;
Collaborator:

Why do you explicitly create a graph_task here? Why not just construct it on the fly, like you do in pushShutdownTask to create the NodeTask?

Contributor Author:

I think we can avoid doing this. I guess what happened is that in a previous version of the PR the GraphTask was passed in as a non-const ref, so we couldn't just pass nullptr here. But with a const ref, it should be fine to pass in nullptr.

if (!(local_graph_task = task.base_.lock())) {
  // Reentrant thread's graph task should not expire since we hold a
  // reference to it in this method.
  TORCH_INTERNAL_ASSERT(!reentrant_thread);
Collaborator:

This could be triggered when two graph_tasks are being executed at the same time: one is reentrant and this worker is currently the reentrant one, but the other graph_task got killed. In that case local_graph_task, which points to a completely different graph task than graph_task, can actually be invalid.

Contributor Author:

This is correct. The assumption here was that the reentrant thread only executes tasks from the GraphTask passed in, but looking at the code more closely, that is not true. I'm guessing just removing this check would suffice?

Collaborator:

Yes, I think we just want to ignore such tasks; that would correspond to "draining" the queue of the work associated with that graph_task.
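
A rough sketch of that "skip instead of assert" behavior in a worker loop, with toy names that are not the engine's actual thread_main:

```cpp
#include <memory>
#include <queue>

struct ToyGraphTask {};

struct ToyNodeTask {
  std::weak_ptr<ToyGraphTask> base;  // weak reference to the owning GraphTask
};

// Worker loop: a task whose GraphTask has already been destroyed (for example,
// another backward call that exited early on error) is simply dropped, which
// is the "draining" behavior described above.
void process(std::queue<ToyNodeTask>& ready_queue) {
  while (!ready_queue.empty()) {
    ToyNodeTask task = std::move(ready_queue.front());
    ready_queue.pop();
    std::shared_ptr<ToyGraphTask> graph_task = task.base.lock();
    if (!graph_task) {
      continue;  // expired GraphTask: skip rather than assert
    }
    // ... run the autograd function associated with this task ...
  }
}
```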
