Delay reduction of unused parameters until first autograd hook is called #22219
Conversation
Reduction of gradients for unused parameters should happen as soon as possible, because they can block reduction of gradients for used parameters. Previously this happened immediately when `prepare_for_backward` was called and found parameters that didn't contribute. As a result, if a model had unused parameters and you wanted to discard the model output (i.e. not call backward on some loss), reduction of the gradients of those unused parameters would already have been kicked off, and you'd see an error the next time you called `forward`.

This commit changes that approach to delay reduction of the gradients of unused parameters until the first autograd hook is called. This means you can now discard the model output regardless of whether the model has unused parameters.

This is a prerequisite for making the `find_unused_parameters` argument to DDP default to `True`.
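For illustration, here is a minimal sketch of the usage pattern this enables: a DDP-wrapped model with an unused submodule whose output is discarded on one iteration and used normally on the next. The `ToyModel` module, the single-process `gloo` process group, and the tensor shapes are assumptions made up for the example; exact behavior may vary across PyTorch versions.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn

# Hypothetical toy model: `unused` never contributes to the output,
# so its parameters receive no gradient during backward.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(10, 10)
        self.unused = nn.Linear(10, 10)

    def forward(self, x):
        return self.used(x)

def main():
    # Single-process process group, for illustration only.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = nn.parallel.DistributedDataParallel(
        ToyModel(),
        find_unused_parameters=True,
    )

    # Discard the output of the first forward pass without calling backward.
    _ = model(torch.randn(2, 10))

    # Second iteration runs backward as usual; gradients of `unused`
    # are reduced once the first autograd hook fires.
    out = model(torch.randn(2, 10))
    out.sum().backward()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Before this change, the second call to `model(...)` would have raised an error, because reduction of the unused parameters' gradients had already been started during the first, discarded iteration.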
392b3c8 to 216c384
mrshenli left a comment
This breaks `test_no_used_parameters`, since no post hook will be called at all. It seems we need to add another special case for when all params are unused, or is it possible to use `queue_callback` to register it upfront?
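For context on the `queue_callback` suggestion, here is a rough Python-level sketch of that engine mechanism, not the reducer's actual C++ code: a final callback queued with the autograd engine runs once the current backward pass completes. The `finalize_backward` function and the toy tensor are made up for the illustration, and on recent PyTorch versions such callbacks can only be installed while a backward pass is running, e.g. from inside a gradient hook.

```python
import torch
from torch.autograd import Variable

# Hypothetical finalization step; in the reducer this is where pending
# reductions (e.g. for unused parameters) could be kicked off.
def finalize_backward():
    print("backward pass finished")

x = torch.randn(4, requires_grad=True)

def grad_hook(grad):
    # Engine callbacks run after the current backward pass completes;
    # install one here, from inside a hook, while backward is running.
    Variable._execution_engine.queue_callback(finalize_backward)
    return grad

x.register_hook(grad_hook)

loss = (x * 2).sum()
loss.backward()  # gradients are computed, then finalize_backward() runs
```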
Regarding …
@pytorchbot retest this please
@pytorchbot retest this please
After checking in with CircleCI it is clear that the error for …
facebook-github-bot left a comment
@pietern is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
facebook-github-bot left a comment
@pietern has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Delay reduction of unused parameters until first autograd hook is called (pytorch#22219)
Pull Request resolved: pytorch#22219
Differential Revision: D16028698
Pulled By: pietern
fbshipit-source-id: c6aec2cd39c4a77746495d9cb1c9fb9c5ac61983