
Conversation

@chenyangyu1988 (Contributor)

Summary:
Add a unit test to ensure that no gradient sync happens when calling ddp.module(input) directly, i.e., without invoking prepare_for_backward.

PyText depends on DDP for data-parallel distributed training. To accumulate gradients locally before syncing them, we call orig_model.forward instead of ddp_model.forward. Add a unit test so that future changes do not break this assumption.
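The calling pattern described above can be illustrated with a toy stand-in for the DDP wrapper. This is a minimal sketch: `ToyDDP`, `ToyModule`, and the `sync_armed` flag are hypothetical names, not PyTorch internals. The point is only that calling the wrapper arms gradient-sync bookkeeping, while calling the wrapped module directly skips it.

```python
# Minimal sketch (hypothetical names, not PyTorch internals): a toy wrapper
# mimicking DDP's dispatch. Calling the wrapper arms sync bookkeeping (a
# stand-in for prepare_for_backward); calling .module directly skips it.

class ToyModule:
    def __call__(self, x):
        return x * 2  # a trivial "forward"

class ToyDDP:
    """Stand-in for DistributedDataParallel's calling convention."""
    def __init__(self, module):
        self.module = module
        self.sync_armed = False  # models prepare_for_backward being queued

    def __call__(self, x):
        self.sync_armed = True   # real DDP arms gradient sync here
        return self.module(x)

ddp = ToyDDP(ToyModule())

# Local accumulation path: bypass the wrapper, no sync bookkeeping.
assert ddp.module(3) == 6 and not ddp.sync_armed

# Normal DDP path: sync bookkeeping kicks in.
assert ddp(3) == 6 and ddp.sync_armed
```

With real DDP the same pattern applies: forward passes through `ddp.module(...)` leave gradients accumulating locally, and a final pass through `ddp(...)` triggers the usual synchronization during backward.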

Differential Revision: D15263155

@pytorchbot added the "oncall: distributed" label (add this issue/PR to the distributed oncall triage queue) on May 8, 2019

@mrshenli mrshenli left a comment


Thanks for adding the test. I feel this test should only live temporarily, to make sure that changes in PyTorch do not break your use case, and it should retire when we have a better solution (e.g., a context manager as @apaszke suggested).


Let's do 4 iterations to make sure that things are still correct after the first sync.
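The suggestion above, running enough iterations that correctness is checked both before and after the first sync, can be sketched numerically without torch. In this hypothetical stand-in, floats play the role of per-worker gradients and a plain average plays the role of the all-reduce:

```python
# Hypothetical numeric sketch of the suggested 4-iteration test: floats stand
# in for per-worker gradients and averaging stands in for the all-reduce.
# Accumulate locally for two steps, sync, then repeat, so behavior is checked
# both before and after the first sync.

per_step = [[1.0, 3.0], [2.0, 4.0], [0.5, 1.5], [1.5, 2.5]]  # grads per worker
world = [0.0, 0.0]   # accumulated grad on each of two "workers"
sync_every = 2
synced = []          # averaged grad observed at each sync point

for step, grads in enumerate(per_step):
    for rank in range(2):
        world[rank] += grads[rank]          # local accumulation, no sync
    if (step + 1) % sync_every == 0:
        synced.append(sum(world) / 2)       # all-reduce (mean) stand-in
        world = [0.0, 0.0]                  # grads consumed after each sync

assert synced == [5.0, 3.0]  # correct both before and after the first sync
```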

@mrshenli mrshenli May 8, 2019


Shall we add a comment noting that this only works when there is a single model replica per process? Otherwise, _sync_params would broadcast params from replica 0 to the others and erase the accumulated grads.

…ut) (pytorch#20282)

Summary:
Pull Request resolved: pytorch#20282

Add a unit test to ensure that no gradient sync happens when calling ddp.module(input) directly, i.e., without invoking prepare_for_backward.

PyText depends on DDP for data-parallel distributed training. To accumulate gradients locally before syncing them, we call orig_model.forward instead of ddp_model.forward. Add a unit test so that future changes do not break this assumption.

Reviewed By: mrshenli

Differential Revision: D15263155

fbshipit-source-id: f5bd884616b2783064f383820dc6397f8608c3c6
@facebook-github-bot (Contributor)

This pull request has been merged in 2019f6c.

facebook-github-bot pushed a commit that referenced this pull request May 10, 2019
Summary:
Pull Request resolved: #20351

This was broken because of a merge race between #20282 and the stack in #20236.

Cleaned up the test and comments a bit as well.

Differential Revision: D15292786

fbshipit-source-id: a4379ea700cad959d3a6921fc5ddf9384fb8f228

Labels: Merged, oncall: distributed

5 participants