Skip to content

move rebuild buckets from end of first iteration to beginning of second iteration#44326

Closed
zhaojuanmao wants to merge 5 commits intogh/zhaojuanmao/51/basefrom
gh/zhaojuanmao/51/head
Closed

move rebuild buckets from end of first iteration to beginning of second iteration#44326
zhaojuanmao wants to merge 5 commits intogh/zhaojuanmao/51/basefrom
gh/zhaojuanmao/51/head

Conversation

@zhaojuanmao
Copy link
Copy Markdown
Contributor

@zhaojuanmao zhaojuanmao commented Sep 8, 2020

Stack from ghstack:

Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration

Differential Revision: D23583017

NOTE FOR REVIEWERS: This PR has internal Facebook specific changes or comments, please review them on Phabricator!

…nd iteration

Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration

Differential Revision: [D23583017](https://our.internmc.facebook.com/intern/diff/D23583017/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23583017/)!

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Sep 8, 2020
…nd iteration

Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration

Differential Revision: [D23583017](https://our.internmc.facebook.com/intern/diff/D23583017/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23583017/)!

ghstack-source-id: 111611582
Pull Request resolved: #44326
@dr-ci
Copy link
Copy Markdown

dr-ci bot commented Sep 8, 2020

💊 CI failures summary and remediations

As of commit 4ea2a11 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_ge_config_simple_test (1/1)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Sep 14 21:34:08 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future
Sep 14 21:34:08 At: 
Sep 14 21:34:08   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Sep 14 21:34:08   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Sep 14 21:34:08  
Sep 14 21:34:08 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Sep 14 21:34:08  
Sep 14 21:34:08 At: 
Sep 14 21:34:08   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Sep 14 21:34:08   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Sep 14 21:34:08  
Sep 14 21:34:08 [E request_callback_no_python.cpp:618] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Sep 14 21:34:08  
Sep 14 21:34:08 At: 
Sep 14 21:34:08   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(93): serialize 
Sep 14 21:34:08   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(145): serialize 
Sep 14 21:34:08  
Sep 14 21:34:09 ok (1.540s) 
Sep 14 21:34:10   test_return_future_remote (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.539s) 
Sep 14 21:34:12   test_return_local_rrefs (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.540s) 
Sep 14 21:34:13   test_rpc_profiling_remote_record_function (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.542s) 
Sep 14 21:34:15   test_rpc_return_rref (__main__.ProcessGroupRpcTestWithSpawn) ... ok (1.543s) 

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 20 times.

…ing of second iteration"

Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration

Differential Revision: [D23583017](https://our.internmc.facebook.com/intern/diff/D23583017/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23583017/)!

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Sep 8, 2020
…nd iteration

Pull Request resolved: #44326

Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration
ghstack-source-id: 111630708

Differential Revision: [D23583017](https://our.internmc.facebook.com/intern/diff/D23583017/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23583017/)!
…ing of second iteration"

Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration

Differential Revision: [D23583017](https://our.internmc.facebook.com/intern/diff/D23583017/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23583017/)!

[ghstack-poisoned]
…ing of second iteration"

Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration

Differential Revision: [D23583017](https://our.internmc.facebook.com/intern/diff/D23583017/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23583017/)!

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Sep 10, 2020
…nd iteration

Pull Request resolved: #44326

Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration
ghstack-source-id: 111790710

Differential Revision: [D23583017](https://our.internmc.facebook.com/intern/diff/D23583017/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23583017/)!
@codecov
Copy link
Copy Markdown

codecov bot commented Sep 10, 2020

Codecov Report

Merging #44326 into gh/zhaojuanmao/51/base will increase coverage by 0.04%.
The diff coverage is 100.00%.

Impacted file tree graph

@@                    Coverage Diff                     @@
##           gh/zhaojuanmao/51/base   #44326      +/-   ##
==========================================================
+ Coverage                   68.00%   68.04%   +0.04%     
==========================================================
  Files                         384      382       -2     
  Lines                       49599    49468     -131     
==========================================================
- Hits                        33728    33661      -67     
+ Misses                      15871    15807      -64     
Impacted Files Coverage Δ
torch/nn/parallel/distributed.py 43.17% <100.00%> (+2.03%) ⬆️
torch/quantization/quantize_fx.py 90.56% <0.00%> (-3.78%) ⬇️
torch/utils/_benchmark/utils/timer.py 85.18% <0.00%> (-3.71%) ⬇️
torch/quantization/fuse_modules.py 96.29% <0.00%> (-1.86%) ⬇️
torch/quantization/__init__.py 85.71% <0.00%> (-1.79%) ⬇️
torch/_tensor_str.py 86.07% <0.00%> (-1.69%) ⬇️
torch/testing/_internal/expecttest.py 77.55% <0.00%> (-1.03%) ⬇️
torch/quantization/fx/quantize.py 96.31% <0.00%> (-0.89%) ⬇️
torch/quantization/fx/fusion_patterns.py 87.20% <0.00%> (-0.89%) ⬇️
torch/autograd/__init__.py 84.84% <0.00%> (-0.87%) ⬇️
... and 27 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c68a99b...4ea2a11. Read the comment docs.

Copy link
Copy Markdown
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would I be correct if assume the motivation is to rebuild buckets after the backward pass, to avoid increasing peak memory consumption?


def forward(self, *inputs, **kwargs):
# Calling _rebuild_buckets before forward compuation,
# It may allocate new buckets before deallocating old buckets
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may allocate new buckets before deallocating old buckets

Why? at this point, looks like old buckets are no longer necessary. Is it possible to first delete old buckets and then allocate new buckets?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grad points to old buckets, we can not simply delete old buckets as users may not want to delete the grads

@zhaojuanmao
Copy link
Copy Markdown
Contributor Author

yes, memory usage is low after backward pass and before forward computation, rebuildBuckets right before forward computation can save peak memory usage if we point grad to bucket_views later on

…ing of second iteration"

Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration

Differential Revision: [D23583017](https://our.internmc.facebook.com/intern/diff/D23583017/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23583017/)!

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Sep 14, 2020
…nd iteration

Pull Request resolved: #44326

Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration
ghstack-source-id: 112011490

Differential Revision: [D23583017](https://our.internmc.facebook.com/intern/diff/D23583017/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23583017/)!
@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request has been merged in f5d231d.

xuzhao9 pushed a commit that referenced this pull request Sep 18, 2020
…nd iteration (#44326)

Summary:
Pull Request resolved: #44326

Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration
ghstack-source-id: 112011490

Test Plan: unit tests

Reviewed By: mrshenli

Differential Revision: D23583017

fbshipit-source-id: ef67f79437a820d9b5699b651803622418499a83
@facebook-github-bot facebook-github-bot deleted the gh/zhaojuanmao/51/head branch September 19, 2020 14:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants