Make grad point to bucket buffer in DDP to save memory usage#41954
Make grad point to bucket buffer in DDP to save memory usage#41954zhaojuanmao wants to merge 8 commits intogh/zhaojuanmao/49/basefrom
Conversation
Make both variable.grad() and grad in distautograd context point to bucket buffer in DDP to save memory usage. In this case, grad will be view of bucket buffer tensors, in order to make it compatiable with optimizer.zero_grad(), we made changes in #41283. Also be noted that we can not make variable.grad() pointing to bucket buffer during construction time, because we want to keep grad undefined for unused parameters. Differential Revision: [D22707857](https://our.internmc.facebook.com/intern/diff/D22707857/) [ghstack-poisoned]
Make both variable.grad() and grad in distautograd context point to bucket buffer in DDP to save memory usage. In this case, grad will be view of bucket buffer tensors, in order to make it compatiable with optimizer.zero_grad(), we made changes in #41283. Also be noted that we can not make variable.grad() pointing to bucket buffer during construction time, because we want to keep grad undefined for unused parameters. Differential Revision: [D22707857](https://our.internmc.facebook.com/intern/diff/D22707857/) ghstack-source-id: 108384696 Pull Request resolved: #41954
💊 CI failures summary and remediationsAs of commit 2082a5a (more details on the Dr. CI page):
🚧 1 fixed upstream failure:These were probably caused by upstream breakages that were already fixed.
Please rebase on the
|
Make both variable.grad() and grad in distautograd context point to bucket buffer in DDP to save memory usage. In this case, grad will be view of bucket buffer tensors, in order to make it compatiable with optimizer.zero_grad(), we made changes in #41283. Also be noted that we can not make variable.grad() pointing to bucket buffer during construction time, because we want to keep grad undefined for unused parameters. Differential Revision: [D22707857](https://our.internmc.facebook.com/intern/diff/D22707857/) [ghstack-poisoned]
Pull Request resolved: #41954 Make both variable.grad() and grad in distautograd context point to bucket buffer in DDP to save memory usage. In this case, grad will be view of bucket buffer tensors, in order to make it compatiable with optimizer.zero_grad(), we made changes in #41283. Also be noted that we can not make variable.grad() pointing to bucket buffer during construction time, because we want to keep grad undefined for unused parameters. ghstack-source-id: 108461312 Differential Revision: [D22707857](https://our.internmc.facebook.com/intern/diff/D22707857/)
Make both variable.grad() and grad in distautograd context point to bucket buffer in DDP to save memory usage. In this case, grad will be view of bucket buffer tensors, in order to make it compatiable with optimizer.zero_grad(), we made changes in #41283. Also be noted that we can not make variable.grad() pointing to bucket buffer during construction time, because we want to keep grad undefined for unused parameters. Differential Revision: [D22707857](https://our.internmc.facebook.com/intern/diff/D22707857/) [ghstack-poisoned]
Pull Request resolved: #41954 Make both variable.grad() and grad in distautograd context point to bucket buffer in DDP to save memory usage. In this case, grad will be view of bucket buffer tensors, in order to make it compatiable with optimizer.zero_grad(), we made changes in #41283. Also be noted that we can not make variable.grad() pointing to bucket buffer during construction time, because we want to keep grad undefined for unused parameters. ghstack-source-id: 108498988 Differential Revision: [D22707857](https://our.internmc.facebook.com/intern/diff/D22707857/)
Make both variable.grad() and grad in distautograd context point to bucket buffer in DDP to save memory usage. In this case, grad will be view of bucket buffer tensors, in order to make it compatiable with optimizer.zero_grad(), we made changes in #41283. Also be noted that we can not make variable.grad() pointing to bucket buffer during construction time, because we want to keep grad undefined for unused parameters. Differential Revision: [D22707857](https://our.internmc.facebook.com/intern/diff/D22707857/) [ghstack-poisoned]
Pull Request resolved: #41954 Make both variable.grad() and grad in distautograd context point to bucket buffer in DDP to save memory usage. In this case, grad will be view of bucket buffer tensors, in order to make it compatiable with optimizer.zero_grad(), we made changes in #41283. Also be noted that we can not make variable.grad() pointing to bucket buffer during construction time, because we want to keep grad undefined for unused parameters. ghstack-source-id: 108579189 Differential Revision: [D22707857](https://our.internmc.facebook.com/intern/diff/D22707857/)
|
Would be nice if we could also share some memory saving results as part of the PR description. |
Make both variable.grad() and grad in distautograd context point to bucket buffer in DDP to save memory usage. In this case, grad will be view of bucket buffer tensors, in order to make it compatiable with optimizer.zero_grad(), we made changes in #41283. Also be noted that we can not make variable.grad() pointing to bucket buffer during construction time, because we want to keep grad undefined for unused parameters. Differential Revision: [D22707857](https://our.internmc.facebook.com/intern/diff/D22707857/) [ghstack-poisoned]
Pull Request resolved: #41954 Make both variable.grad() and grad in distautograd context point to bucket buffer in DDP to save memory usage. In this case, grad will be view of bucket buffer tensors, in order to make it compatiable with optimizer.zero_grad(), we made changes in #41283. Also be noted that we can not make variable.grad() pointing to bucket buffer during construction time, because we want to keep grad undefined for unused parameters. Differential Revision: [D22707857](https://our.internmc.facebook.com/intern/diff/D22707857/) ghstack-source-id: 109198898
albanD
left a comment
There was a problem hiding this comment.
Looks mostly good.
I would add a comment here as well:
pytorch/torch/csrc/autograd/functions/accumulate_grad.h
Lines 176 to 177 in c660d2a
|
@albanD thanks for your review. I will add comments in accumulate_grad.h. But be noted, if grads are mutated in place, DDP can save memory; if grads are mutated out of place somehow, although DDP can not save memory, it is still working at it is today (checking grad is alias of bucket buffer or not, if not, copying grad to bucket buffer) |
Make both variable.grad() and grad in distautograd context point to bucket buffer in DDP to save memory usage. In this case, grad will be view of bucket buffer tensors, in order to make it compatiable with optimizer.zero_grad(), we made changes in #41283. Also be noted that we can not make variable.grad() pointing to bucket buffer during construction time, because we want to keep grad undefined for unused parameters. Test Plans: For roberta_base model with 1GB parameters, this diff can save 1GB memory. DDP without this diff, peak allocated memory during training loop is 8250 MB; DDP with this diff, peak allocated memory during training loop is 7182 MB; Differential Revision: [D22707857](https://our.internmc.facebook.com/intern/diff/D22707857/) [ghstack-poisoned]
Pull Request resolved: #41954 Make both variable.grad() and grad in distautograd context point to bucket buffer in DDP to save memory usage. In this case, grad will be view of bucket buffer tensors, in order to make it compatiable with optimizer.zero_grad(), we made changes in #41283. Also be noted that we can not make variable.grad() pointing to bucket buffer during construction time, because we want to keep grad undefined for unused parameters. ghstack-source-id: 109704136 Differential Revision: [D22707857](https://our.internmc.facebook.com/intern/diff/D22707857/)
|
Yes it will still give the right result I agree but it will defeat the purpose of this optimization (silently!). So it might be hard to detect that changes there actually break this optimization. Hence my request to add a comment there to make people aware that this can happen. |
Make both variable.grad() and grad in distautograd context point to bucket buffer in DDP to save memory usage. In this case, grad will be view of bucket buffer tensors, in order to make it compatiable with optimizer.zero_grad(), we made changes in #41283. Also be noted that we can not make variable.grad() pointing to bucket buffer during construction time, because we want to keep grad undefined for unused parameters. Test Plans: For roberta_base model with 1GB parameters, this diff can save 1GB memory. DDP without this diff, peak allocated memory during training loop is 8250 MB; DDP with this diff, peak allocated memory during training loop is 7182 MB; Differential Revision: [D22707857](https://our.internmc.facebook.com/intern/diff/D22707857/) [ghstack-poisoned]
…emory usage" reland #41954 Add one argument in DDP API to enable/disable letting grads pointing to views. When it is disabled, behavior is the same as DDP right now; when it is enabled, Make both variable.grad() and grad in distautograd context point to bucket buffer in DDP to save memory usage. In this case, grad will be view of bucket buffer tensors, in order to make it compatiable with optimizer.zero_grad(), we made changes in #41283. Also be noted that we can not make variable.grad() pointing to bucket buffer during construction time, because we want to keep grad undefined for unused parameters. Differential Revision: [D23588186](https://our.internmc.facebook.com/intern/diff/D23588186/) [ghstack-poisoned]
Pull Request resolved: #44344 reland #41954 Add one argument in DDP API to enable/disable letting grads pointing to views. When it is disabled, behavior is the same as DDP right now; when it is enabled, Make both variable.grad() and grad in distautograd context point to bucket buffer in DDP to save memory usage. In this case, grad will be view of bucket buffer tensors, in order to make it compatiable with optimizer.zero_grad(), we made changes in #41283. Also be noted that we can not make variable.grad() pointing to bucket buffer during construction time, because we want to keep grad undefined for unused parameters. ghstack-source-id: 111827320 Differential Revision: [D23588186](https://our.internmc.facebook.com/intern/diff/D23588186/)
…ing of second iteration" Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration Differential Revision: [D23583017](https://our.internmc.facebook.com/intern/diff/D23583017/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23583017/)! [ghstack-poisoned]
…nd iteration Pull Request resolved: #44326 Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration ghstack-source-id: 112011490 Differential Revision: [D23583017](https://our.internmc.facebook.com/intern/diff/D23583017/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23583017/)!
Part of relanding PR #41954, this refactor is to seperate intialize_bucket_views and populate_bucket_views_out, as they are doing different things and called by different callsites as well Differential Revision: [D23583347](https://our.internmc.facebook.com/intern/diff/D23583347/) [ghstack-poisoned]
Pull Request resolved: #44330 Part of relanding PR #41954, this refactor is to seperate intialize_bucket_views and populate_bucket_views_out, as they are doing different things and called by different callsites as well ghstack-source-id: 112022404 Differential Revision: [D23583347](https://our.internmc.facebook.com/intern/diff/D23583347/)
…nd iteration (#44326) Summary: Pull Request resolved: #44326 Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration ghstack-source-id: 112011490 Test Plan: unit tests Reviewed By: mrshenli Differential Revision: D23583017 fbshipit-source-id: ef67f79437a820d9b5699b651803622418499a83
Part of relanding PR #41954, this refactor is to seperate intialize_bucket_views and populate_bucket_views_out, as they are doing different things and called by different callsites as well Differential Revision: [D23583347](https://our.internmc.facebook.com/intern/diff/D23583347/) [ghstack-poisoned]
Part of relanding PR #41954, this refactor is to seperate intialize_bucket_views and populate_bucket_views_out, as they are doing different things and called by different callsites as well Differential Revision: [D23583347](https://our.internmc.facebook.com/intern/diff/D23583347/) [ghstack-poisoned]
[test all] Pull Request resolved: #44330 Part of relanding PR #41954, this refactor is to seperate intialize_bucket_views and populate_bucket_views_out, as they are doing different things and called by different callsites as well ghstack-source-id: 112185672 Differential Revision: [D23583347](https://our.internmc.facebook.com/intern/diff/D23583347/)
…g of second iteration [test all] Update for relanding: in ddp.join(), moved _rebuild_buckets from end of backward to beginning of forward as well. Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration ghstack-source-id: 112011490 Differential Revision: [D23735185](https://our.internmc.facebook.com/intern/diff/D23735185/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23735185/)! [ghstack-poisoned]
…to beginning of second iteration" [test all] Update for relanding: in ddp.join(), moved _rebuild_buckets from end of backward to beginning of forward as well. Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration Differential Revision: [D23735185](https://our.internmc.facebook.com/intern/diff/D23735185/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23735185/)! [ghstack-poisoned]
…emory usage" reland #41954 Add one argument in DDP API to enable/disable letting grads pointing to views. When it is disabled, behavior is the same as DDP right now; when it is enabled, Make both variable.grad() and grad in distautograd context point to bucket buffer in DDP to save memory usage. In this case, grad will be view of bucket buffer tensors, in order to make it compatiable with optimizer.zero_grad(), we made changes in #41283. Also be noted that we can not make variable.grad() pointing to bucket buffer during construction time, because we want to keep grad undefined for unused parameters. Differential Revision: [D23588186](https://our.internmc.facebook.com/intern/diff/D23588186/) [ghstack-poisoned]
[test all] Pull Request resolved: #44344 reland #41954 Add one argument in DDP API to enable/disable letting grads pointing to views. When it is disabled, behavior is the same as DDP right now; when it is enabled, Make both variable.grad() and grad in distautograd context point to bucket buffer in DDP to save memory usage. In this case, grad will be view of bucket buffer tensors, in order to make it compatiable with optimizer.zero_grad(), we made changes in #41283. Also be noted that we can not make variable.grad() pointing to bucket buffer during construction time, because we want to keep grad undefined for unused parameters. ghstack-source-id: 112194326 Differential Revision: [D23588186](https://our.internmc.facebook.com/intern/diff/D23588186/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23588186/)!
…emory usage" reland #41954 Add one argument in DDP API to enable/disable letting grads pointing to views. When it is disabled, behavior is the same as DDP right now; when it is enabled, Make both variable.grad() and grad in distautograd context point to bucket buffer in DDP to save memory usage. In this case, grad will be view of bucket buffer tensors, in order to make it compatiable with optimizer.zero_grad(), we made changes in #41283. Also be noted that we can not make variable.grad() pointing to bucket buffer during construction time, because we want to keep grad undefined for unused parameters. Differential Revision: [D23588186](https://our.internmc.facebook.com/intern/diff/D23588186/) [ghstack-poisoned]
Part of relanding PR #41954, this refactor is to seperate intialize_bucket_views and populate_bucket_views_out, as they are doing different things and called by different callsites as well Differential Revision: [D23583347](https://our.internmc.facebook.com/intern/diff/D23583347/) [ghstack-poisoned]
Part of relanding PR #41954, this refactor is to seperate intialize_bucket_views and populate_bucket_views_out, as they are doing different things and called by different callsites as well Differential Revision: [D23583347](https://our.internmc.facebook.com/intern/diff/D23583347/) [ghstack-poisoned]
…to beginning of second iteration" [test all] Update for relanding: in ddp.join(), moved _rebuild_buckets from end of backward to beginning of forward as well. Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration Differential Revision: [D23735185](https://our.internmc.facebook.com/intern/diff/D23735185/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23735185/)! [ghstack-poisoned]
Part of relanding PR #41954, this refactor is to seperate intialize_bucket_views and populate_bucket_views_out, as they are doing different things and called by different callsites as well Differential Revision: [D23583347](https://our.internmc.facebook.com/intern/diff/D23583347/) [ghstack-poisoned]
Part of relanding PR #41954, this refactor is to seperate intialize_bucket_views and populate_bucket_views_out, as they are doing different things and called by different callsites as well Differential Revision: [D23583347](https://our.internmc.facebook.com/intern/diff/D23583347/) [ghstack-poisoned]
[test all] Pull Request resolved: #44330 Part of relanding PR #41954, this refactor is to seperate intialize_bucket_views and populate_bucket_views_out, as they are doing different things and called by different callsites as well ghstack-source-id: 112243783 Differential Revision: [D23583347](https://our.internmc.facebook.com/intern/diff/D23583347/)
…to beginning of second iteration" [test all] Update for relanding: in ddp.join(), moved _rebuild_buckets from end of backward to beginning of forward as well. Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration Differential Revision: [D23735185](https://our.internmc.facebook.com/intern/diff/D23735185/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23735185/)! [ghstack-poisoned]
…g of second iteration Pull Request resolved: #44798 [test all] Update for relanding: in ddp.join(), moved _rebuild_buckets from end of backward to beginning of forward as well. Part of relanding PR #41954, this refactoring is to move rebuild_buckets call from end of first iteration to beginning of second iteration ghstack-source-id: 112244222 ghstack-source-id: 112244222 Differential Revision: [D23735185](https://our.internmc.facebook.com/intern/diff/D23735185/) **NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23735185/)!
Stack from ghstack:
Make both variable.grad() and grad in distautograd context point to bucket buffer in DDP to save memory usage.
In this case, grad will be view of bucket buffer tensors, in order to make it compatiable with optimizer.zero_grad(), we
made changes in #41283.
Also be noted that we can not make variable.grad() pointing to bucket buffer during construction time, because we want to
keep grad undefined for unused parameters.
Test Plans:
For roberta_base model with ~1GB parameters, peak memory dropped ~1GB (8250MB-7183MB). Per iteration latency (0.982s ->0.909s), 8% speed up; will rerun a few times to confirm the speed up
For resnet model with ~97M parameters, peak memory dropped ~100MB (3089MB -> 2988MB). Per iteration latency has no change (0.122s -> 0.123s)
Differential Revision: D22707857