Fix SyncBatchNorm running var update issue #22248
Conversation
@pytorchbot retest this please
facebook-github-bot
left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@mrshenli could pytorch_xla_linux_trusty_py3_6_gcc5_4_build be retested again?
mrshenli
left a comment
Thanks for contributing! Can you add a test in test/test_distributed.py to cover different input sizes?
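For illustration, a minimal sketch of what such a test could look like; the helper name and wiring are hypothetical and not the actual test/test_distributed.py harness. It assumes an initialized process group with one CUDA device per rank.

```python
# Hypothetical sketch of a test that gives each rank a different input size,
# so the gathered per-process counts differ. Not the real test harness.
import torch
import torch.distributed as dist
from torch.nn import SyncBatchNorm

def _sync_bn_with_uneven_inputs(rank, world_size):
    # Different batch size per rank so the gathered counts are unequal.
    batch_size = 2 + rank
    bn = SyncBatchNorm(3).cuda(rank)
    inp = torch.randn(batch_size, 3, 8, 8, device=rank, requires_grad=True)
    out = bn(inp)
    out.sum().backward()
    # The per-process counts should now weight the running stats correctly.
    assert torch.isfinite(bn.running_mean).all()
    assert torch.isfinite(bn.running_var).all()
    assert inp.grad is not None
```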
```yaml
    CUDA: batch_norm_elemt_cuda

# before:
- func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, int count) -> (Tensor, Tensor)
# after:
- func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, Tensor counts) -> (Tensor, Tensor)
```
Will this cause any backward compatibility issues? It's true that this method is only used in torch/nn/modules/_functions.py within PyTorch, but we do expose it as an API, and I am a little worried it is used in other projects. @gchanan is it OK to break BC here? Or should we keep both APIs, where a single count arg is automatically expanded to a list?
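As a hedged illustration of the "expand a single count" idea, a Python-side shim could normalize the argument before dispatching; the helper below is hypothetical and not part of this PR.

```python
# Hypothetical shim: accept either the old single `count` or the new
# per-process `counts`, expanding the former so existing callers keep working.
from typing import List, Sequence, Union

def _expand_counts(count_or_counts: Union[int, Sequence[int]],
                   world_size: int) -> List[int]:
    if isinstance(count_or_counts, int):
        # Old API: every process contributed the same number of elements.
        return [count_or_counts] * world_size
    return list(count_or_counts)
```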
Copy that. Is it possible to change `const Tensor& counts` to `const ArrayRef<index_t> counts`? I don't know the best way to convert an `ArrayRef<index_t>` to a PyTorch Tensor in C++.
Thanks, @hux999 is trying to fix the Tensor counts issue.
> I don't know the best way to convert an `ArrayRef<index_t>` to a PyTorch Tensor in C++.

Can you try using the `from_blob` API (learned from @yf225)? See the forum post.
Use IntArrayRef instead of Tensor for the type of counts
@pytorchbot retest this please
facebook-github-bot
left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
…uple of ints, but found element of type list at pos 1
We have found the reason and are trying to fix it.
@mrshenli is there some CI cache issue?

Jul 01 06:22:58 ======================================================================
Jul 01 06:22:58 ERROR: test_clip_grad_norm (__main__.TestNN)
Jul 01 06:22:58 ----------------------------------------------------------------------
Jul 01 06:22:58 Traceback (most recent call last):
Jul 01 06:22:58 File "test_nn.py", line 1829, in test_clip_grad_norm
Jul 01 06:22:58 norm = clip_grad_norm_(l.parameters(), max_norm, norm_type=norm_type)
Jul 01 06:22:58 File "/opt/conda/lib/python3.6/site-packages/torch/nn/utils/clip_grad.py", line 36, in clip_grad_norm_
Jul 01 06:22:58 clip_coef = torch.tensor(max_norm, device=device) / (total_norm + 1e-6)
Jul 01 06:22:58 UnboundLocalError: local variable 'device' referenced before assignment
Jul 01 06:22:58
Jul 01 06:22:58 ----------------------------------------------------------------------
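For context on why this unrelated failure looks the way it does, here is a reduced illustration of the failure mode the traceback points at; it is not the actual torch/nn/utils/clip_grad.py source, and the function name is hypothetical.

```python
# Reduced illustration of the UnboundLocalError above: `device` is assigned
# only inside the loop body, so it is unbound if no parameter has a gradient.
import torch

def clip_grad_norm_sketch(parameters, max_norm, norm_type=2.0):
    total_norm = 0.0
    for p in parameters:
        if p.grad is not None:
            device = p.grad.device                        # bound only here
            total_norm += float(p.grad.norm(norm_type)) ** norm_type
    total_norm = total_norm ** (1.0 / norm_type)
    # Raises UnboundLocalError if the loop body never ran:
    clip_coef = torch.tensor(max_norm, device=device) / (total_norm + 1e-6)
    return clip_coef
```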
@unlimblue that's not your fault. Other PRs (e.g., #22392) also hit this error.
facebook-github-bot
left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
```yaml
    CUDA: batch_norm_elemt_cuda

# before:
- func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, int count) -> (Tensor, Tensor)
# after:
- func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, int[] counts) -> (Tensor, Tensor)
```
I believe this is still BC-breaking. As mentioned above, can you try keeping both the original one and the new one?
@mrshenli you should probably also deprecate the old one so we can get rid of it in the future.
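One possible shape for that deprecation, sketched in Python; the wrapper name is hypothetical, and the forwarding target is spelled as in current PyTorch, where the counts-taking op is `torch.batch_norm_gather_stats_with_counts`.

```python
# Hypothetical deprecation shim: keep the single-`count` entry point alive
# but warn, forwarding to the counts-based variant. Names are illustrative.
import warnings
import torch

def batch_norm_gather_stats_single_count(input, mean, invstd, running_mean,
                                         running_var, momentum, eps, count):
    warnings.warn(
        "passing a single `count` is deprecated; pass one count per process",
        DeprecationWarning, stacklevel=2)
    world_size = mean.size(0)  # gathered stats carry one row per process
    counts = torch.full((world_size,), count,
                        device=input.device, dtype=mean.dtype)
    return torch.batch_norm_gather_stats_with_counts(
        input, mean, invstd, running_mean, running_var, momentum, eps, counts)
```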
There is an issue tracking the test error: #22052. Ignore it.
facebook-github-bot
left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pytorchbot rebase this please
facebook-github-bot
left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
mrshenli
left a comment
Thanks for contributing!
Summary:

## Fix pytorch/pytorch#22192

+ change signature:
  `func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, Tensor counts) -> (Tensor, Tensor)`
+ change cuda & cuda head:

```cuda
std::tuple<Tensor, Tensor> batch_norm_gather_stats_cuda(
    const Tensor& self, const Tensor& mean, const Tensor& invstd,
    const Tensor& running_mean, const Tensor& running_var,
    double momentum, double epsilon, int64_t count) {
// parameter list becomes:
    const Tensor& running_var, double momentum, double epsilon, const Tensor& counts)
```

+ change python interface:

```python
class SyncBatchNorm(Function):
    def forward(self, input, weight, bias, running_mean, running_var,
                eps, momentum, process_group, world_size):
        ...
```

Pull Request resolved: pytorch/pytorch#22248
Differential Revision: D16002146
Pulled By: mrshenli
fbshipit-source-id: 9007e83928267b89df4d3847aabfbdb63e456956
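As a reading aid, here is a rough sketch of the forward path this summary describes. It is not the in-tree torch/nn/modules/_functions.py code, and the counts-taking op is spelled `torch.batch_norm_gather_stats_with_counts` as in current PyTorch (in this PR it is the `batch_norm_gather_stats` overload shown above).

```python
# Rough sketch: each process computes local stats, all processes exchange
# stats plus their element counts, and the gathered counts weight the
# running-stat update. Not the actual in-tree implementation.
import torch
import torch.distributed as dist

def sync_bn_forward_sketch(input, weight, bias, running_mean, running_var,
                           eps, momentum, process_group, world_size):
    mean, invstd = torch.batch_norm_stats(input, eps)          # local stats
    count = torch.full((1,), input.numel() // input.size(1),   # local N*H*W
                       device=input.device, dtype=mean.dtype)

    mean_all = [torch.empty_like(mean) for _ in range(world_size)]
    invstd_all = [torch.empty_like(invstd) for _ in range(world_size)]
    count_all = [torch.empty_like(count) for _ in range(world_size)]
    dist.all_gather(mean_all, mean, group=process_group)
    dist.all_gather(invstd_all, invstd, group=process_group)
    dist.all_gather(count_all, count, group=process_group)

    # Per-process counts let the gather op weight each process's contribution
    # by its actual batch size instead of assuming equal batches.
    mean, invstd = torch.batch_norm_gather_stats_with_counts(
        input, torch.stack(mean_all), torch.stack(invstd_all),
        running_mean, running_var, momentum, eps, torch.cat(count_all))
    return torch.batch_norm_elemt(input, weight, bias, mean, invstd, eps)
```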
Summary: Previous PR #22248, which provides support for variadic batch sizes across processes, doesn't account for mean_dy/mean_dy_xmu on the backward path, which produces a wrong dgrad.

Pull Request resolved: #36382
Differential Revision: D20984446
Pulled By: ngimel
fbshipit-source-id: 80066eee83760b275d61e2cdd4e86facca5577fd
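A hedged illustration of the fix that summary describes (not the actual #36382 diff, and the helper name is hypothetical): the backward reduction should divide globally summed per-channel gradients by the total element count, rather than averaging per-process means with equal weight.

```python
# Illustrative backward reduction for unequal batch sizes: sum sum_dy and
# sum_dy_xmu across processes, then divide by the global element count.
import torch
import torch.distributed as dist

def reduce_grad_stats_sketch(sum_dy, sum_dy_xmu, local_count, process_group):
    combined = torch.cat([sum_dy, sum_dy_xmu])
    dist.all_reduce(combined, dist.ReduceOp.SUM, group=process_group)
    sum_dy, sum_dy_xmu = torch.split(combined, sum_dy.numel())

    total_count = torch.tensor(float(local_count), device=combined.device)
    dist.all_reduce(total_count, dist.ReduceOp.SUM, group=process_group)

    mean_dy = sum_dy / total_count        # weighted by the true global count
    mean_dy_xmu = sum_dy_xmu / total_count
    return mean_dy, mean_dy_xmu
```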
Fix #22192
func: batch_norm_gather_stats(Tensor input, Tensor mean, Tensor invstd, Tensor? running_mean, Tensor? running_var, float momentum, float eps, Tensor counts) -> (Tensor, Tensor)