Conversation

@zhangliliang (Contributor) commented Apr 13, 2019

In line 508, convert_sync_batchnorm is called recursively to convert BatchNorm layers to SyncBatchNorm, so process_group should also be passed into the recursive call.
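For reference, a minimal sketch of the proposed change (the surrounding code is paraphrased from the recursion described above, not quoted from the exact source):

# Inside SyncBatchNorm.convert_sync_batchnorm (sketch, not the exact source).
for name, child in module.named_children():
    # Before: the recursive call dropped process_group, so nested
    # SyncBatchNorm layers fell back to the default group.
    # module_output.add_module(name, cls.convert_sync_batchnorm(child))
    # After: forward process_group so every nested layer uses the same group.
    module_output.add_module(name, cls.convert_sync_batchnorm(child, process_group))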

@ssnl (Collaborator) left a comment

Nice catch!

@ezyang (Contributor) commented Apr 15, 2019

Thank you. Is there an easy way to test this case?

@zhangliliang (Contributor, Author) commented Apr 17, 2019

Thank you. Is there an easy way to test this case?

In my opinion, process_group is used to restrict the group within which means and variances are synchronized in batch norm. When a user calls convert_sync_batchnorm on an nn.Module, they would expect the function to set the same synchronization group on every SyncBatchNorm in the model.

Thus, process_group needs to be passed recursively through the module so that it also reaches SyncBatchNorm layers nested in a sub-module or sub-sub-module (e.g. resnet50 in torchvision.models).

A test case could use this function to convert the BatchNorm layers in resnet50 to SyncBatchNorm, and then check whether every SyncBatchNorm has been assigned the proper process_group; a sketch of such a check follows.
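A minimal sketch of that check, assuming the model has already been converted with a valid process_group (the helper name check_process_group is illustrative, not part of the PR):

import torch.nn as nn

def check_process_group(model, process_group):
    # Every SyncBatchNorm in the converted model should carry the same group.
    for name, m in model.named_modules():
        if isinstance(m, nn.SyncBatchNorm):
            assert m.process_group is process_group, name

# Usage (after dist.init_process_group and process_group = dist.new_group(...)):
#   model = nn.SyncBatchNorm.convert_sync_batchnorm(models.resnet50(), process_group)
#   check_process_group(model, process_group)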

@ezyang (Contributor) commented Apr 19, 2019

Thanks. Do you think you would have time to add this test?

@ezyang self-requested a review April 19, 2019 15:43
@zhangliliang (Contributor, Author)

Thanks. Do you think you would have time to add this test?

Thanks for replying.

Do you mean adding a test case like class PackedSequenceTest(TestCase) in test/test_nn.py?

If so, I can try it.

@ssnl (Collaborator) commented Apr 20, 2019

@zhangliliang It would be adding a test method in test_distributed.py

@zhangliliang (Contributor, Author) commented Apr 21, 2019

@zhangliliang It would be adding a test method in test_distributed.py

@ssnl Got it. I will try.

@zhangliliang (Contributor, Author)

I wrote an example to test this case.
It exits at the second assert, because process_group is not set correctly by nn.SyncBatchNorm.convert_sync_batchnorm.

Do you think this is the right test case? @ssnl @ezyang
If so, I will reorganize it into test_distributed.py.

Thanks.

import copy

import torch
from torch import nn
import torch.distributed as dist
import torchvision.models as models


def convert_sync_batchnorm_fixed(module, process_group=None):
    # Same as nn.SyncBatchNorm.convert_sync_batchnorm, except that
    # process_group is forwarded in the recursive call below.
    module_output = module
    if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
        module_output = torch.nn.SyncBatchNorm(module.num_features,
                                               module.eps, module.momentum,
                                               module.affine,
                                               module.track_running_stats,
                                               process_group)
        if module.affine:
            module_output.weight.data = module.weight.data.clone().detach()
            module_output.bias.data = module.bias.data.clone().detach()
        module_output.running_mean = module.running_mean
        module_output.running_var = module.running_var
        module_output.num_batches_tracked = module.num_batches_tracked
    for name, child in module.named_children():
        # Pass process_group down so BatchNorm layers nested in sub-modules
        # are converted with the same synchronization group.
        module_output.add_module(name, convert_sync_batchnorm_fixed(child, process_group))
    del module
    return module_output


# Single-process setup, just enough to create a process group.
world_size = 1
process_ids = 0

dist.init_process_group(backend='nccl', init_method="tcp://127.0.0.1:1234",
                        world_size=world_size, rank=process_ids)

process_group = torch.distributed.new_group([process_ids])

# Convert the same resnet50 with the original helper and with the fixed one.
res50_model = models.resnet50()
res50_model_sync_ori = nn.SyncBatchNorm.convert_sync_batchnorm(copy.deepcopy(res50_model), process_group)
res50_model_sync_fixed = convert_sync_batchnorm_fixed(copy.deepcopy(res50_model), process_group)

# Inspect a SyncBatchNorm nested inside a sub-module (layer1[0].bn1).
process_group_sync_ori = res50_model_sync_ori.layer1[0].bn1.process_group
process_group_sync_fixed = res50_model_sync_fixed.layer1[0].bn1.process_group

# The fixed helper propagates the group; the original does not,
# so the second assert fails before the fix.
assert process_group_sync_fixed == process_group
assert process_group_sync_ori == process_group

@ezyang (Contributor) commented Apr 22, 2019

The test looks good to me. If you add a little explanation, like what you wrote in your comment, to the code, that would be perfect. 👍

@ssnl (Collaborator) commented Apr 22, 2019

@ezyang Do we always have torchvision installed in CI? If not, the test should probably use a custom network.

@ezyang (Contributor) commented Apr 22, 2019

Yeah, we install torchvision, and there are existing tests which use it. Actually, this is probably changing soon cc @fmassa, but for now it shouldn't be a problem.

@ssnl (Collaborator) commented May 6, 2019

Oh no, we didn't include this in 1.1!

@ssnl (Collaborator) commented May 6, 2019

@pytorchbot rebase this please

@ssnl (Collaborator) commented May 6, 2019

@pytorchbot merge this please

@pytorchbot added the merge-this-please label May 6, 2019
@zhangliliang (Contributor, Author) commented May 6, 2019

Sorry for the late reply; I have been occupied with other things these days.
I added a test case for this PR and tested it on my machine.
Please check whether the code is written appropriately.

@ssnl (Collaborator) commented May 6, 2019

@zhangliliang No worries. It's not your fault. Thanks for your contribution!

@zhangliliang (Contributor, Author)

@zhangliliang No worries. It's not your fault. Thanks for your contribution!

Thanks!

@zhangliliang (Contributor, Author) commented May 6, 2019

@ssnl
It seems that some checks failed because torchvision was not successfully imported.
Should I handle this by removing the torchvision dependency from the test case?

@ezyang (Contributor) commented May 6, 2019

I haven't looked at the PR, but we have tests which are conditionally enabled depending on whether torchvision is available; go take a look at them and copy the pattern.
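A rough sketch of that conditional-skip pattern (the helper and test names here are illustrative; the actual helpers in the test suite may be named differently):

import unittest

try:
    import torchvision  # noqa: F401
    HAS_TORCHVISION = True
except ImportError:
    HAS_TORCHVISION = False

skipIfNoTorchVision = unittest.skipIf(not HAS_TORCHVISION, "no torchvision")


class SyncBatchNormTest(unittest.TestCase):
    @skipIfNoTorchVision
    def test_convert_sync_batchnorm_process_group(self):
        # Convert resnet50 and verify every SyncBatchNorm carries the
        # expected process_group, as in the script above.
        ...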

@ezyang (Contributor) commented May 6, 2019

It looks like there were more changes since the merge request; I will wait for tests.

@zhangliliang (Contributor, Author)

It looks like there were more changes since the merge request; I will wait for tests.

@ezyang
It seems that one error (binary_macos_libtorch_2.7_cpu_build) still exists, but I don't know how to deal with it. Could you give me some ideas?

@zhangliliang (Contributor, Author)

All checks have passed now.

@facebook-github-bot (Contributor) left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor)

@ezyang merged this pull request in f7a7868.
