Skip to content

Conversation

@pietern
Copy link
Contributor

@pietern pietern commented May 7, 2019

Stack:
    :white_circle:  #20236 Make DistributedDataParallel usable with CPU models  💚
    :black_circle:  #20235 Refactor core DistributedDataParallel tests  💚
    :white_circle:  #20234 Add c10d::broadcast_coalesced and tests  💚

The tests expected to only run for CUDA models. In a future commit we
need to update this to work for CPU models as well. Therefore, we can
no longer rely on only integers being passed for device identifiers.
With this change we pass both the materialized list of devices to use
(as torch.Device objects), as well as an optional list of integers.
The latter is specified to exercise the code in the
DistributedDataParallel constructor that turns a list of integers into
CUDA devices, IFF it is used to wrap a single-device CUDA module.

This commit also groups together the 'str' and non-'str' tests. These
used to test passing the list of devices as integers or as
torch.Device instances. These are now executed from the same test.

Differential Revision: D15245429

Differential Revision: D15245429
Differential Version: 81320055
@pietern pietern requested review from apaszke and mrshenli as code owners May 7, 2019 19:25
@pytorchbot pytorchbot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label May 7, 2019
Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If device_ids is indeed unnecessary, shall we consolidate?

copy.deepcopy(model).cuda(gpus[0]),
device_ids=gpus,
copy.deepcopy(model).to(devices[0]),
device_ids=device_ids,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

device_ids in DDP should be able to directly handle devices, as we wrote:

device_ids (list of int or torch.device): CUDA devices. This should ...

Copy link
Contributor Author

@pietern pietern May 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing tests exercised both and I didn't want to change that. We should test both the list of integers and list of torch.device cases in the suite (it could be broken out as a separate lightweight test instead of performing a full end to end tests).

pietern added 3 commits May 7, 2019 13:40
Differential Revision: D15245429
Differential Version: 81326667
Differential Revision: D15245429
Differential Version: 81330801
Differential Revision: D15245429
Differential Version: 81461821
@mrshenli
Copy link
Contributor

mrshenli commented May 9, 2019

@pytorchbot retest this please

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in f32c9bd.

@pietern pietern deleted the export-D15245429 branch May 9, 2019 22:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Merged oncall: distributed Add this issue/PR to distributed oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants