Complex Number support for distributed #45760

@anjali411

🚀 Feature

As per title, complex numbers should be supported in torch.distributed.

Motivation

Distributed computing support for complex numbers came up in conversations with people at Argonne National Laboratory and the Flatiron Institute. Some of them currently use Uber's Horovod library for distributed computing. The operations they most commonly use are all_reduce and broadcast.

Pitch

  1. NCCL only defines floating-point and integer data types, so one way to get complex tensors working with NCCL is to view them as real using torch.view_as_real(complex_tensor), which returns an equivalent floating-point tensor with a trailing dimension of size 2 (the real and imaginary parts). Since the view shares storage with the original complex tensor, an in-place collective on the view also updates the complex tensor, so we probably don't need to convert it back to complex. If needed, torch.view_as_complex can be used to convert the real tensor back to a complex tensor.
  2. Both torch.view_as_real and torch.view_as_complex are view operations and run in O(1).
    >>> import torch
    >>> z = torch.randn(4, dtype=torch.cfloat)
    >>> torch.view_as_real(z)
    tensor([[-0.4226, 0.5459],
            [ 0.9385, 1.1723],
            [-0.9454, -0.3572],
            [ 0.0624, 1.0193]])
  3. Ops that should work with complex:
    1. torch.distributed.all_reduce
    2. torch.distributed.all_gather
  4. Testing
    1. https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d.py
    2. https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/distributed/distributed_test.py
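
A minimal sketch of how the view-based approach could look for a sum all_reduce. The helper name `all_reduce_complex_sum` is hypothetical and is not an existing PyTorch API; it assumes the default process group has already been initialized. The lines after the helper only check the storage-sharing claim locally, with an in-place add on the real view standing in for the collective and made-up data standing in for another rank.

```python
import torch
import torch.distributed as dist


def all_reduce_complex_sum(tensor: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper (not a PyTorch API): sum-reduce `tensor` in place.

    Assumes the default process group is already initialized.
    """
    if tensor.is_complex():
        # The real view has shape (..., 2) and shares storage with `tensor`,
        # so the in-place collective also updates the complex tensor.
        dist.all_reduce(torch.view_as_real(tensor), op=dist.ReduceOp.SUM)
    else:
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    return tensor


# Local check of the storage-sharing claim; no process group is needed.
# An in-place add on the real view stands in for the collective.
z = torch.tensor([1 + 2j, 3 - 1j], dtype=torch.cfloat)
other_rank = torch.tensor([10 + 20j, 30 - 10j], dtype=torch.cfloat)  # made-up data
torch.view_as_real(z).add_(torch.view_as_real(other_rank))
print(z)  # z now holds the elementwise complex sum
```

The sketch is limited to SUM on purpose: adding the (real, imaginary) pairs elementwise is the same as complex addition, whereas an elementwise product on the real view is not complex multiplication, and MIN/MAX have no natural complex meaning. For all_gather, the same idea would presumably apply to the input and to each tensor in the output list.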

cc @ezyang @anjali411 @dylanbespalko @mruberry @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @xush6528 @osalpekar @jiayisuse @agolynski

    Labels

    module: bootcamp (We plan to do a full writeup on the issue, and then get someone to do it for onboarding)
    module: c10d (Issues/PRs related to collective communications and process groups)
    module: complex (Related to complex number support in PyTorch)
    oncall: distributed (Add this issue/PR to distributed oncall triage queue)
    pt_distributed_rampup (Ramp up tasks for new developers on PT distributed)
    triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
