
Conversation

@agolynski
Contributor

Test Plan:
buck test caffe2/test:c10d
buck test caffe2/test:distributed_gloo

Differential Revision: D18209289

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D18209289

@pietern pietern left a comment
Contributor

In torch/csrc/distributed/c10d/comm.cpp, we have implemented broadcast_coalesced as a function that takes a process group and a vector of tensors. I think we should do the same for allgather_coalesced, since it can be built on top of a regular allgather, regardless of the process group implementation. In the current approach, we'd need to duplicate the functionality between the process groups.
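For concreteness, here is a minimal sketch of the layering pietern describes, assuming the c10d allgather signature of this era; the function name, the flatten/split strategy, and the assumption that output_lists holds one list per rank are all illustrative, not the actual comm.cpp implementation:

```cpp
#include <c10d/ProcessGroup.hpp>
#include <torch/torch.h>

#include <stdexcept>
#include <vector>

// Hypothetical sketch only -- not the real comm.cpp code. It layers a
// coalesced allgather on top of the plain ProcessGroup::allgather, so
// any backend that implements allgather (Gloo, NCCL, MPI) supports it.
void allgather_coalesced_sketch(
    c10d::ProcessGroup& pg,
    std::vector<std::vector<at::Tensor>>& output_lists, // assumed: one list per rank
    std::vector<at::Tensor>& inputs) {
  if (inputs.empty() || output_lists.empty()) {
    throw std::invalid_argument("tensor lists must be non-empty");
  }

  // Coalesce: flatten all inputs (assumed same dtype/device) into one
  // 1-D buffer so that a single allgather moves every tensor at once.
  std::vector<at::Tensor> flat;
  for (const auto& t : inputs) {
    flat.push_back(t.reshape({-1}));
  }
  std::vector<at::Tensor> flat_input{at::cat(flat)};

  // One flat receive buffer per rank in the group.
  std::vector<std::vector<at::Tensor>> flat_outputs(1);
  for (size_t i = 0; i < output_lists.size(); ++i) {
    flat_outputs[0].push_back(at::empty_like(flat_input[0]));
  }

  // A single regular allgather; no backend-specific code required.
  pg.allgather(flat_outputs, flat_input)->wait();

  // De-coalesce: slice each rank's flat buffer back into the
  // individual output tensors.
  for (size_t rank = 0; rank < output_lists.size(); ++rank) {
    int64_t offset = 0;
    for (auto& out : output_lists[rank]) {
      out.copy_(flat_outputs[0][rank]
                    .narrow(0, offset, out.numel())
                    .view(out.sizes()));
      offset += out.numel();
    }
  }
}
```

Because the sketch only calls ProcessGroup::allgather, every backend that implements the plain collective would inherit the coalesced variant without per-process-group duplication.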

@mrshenli
Contributor

> I think we should do the same for allgather_coalesced, since it can be built on top of a regular allgather, regardless of the process group implementation.

@pietern just to make sure we understand your comment: are you suggesting moving the allgather_coalesced implementation in this PR to comm.cpp? In that case, 1) do we still expose the ProcessGroup::allgather_coalesced API, or does the API live only in torch.distributed? 2) I would assume we should do the same for allreduce_coalesced too?

@pietern
Contributor

pietern commented Oct 30, 2019

Yes, I think it should live as a separate function. I just checked, and there is indeed a coalesced variant declared in the ProcessGroup base class, but it is only defined for the Gloo implementation. I think the coalesced variants are generic enough not to need multiple implementations, and that they can be built on top of the collectives we expose today. Then both CPU and GPU support, and both Gloo and NCCL support, come for free.

@agolynski
Contributor Author

> Yes, I think it should live as a separate function. I just checked, and there is indeed a coalesced variant declared in the ProcessGroup base class, but it is only defined for the Gloo implementation. I think the coalesced variants are generic enough not to need multiple implementations, and that they can be built on top of the collectives we expose today. Then both CPU and GPU support, and both Gloo and NCCL support, come for free.

Agreed. As discussed offline, let's do it in a follow-up PR?

@mrshenli mrshenli left a comment
Contributor

LGTM. I added some cosmetic comments. Based on the offline discussion, we will first land this PR. Then, @zappa- will work on a follow-up PR to move this implementation to comm.cpp.

Thanks a lot for working on this! Please get another stamp from @pietern before landing.

Contributor

any reason for not using assertNonEmpty?

Contributor

Oh I see, you are trying to stay consistent with ProcessGroupGloo::allgather.

Contributor

These checks are already done in ProcessGroupGloo::allgather_coalesced; any reason for doing them again here?

Contributor Author

This is because the linter complains about possible out-of-bounds access on element 0 of these vectors.
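For readers following along, the guard under discussion might look like the sketch below (variable names are hypothetical); it duplicates a check that ProcessGroupGloo::allgather_coalesced already performs, solely so a static analyzer can prove the element-0 access is in bounds:

```cpp
// Illustrative sketch only: re-checking non-emptiness before indexing
// element 0, so the linter cannot flag a potential out-of-bounds read.
if (inputs.empty()) {
  throw std::invalid_argument("requires a non-empty list of input tensors");
}
const auto& first = inputs[0]; // safe: emptiness ruled out above
```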

@pietern pietern left a comment
Contributor

Let's merge this.

@zappa- Can you create issues for the follow-up work: one to move this functionality into comm.cpp, and one to move allreduce_coalesced into comm.cpp?

@kostmo
Member

kostmo commented Oct 31, 2019

CircleCI build failures summary

As of commit 15b2fc8:

  • 0/2 flaky

This comment was automatically generated by Dr. CI.

@agolynski
Contributor Author

Added issue #29033 to move allgather_coalesced to comm.h. Unfortunately, I cannot assign it since I don't have write permissions to the repo.

@agolynski
Contributor Author

Thank you for the reviews! Shipping it...

Summary: Pull Request resolved: pytorch#28857

Test Plan:
buck test caffe2/test:c10d
buck test caffe2/test:distributed_gloo

Reviewed By: mrshenli

Differential Revision: D18209289

fbshipit-source-id: f5f394635e60a9d4ce016e366415d90f1daa4423

@ezyang
Contributor

ezyang commented Nov 1, 2019

This breaks CUDA builds.

Nov 01 18:14:20 /opt/cache/lib/nvcc /var/lib/jenkins/workspace/modules/detectron/spatial_narrow_as_op.cu -c -o /var/lib/jenkins/cpp-build/caffe2/build/modules/detectron/CMakeFiles/caffe2_detectron_ops_gpu.dir//./caffe2_detectron_ops_gpu_generated_spatial_narrow_as_op.cu.o -ccbin /opt/cache/bin/cc -m64 -Dcaffe2_detectron_ops_gpu_EXPORTS -DTH_BLAS_MKL -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DMAGMA_V2 -DIDEEP_USE_MKL -DHAVE_MMAP=1 -D_FILE_OFFSET_BITS=64 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DHAVE_MALLOC_USABLE_SIZE=1 -D_THP_CORE -DAT_PARALLEL_OPENMP=1 -Xcompiler ,\"-Wno-deprecated\",\"-fvisibility-inlines-hidden\",\"-fopenmp\",\"-DUSE_FBGEMM\",\"-DUSE_QNNPACK\",\"-DUSE_PYTORCH_QNNPACK\",\"-O2\",\"-fPIC\",\"-Wno-narrowing\",\"-Wall\",\"-Wextra\",\"-Wno-missing-field-initializers\",\"-Wno-type-limits\",\"-Wno-array-bounds\",\"-Wno-unknown-pragmas\",\"-Wno-sign-compare\",\"-Wno-unused-parameter\",\"-Wno-unused-variable\",\"-Wno-unused-function\",\"-Wno-unused-result\",\"-Wno-strict-overflow\",\"-Wno-strict-aliasing\",\"-Wno-error=deprecated-declarations\",\"-Wno-error=pedantic\",\"-Wno-error=redundant-decls\",\"-Wno-error=old-style-cast\",\"-fdiagnostics-color=always\",\"-Wno-unused-but-set-variable\",\"-Wno-maybe-uninitialized\",\"-fno-math-errno\",\"-fno-trapping-math\",\"-fPIC\",\"-g\",\"-fno-omit-frame-pointer\",\"-O0\" -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_52,code=sm_52 -Xcudafe --diag_suppress=cc_clobber_ignored -Xcudafe --diag_suppress=integer_sign_change -Xcudafe --diag_suppress=useless_using_declaration -Xcudafe --diag_suppress=set_but_not_used -std=c++11 -Xcompiler -fPIC --expt-relaxed-constexpr --expt-extended-lambda -Wno-deprecated-gpu-targets --expt-extended-lambda -Xfatbin -compress-all -gencode arch=compute_52,code=sm_52 -Xcompiler -fPIC -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -DNVCC -I/usr/local/cuda/include -I/var/lib/jenkins/cpp-build/caffe2/build/aten/src -I/var/lib/jenkins/workspace/aten/src -I/var/lib/jenkins/cpp-build/caffe2/build -I/var/lib/jenkins/workspace -I/var/lib/jenkins/cpp-build/caffe2/build/third_party/gloo -I/var/lib/jenkins/workspace/cmake/../third_party/gloo -I/var/lib/jenkins/workspace/cmake/../third_party/googletest/googlemock/include -I/var/lib/jenkins/workspace/cmake/../third_party/googletest/googletest/include -I/var/lib/jenkins/workspace/third_party/protobuf/src -I/opt/conda/include -I/var/lib/jenkins/workspace/third_party/gemmlowp -I/var/lib/jenkins/workspace/third_party/neon2sse -I/var/lib/jenkins/workspace/cmake/../third_party/benchmark/include -I/var/lib/jenkins/workspace/third_party -I/usr/include -I/var/lib/jenkins/workspace/cmake/../third_party/eigen -I/var/lib/jenkins/workspace/cmake/../third_party/pybind11/include -I/usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent -I/usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent/include -I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi -I/opt/rocm/hip/include -I/include -I/var/lib/jenkins/workspace/cmake/../third_party/cub -I/var/lib/jenkins/cpp-build/caffe2/build/caffe2/contrib/aten -I/var/lib/jenkins/workspace/third_party/onnx -I/var/lib/jenkins/cpp-build/caffe2/build/third_party/onnx -I/var/lib/jenkins/workspace/third_party/foxi -I/var/lib/jenkins/cpp-build/caffe2/build/third_party/foxi -I/var/lib/jenkins/workspace/third_party/ideep/mkl-dnn/include -I/var/lib/jenkins/workspace/third_party/ideep/include -I/var/lib/jenkins/workspace/caffe2/../torch/csrc/api 
-I/var/lib/jenkins/workspace/caffe2/../torch/csrc/api/include -I/var/lib/jenkins/workspace/c10/../ -I/var/lib/jenkins/cpp-build/caffe2/build/third_party/ideep/mkl-dnn/include -I/var/lib/jenkins/workspace/third_party/ideep/mkl-dnn/src/../include -I/var/lib/jenkins/workspace/c10/cuda/../..
Nov 01 18:14:21 In file included from /usr/include/x86_64-linux-gnu/c++/5/bits/c++allocator.h:33:0,
Nov 01 18:14:21                  from /usr/include/c++/5/bits/allocator.h:46,
Nov 01 18:14:21                  from /usr/include/c++/5/string:41,
Nov 01 18:14:21                  from /usr/include/c++/5/stdexcept:39,
Nov 01 18:14:21                  from /usr/include/c++/5/array:38,
Nov 01 18:14:21                  from /usr/include/c++/5/tuple:39,
Nov 01 18:14:21                  from /usr/include/c++/5/mutex:38,
Nov 01 18:14:21                  from /usr/include/c++/5/condition_variable:39,
Nov 01 18:14:21                  from /var/lib/jenkins/workspace/torch/lib/c10d/../c10d/ProcessGroupMPI.hpp:3,
Nov 01 18:14:21                  from /var/lib/jenkins/workspace/torch/lib/c10d/ProcessGroupMPI.cpp:1:
Nov 01 18:14:21 /usr/include/c++/5/ext/new_allocator.h: In instantiation of 'void __gnu_cxx::new_allocator<_Tp>::construct(_Up*, _Args&& ...) [with _Up = c10d::ProcessGroupMPI; _Args = {int&, int&, ompi_communicator_t*&}; _Tp = c10d::ProcessGroupMPI]':
Nov 01 18:14:21 /usr/include/c++/5/bits/alloc_traits.h:530:4:   required from 'static void std::allocator_traits<std::allocator<_CharT> >::construct(std::allocator_traits<std::allocator<_CharT> >::allocator_type&, _Up*, _Args&& ...) [with _Up = c10d::ProcessGroupMPI; _Args = {int&, int&, ompi_communicator_t*&}; _Tp = c10d::ProcessGroupMPI; std::allocator_traits<std::allocator<_CharT> >::allocator_type = std::allocator<c10d::ProcessGroupMPI>]'
Nov 01 18:14:21 /usr/include/c++/5/bits/shared_ptr_base.h:522:39:   required from 'std::_Sp_counted_ptr_inplace<_Tp, _Alloc, _Lp>::_Sp_counted_ptr_inplace(_Alloc, _Args&& ...) [with _Args = {int&, int&, ompi_communicator_t*&}; _Tp = c10d::ProcessGroupMPI; _Alloc = std::allocator<c10d::ProcessGroupMPI>; __gnu_cxx::_Lock_policy _Lp = (__gnu_cxx::_Lock_policy)2u]'
Nov 01 18:14:21 /usr/include/c++/5/bits/shared_ptr_base.h:617:4:   required from 'std::__shared_count<_Lp>::__shared_count(std::_Sp_make_shared_tag, _Tp*, const _Alloc&, _Args&& ...) [with _Tp = c10d::ProcessGroupMPI; _Alloc = std::allocator<c10d::ProcessGroupMPI>; _Args = {int&, int&, ompi_communicator_t*&}; __gnu_cxx::_Lock_policy _Lp = (__gnu_cxx::_Lock_policy)2u]'
Nov 01 18:14:21 /usr/include/c++/5/bits/shared_ptr_base.h:1097:35:   required from 'std::__shared_ptr<_Tp, _Lp>::__shared_ptr(std::_Sp_make_shared_tag, const _Alloc&, _Args&& ...) [with _Alloc = std::allocator<c10d::ProcessGroupMPI>; _Args = {int&, int&, ompi_communicator_t*&}; _Tp = c10d::ProcessGroupMPI; __gnu_cxx::_Lock_policy _Lp = (__gnu_cxx::_Lock_policy)2u]'
Nov 01 18:14:21 /usr/include/c++/5/bits/shared_ptr.h:319:64:   required from 'std::shared_ptr<_Tp>::shared_ptr(std::_Sp_make_shared_tag, const _Alloc&, _Args&& ...) [with _Alloc = std::allocator<c10d::ProcessGroupMPI>; _Args = {int&, int&, ompi_communicator_t*&}; _Tp = c10d::ProcessGroupMPI]'
Nov 01 18:14:21 /usr/include/c++/5/bits/shared_ptr.h:620:39:   required from 'std::shared_ptr<_Tp1> std::allocate_shared(const _Alloc&, _Args&& ...) [with _Tp = c10d::ProcessGroupMPI; _Alloc = std::allocator<c10d::ProcessGroupMPI>; _Args = {int&, int&, ompi_communicator_t*&}]'
Nov 01 18:14:21 /usr/include/c++/5/bits/shared_ptr.h:635:39:   required from 'std::shared_ptr<_Tp1> std::make_shared(_Args&& ...) [with _Tp = c10d::ProcessGroupMPI; _Args = {int&, int&, ompi_communicator_t*&}]'
Nov 01 18:14:21 /var/lib/jenkins/workspace/torch/lib/c10d/ProcessGroupMPI.cpp:237:65:   required from here
Nov 01 18:14:21 /usr/include/c++/5/ext/new_allocator.h:120:4: error: invalid new-expression of abstract class type 'c10d::ProcessGroupMPI'
Nov 01 18:14:21   { ::new((void *)__p) _Up(std::forward<_Args>(__args)...); }
Nov 01 18:14:21     ^
Nov 01 18:14:21 In file included from /var/lib/jenkins/workspace/torch/lib/c10d/ProcessGroupMPI.cpp:1:0:
Nov 01 18:14:21 /var/lib/jenkins/workspace/torch/lib/c10d/../c10d/ProcessGroupMPI.hpp:73:7: note:   because the following virtual functions are pure within 'c10d::ProcessGroupMPI':
Nov 01 18:14:21  class ProcessGroupMPI : public ProcessGroup {
Nov 01 18:14:21        ^
Nov 01 18:14:21 In file included from /var/lib/jenkins/workspace/torch/lib/c10d/../c10d/ProcessGroupMPI.hpp:11:0,
Nov 01 18:14:21                  from /var/lib/jenkins/workspace/torch/lib/c10d/ProcessGroupMPI.cpp:1:
Nov 01 18:14:21 /var/lib/jenkins/workspace/torch/lib/c10d/../c10d/ProcessGroup.hpp:127:47: note: virtual std::shared_ptr<c10d::ProcessGroup::Work> c10d::ProcessGroup::allgather_coalesced(std::vector<std::vector<at::Tensor> >&, std::vector<at::Tensor>&, const c10d::AllgatherOptions&)
Nov 01 18:14:21    virtual std::shared_ptr<ProcessGroup::Work> allgather_coalesced(
Nov 01 18:14:21                                                ^
Nov 01 18:14:21 caffe2/lib_c10d/CMakeFiles/c10d.dir/build.make:278: recipe for target 'caffe2/lib_c10d/CMakeFiles/c10d.dir/ProcessGroupMPI.cpp.o' failed
Nov 01 18:14:21 make[2]: Leaving directory '/var/lib/jenkins/cpp-build/caffe2/build'
Nov 01 18:14:21 make[2]: *** [caffe2/lib_c10d/CMakeFiles/c10d.dir/ProcessGroupMPI.cpp.o] Error 1
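In short: the PR declared allgather_coalesced as pure virtual on the ProcessGroup base class but only ProcessGroupGloo overrode it, so ProcessGroupMPI became abstract and the make_shared<ProcessGroupMPI> call at ProcessGroupMPI.cpp:237 no longer compiles. One way out, sketched below as an assumption rather than the actual fix, is a non-pure default that throws for backends without a coalesced variant:

```cpp
// Sketch under assumptions -- not necessarily what the resubmitted diff
// did. A throwing default keeps derived classes such as
// ProcessGroupMPI concrete even when they don't implement the
// coalesced variant.
class ProcessGroup {
 public:
  // ... existing collectives ...
  virtual std::shared_ptr<Work> allgather_coalesced(
      std::vector<std::vector<at::Tensor>>& outputTensorLists,
      std::vector<at::Tensor>& inputTensors,
      const AllgatherOptions& opts = AllgatherOptions()) {
    throw std::runtime_error(
        "allgather_coalesced is not supported by this process group");
  }
};
```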

@ezyang
Contributor

ezyang commented Nov 1, 2019

@facebook-github-bot
Contributor

This pull request has been merged in 22a346e.

@rohan-varma
Contributor

@rohan-varma
Contributor

Sorry, spoke too soon. It has been reverted by 9041e29.

agolynski added a commit to agolynski/pytorch that referenced this pull request Nov 4, 2019
Summary:
Pull Request resolved: pytorch#29059
This is a resubmit of reverted diff D18209289 ( PR pytorch#28857 ).

Test Plan:
buck test caffe2/test:c10d
buck test caffe2/test:distributed_gloo

Reviewed By: pietern

Differential Revision: D18277097

fbshipit-source-id: 3e16c4c5f71e5c051ffef280e021bd253caf127c
facebook-github-bot pushed a commit that referenced this pull request Nov 4, 2019
Summary:
Pull Request resolved: #29059
This is a resubmit of reverted diff D18209289 ( PR #28857 ).

Test Plan:
buck test caffe2/test:c10d
buck test caffe2/test:distributed_gloo

Reviewed By: pietern

Differential Revision: D18277097

fbshipit-source-id: aecfd7206d70829f0cac66182bf02fccee410fed