
Conversation

@agolynski
Contributor

Test Plan:
buck test caffe2/test:c10d
buck test caffe2/test:distributed_gloo

Differential Revision: D18209289

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D18209289

@pietern pietern left a comment
Contributor

In torch/csrc/distributed/c10d/comm.cpp, we have implemented broadcast_coalesced as a function that takes a process group and a vector of tensors. I think we should do the same for allgather_coalesced, since it can be built on top of a regular allgather, regardless of the process group implementation. In the current approach, we'd need to duplicate the functionality between the process groups.
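For concreteness, here is a minimal sketch of the layering pietern describes, assuming the c10d allgather signature of this era; the function name, the flatten/split strategy, and the assumption that output_lists holds one list per rank are all illustrative, not the actual comm.cpp implementation:

```cpp
#include <c10d/ProcessGroup.hpp>
#include <torch/torch.h>

#include <stdexcept>
#include <vector>

// Hypothetical sketch only -- not the real comm.cpp code. It layers a
// coalesced allgather on top of the plain ProcessGroup::allgather, so
// any backend that implements allgather (Gloo, NCCL, MPI) supports it.
void allgather_coalesced_sketch(
    c10d::ProcessGroup& pg,
    std::vector<std::vector<at::Tensor>>& output_lists, // assumed: one list per rank
    std::vector<at::Tensor>& inputs) {
  if (inputs.empty() || output_lists.empty()) {
    throw std::invalid_argument("tensor lists must be non-empty");
  }

  // Coalesce: flatten all inputs (assumed same dtype/device) into one
  // 1-D buffer so that a single allgather moves every tensor at once.
  std::vector<at::Tensor> flat;
  for (const auto& t : inputs) {
    flat.push_back(t.reshape({-1}));
  }
  std::vector<at::Tensor> flat_input{at::cat(flat)};

  // One flat receive buffer per rank in the group.
  std::vector<std::vector<at::Tensor>> flat_outputs(1);
  for (size_t i = 0; i < output_lists.size(); ++i) {
    flat_outputs[0].push_back(at::empty_like(flat_input[0]));
  }

  // A single regular allgather; no backend-specific code required.
  pg.allgather(flat_outputs, flat_input)->wait();

  // De-coalesce: slice each rank's flat buffer back into the
  // individual output tensors.
  for (size_t rank = 0; rank < output_lists.size(); ++rank) {
    int64_t offset = 0;
    for (auto& out : output_lists[rank]) {
      out.copy_(flat_outputs[0][rank]
                    .narrow(0, offset, out.numel())
                    .view(out.sizes()));
      offset += out.numel();
    }
  }
}
```

Because the sketch only calls ProcessGroup::allgather, every backend that implements the plain collective would inherit the coalesced variant without per-process-group duplication.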

@mrshenli
Contributor

> I think we should do the same for allgather_coalesced, since it can be built on top of a regular allgather, regardless of the process group implementation.

@pietern just to make sure we understand your comment: are you suggesting moving the allgather_coalesced implementation in this PR to comm.cpp? In that case, 1) do we still expose the ProcessGroup::allgather_coalesced API, or does the API live only in torch.distributed? 2) I would assume we should do the same for allreduce_coalesced too?

@pietern
Contributor

pietern commented Oct 30, 2019

Yes, I think it should live as a separate function. I just checked, and there is indeed a coalesced variant declared in the ProcessGroup base class, but it is only defined for the Gloo implementation. I think the coalesced variants are generic enough not to need multiple implementations, and that they can be built on top of the collectives we expose today. Then both CPU and GPU support, and both Gloo and NCCL support, come for free.

@agolynski
Contributor Author

> Yes, I think it should live as a separate function. I just checked, and there is indeed a coalesced variant declared in the ProcessGroup base class, but it is only defined for the Gloo implementation. I think the coalesced variants are generic enough not to need multiple implementations, and that they can be built on top of the collectives we expose today. Then both CPU and GPU support, and both Gloo and NCCL support, come for free.

Agreed. As discussed offline, let's do it in a follow-up PR?

@mrshenli mrshenli left a comment
Contributor

LGTM. I added some cosmetic comments. Based on the offline discussion, we will first land this PR. Then, @zappa- will work on a follow-up PR to move this implementation to comm.cpp.

Thanks a lot for working on this! Please get another stamp from @pietern before landing.

Contributor

any reason for not using assertNonEmpty?

Contributor

Oh I see, you are trying to stay consistent with ProcessGroupGloo::allgather.

Contributor

These checks are already done in ProcessGroupGloo::allgather_coalesced; any reason for doing them again here?

Contributor Author

This is because the linter complains about possible out-of-bounds access on element 0 of these vectors.
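For readers following along, the guard under discussion might look like the sketch below (variable names are hypothetical); it duplicates a check that ProcessGroupGloo::allgather_coalesced already performs, solely so a static analyzer can prove the element-0 access is in bounds:

```cpp
// Illustrative sketch only: re-checking non-emptiness before indexing
// element 0, so the linter cannot flag a potential out-of-bounds read.
if (inputs.empty()) {
  throw std::invalid_argument("requires a non-empty list of input tensors");
}
const auto& first = inputs[0]; // safe: emptiness ruled out above
```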

@pietern pietern left a comment
Contributor

Let's merge this.

@zappa- Can you create issues for the follow-up work: one to move this functionality into comm.cpp, and one to move allreduce_coalesced into comm.cpp?

@kostmo
Member

kostmo commented Oct 31, 2019

CircleCI build failures summary

As of commit 15b2fc8:

  • 0/2 flaky

This comment was automatically generated by Dr. CI.

@agolynski
Contributor Author

Added issue #29033 to move allgather_coalesced to comm.h. Unfortunately, I cannot assign it since I don't have write permissions to the repo.

@agolynski
Contributor Author

Thank you for the reviews! Shipping it...

Summary: Pull Request resolved: pytorch#28857

Test Plan:
buck test caffe2/test:c10d
buck test caffe2/test:distributed_gloo

Reviewed By: mrshenli

Differential Revision: D18209289

fbshipit-source-id: f5f394635e60a9d4ce016e366415d90f1daa4423

@ezyang
Contributor

ezyang commented Nov 1, 2019

This breaks CUDA builds.

Nov 01 18:14:20 /opt/cache/lib/nvcc /var/lib/jenkins/workspace/modules/detectron/spatial_narrow_as_op.cu -c -o /var/lib/jenkins/cpp-build/caffe2/build/modules/detectron/CMakeFiles/caffe2_detectron_ops_gpu.dir//./caffe2_detectron_ops_gpu_generated_spatial_narrow_as_op.cu.o -ccbin /opt/cache/bin/cc -m64 -Dcaffe2_detectron_ops_gpu_EXPORTS -DTH_BLAS_MKL -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DMAGMA_V2 -DIDEEP_USE_MKL -DHAVE_MMAP=1 -D_FILE_OFFSET_BITS=64 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DHAVE_MALLOC_USABLE_SIZE=1 -D_THP_CORE -DAT_PARALLEL_OPENMP=1 -Xcompiler ,\"-Wno-deprecated\",\"-fvisibility-inlines-hidden\",\"-fopenmp\",\"-DUSE_FBGEMM\",\"-DUSE_QNNPACK\",\"-DUSE_PYTORCH_QNNPACK\",\"-O2\",\"-fPIC\",\"-Wno-narrowing\",\"-Wall\",\"-Wextra\",\"-Wno-missing-field-initializers\",\"-Wno-type-limits\",\"-Wno-array-bounds\",\"-Wno-unknown-pragmas\",\"-Wno-sign-compare\",\"-Wno-unused-parameter\",\"-Wno-unused-variable\",\"-Wno-unused-function\",\"-Wno-unused-result\",\"-Wno-strict-overflow\",\"-Wno-strict-aliasing\",\"-Wno-error=deprecated-declarations\",\"-Wno-error=pedantic\",\"-Wno-error=redundant-decls\",\"-Wno-error=old-style-cast\",\"-fdiagnostics-color=always\",\"-Wno-unused-but-set-variable\",\"-Wno-maybe-uninitialized\",\"-fno-math-errno\",\"-fno-trapping-math\",\"-fPIC\",\"-g\",\"-fno-omit-frame-pointer\",\"-O0\" -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_52,code=sm_52 -Xcudafe --diag_suppress=cc_clobber_ignored -Xcudafe --diag_suppress=integer_sign_change -Xcudafe --diag_suppress=useless_using_declaration -Xcudafe --diag_suppress=set_but_not_used -std=c++11 -Xcompiler -fPIC --expt-relaxed-constexpr --expt-extended-lambda -Wno-deprecated-gpu-targets --expt-extended-lambda -Xfatbin -compress-all -gencode arch=compute_52,code=sm_52 -Xcompiler -fPIC -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -DNVCC -I/usr/local/cuda/include -I/var/lib/jenkins/cpp-build/caffe2/build/aten/src -I/var/lib/jenkins/workspace/aten/src -I/var/lib/jenkins/cpp-build/caffe2/build -I/var/lib/jenkins/workspace -I/var/lib/jenkins/cpp-build/caffe2/build/third_party/gloo -I/var/lib/jenkins/workspace/cmake/../third_party/gloo -I/var/lib/jenkins/workspace/cmake/../third_party/googletest/googlemock/include -I/var/lib/jenkins/workspace/cmake/../third_party/googletest/googletest/include -I/var/lib/jenkins/workspace/third_party/protobuf/src -I/opt/conda/include -I/var/lib/jenkins/workspace/third_party/gemmlowp -I/var/lib/jenkins/workspace/third_party/neon2sse -I/var/lib/jenkins/workspace/cmake/../third_party/benchmark/include -I/var/lib/jenkins/workspace/third_party -I/usr/include -I/var/lib/jenkins/workspace/cmake/../third_party/eigen -I/var/lib/jenkins/workspace/cmake/../third_party/pybind11/include -I/usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent -I/usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent/include -I/usr/lib/openmpi/include -I/usr/lib/openmpi/include/openmpi -I/opt/rocm/hip/include -I/include -I/var/lib/jenkins/workspace/cmake/../third_party/cub -I/var/lib/jenkins/cpp-build/caffe2/build/caffe2/contrib/aten -I/var/lib/jenkins/workspace/third_party/onnx -I/var/lib/jenkins/cpp-build/caffe2/build/third_party/onnx -I/var/lib/jenkins/workspace/third_party/foxi -I/var/lib/jenkins/cpp-build/caffe2/build/third_party/foxi -I/var/lib/jenkins/workspace/third_party/ideep/mkl-dnn/include -I/var/lib/jenkins/workspace/third_party/ideep/include -I/var/lib/jenkins/workspace/caffe2/../torch/csrc/api 
-I/var/lib/jenkins/workspace/caffe2/../torch/csrc/api/include -I/var/lib/jenkins/workspace/c10/../ -I/var/lib/jenkins/cpp-build/caffe2/build/third_party/ideep/mkl-dnn/include -I/var/lib/jenkins/workspace/third_party/ideep/mkl-dnn/src/../include -I/var/lib/jenkins/workspace/c10/cuda/../..
Nov 01 18:14:21 In file included from /usr/include/x86_64-linux-gnu/c++/5/bits/c++allocator.h:33:0,
Nov 01 18:14:21                  from /usr/include/c++/5/bits/allocator.h:46,
Nov 01 18:14:21                  from /usr/include/c++/5/string:41,
Nov 01 18:14:21                  from /usr/include/c++/5/stdexcept:39,
Nov 01 18:14:21                  from /usr/include/c++/5/array:38,
Nov 01 18:14:21                  from /usr/include/c++/5/tuple:39,
Nov 01 18:14:21                  from /usr/include/c++/5/mutex:38,
Nov 01 18:14:21                  from /usr/include/c++/5/condition_variable:39,
Nov 01 18:14:21                  from /var/lib/jenkins/workspace/torch/lib/c10d/../c10d/ProcessGroupMPI.hpp:3,
Nov 01 18:14:21                  from /var/lib/jenkins/workspace/torch/lib/c10d/ProcessGroupMPI.cpp:1:
Nov 01 18:14:21 /usr/include/c++/5/ext/new_allocator.h: In instantiation of 'void __gnu_cxx::new_allocator<_Tp>::construct(_Up*, _Args&& ...) [with _Up = c10d::ProcessGroupMPI; _Args = {int&, int&, ompi_communicator_t*&}; _Tp = c10d::ProcessGroupMPI]':
Nov 01 18:14:21 /usr/include/c++/5/bits/alloc_traits.h:530:4:   required from 'static void std::allocator_traits<std::allocator<_CharT> >::construct(std::allocator_traits<std::allocator<_CharT> >::allocator_type&, _Up*, _Args&& ...) [with _Up = c10d::ProcessGroupMPI; _Args = {int&, int&, ompi_communicator_t*&}; _Tp = c10d::ProcessGroupMPI; std::allocator_traits<std::allocator<_CharT> >::allocator_type = std::allocator<c10d::ProcessGroupMPI>]'
Nov 01 18:14:21 /usr/include/c++/5/bits/shared_ptr_base.h:522:39:   required from 'std::_Sp_counted_ptr_inplace<_Tp, _Alloc, _Lp>::_Sp_counted_ptr_inplace(_Alloc, _Args&& ...) [with _Args = {int&, int&, ompi_communicator_t*&}; _Tp = c10d::ProcessGroupMPI; _Alloc = std::allocator<c10d::ProcessGroupMPI>; __gnu_cxx::_Lock_policy _Lp = (__gnu_cxx::_Lock_policy)2u]'
Nov 01 18:14:21 /usr/include/c++/5/bits/shared_ptr_base.h:617:4:   required from 'std::__shared_count<_Lp>::__shared_count(std::_Sp_make_shared_tag, _Tp*, const _Alloc&, _Args&& ...) [with _Tp = c10d::ProcessGroupMPI; _Alloc = std::allocator<c10d::ProcessGroupMPI>; _Args = {int&, int&, ompi_communicator_t*&}; __gnu_cxx::_Lock_policy _Lp = (__gnu_cxx::_Lock_policy)2u]'
Nov 01 18:14:21 /usr/include/c++/5/bits/shared_ptr_base.h:1097:35:   required from 'std::__shared_ptr<_Tp, _Lp>::__shared_ptr(std::_Sp_make_shared_tag, const _Alloc&, _Args&& ...) [with _Alloc = std::allocator<c10d::ProcessGroupMPI>; _Args = {int&, int&, ompi_communicator_t*&}; _Tp = c10d::ProcessGroupMPI; __gnu_cxx::_Lock_policy _Lp = (__gnu_cxx::_Lock_policy)2u]'
Nov 01 18:14:21 /usr/include/c++/5/bits/shared_ptr.h:319:64:   required from 'std::shared_ptr<_Tp>::shared_ptr(std::_Sp_make_shared_tag, const _Alloc&, _Args&& ...) [with _Alloc = std::allocator<c10d::ProcessGroupMPI>; _Args = {int&, int&, ompi_communicator_t*&}; _Tp = c10d::ProcessGroupMPI]'
Nov 01 18:14:21 /usr/include/c++/5/bits/shared_ptr.h:620:39:   required from 'std::shared_ptr<_Tp1> std::allocate_shared(const _Alloc&, _Args&& ...) [with _Tp = c10d::ProcessGroupMPI; _Alloc = std::allocator<c10d::ProcessGroupMPI>; _Args = {int&, int&, ompi_communicator_t*&}]'
Nov 01 18:14:21 /usr/include/c++/5/bits/shared_ptr.h:635:39:   required from 'std::shared_ptr<_Tp1> std::make_shared(_Args&& ...) [with _Tp = c10d::ProcessGroupMPI; _Args = {int&, int&, ompi_communicator_t*&}]'
Nov 01 18:14:21 /var/lib/jenkins/workspace/torch/lib/c10d/ProcessGroupMPI.cpp:237:65:   required from here
Nov 01 18:14:21 /usr/include/c++/5/ext/new_allocator.h:120:4: error: invalid new-expression of abstract class type 'c10d::ProcessGroupMPI'
Nov 01 18:14:21   { ::new((void *)__p) _Up(std::forward<_Args>(__args)...); }
Nov 01 18:14:21     ^
Nov 01 18:14:21 In file included from /var/lib/jenkins/workspace/torch/lib/c10d/ProcessGroupMPI.cpp:1:0:
Nov 01 18:14:21 /var/lib/jenkins/workspace/torch/lib/c10d/../c10d/ProcessGroupMPI.hpp:73:7: note:   because the following virtual functions are pure within 'c10d::ProcessGroupMPI':
Nov 01 18:14:21  class ProcessGroupMPI : public ProcessGroup {
Nov 01 18:14:21        ^
Nov 01 18:14:21 In file included from /var/lib/jenkins/workspace/torch/lib/c10d/../c10d/ProcessGroupMPI.hpp:11:0,
Nov 01 18:14:21                  from /var/lib/jenkins/workspace/torch/lib/c10d/ProcessGroupMPI.cpp:1:
Nov 01 18:14:21 /var/lib/jenkins/workspace/torch/lib/c10d/../c10d/ProcessGroup.hpp:127:47: note: virtual std::shared_ptr<c10d::ProcessGroup::Work> c10d::ProcessGroup::allgather_coalesced(std::vector<std::vector<at::Tensor> >&, std::vector<at::Tensor>&, const c10d::AllgatherOptions&)
Nov 01 18:14:21    virtual std::shared_ptr<ProcessGroup::Work> allgather_coalesced(
Nov 01 18:14:21                                                ^
Nov 01 18:14:21 caffe2/lib_c10d/CMakeFiles/c10d.dir/build.make:278: recipe for target 'caffe2/lib_c10d/CMakeFiles/c10d.dir/ProcessGroupMPI.cpp.o' failed
Nov 01 18:14:21 make[2]: Leaving directory '/var/lib/jenkins/cpp-build/caffe2/build'
Nov 01 18:14:21 make[2]: *** [caffe2/lib_c10d/CMakeFiles/c10d.dir/ProcessGroupMPI.cpp.o] Error 1
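In short: the PR declared allgather_coalesced as pure virtual on the ProcessGroup base class but only ProcessGroupGloo overrode it, so ProcessGroupMPI became abstract and the make_shared<ProcessGroupMPI> call at ProcessGroupMPI.cpp:237 no longer compiles. One way out, sketched below as an assumption rather than the actual fix, is a non-pure default that throws for backends without a coalesced variant:

```cpp
// Sketch under assumptions -- not necessarily what the resubmitted diff
// did. A throwing default keeps derived classes such as
// ProcessGroupMPI concrete even when they don't implement the
// coalesced variant.
class ProcessGroup {
 public:
  // ... existing collectives ...
  virtual std::shared_ptr<Work> allgather_coalesced(
      std::vector<std::vector<at::Tensor>>& outputTensorLists,
      std::vector<at::Tensor>& inputTensors,
      const AllgatherOptions& opts = AllgatherOptions()) {
    throw std::runtime_error(
        "allgather_coalesced is not supported by this process group");
  }
};
```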

@ezyang
Contributor

ezyang commented Nov 1, 2019

@facebook-github-bot
Contributor

This pull request has been merged in 22a346e.

@rohan-varma
Contributor

@rohan-varma
Contributor

Sorry, spoke too soon. It has been reverted by 9041e29.

agolynski added a commit to agolynski/pytorch that referenced this pull request Nov 4, 2019
Summary:
Pull Request resolved: pytorch#29059
This is a resubmit of reverted diff D18209289 ( PR pytorch#28857 ).

Test Plan:
buck test caffe2/test:c10d
buck test caffe2/test:distributed_gloo

Reviewed By: pietern

Differential Revision: D18277097

fbshipit-source-id: 3e16c4c5f71e5c051ffef280e021bd253caf127c
facebook-github-bot pushed a commit that referenced this pull request Nov 4, 2019
Summary:
Pull Request resolved: #29059
This is a resubmit of reverted diff D18209289 ( PR #28857 ).

Test Plan:
buck test caffe2/test:c10d
buck test caffe2/test:distributed_gloo

Reviewed By: pietern

Differential Revision: D18277097

fbshipit-source-id: aecfd7206d70829f0cac66182bf02fccee410fed