Skip to content

Conversation

@xush6528
Copy link
Contributor

@xush6528 xush6528 commented Nov 23, 2019

Stack from ghstack:

  • Postpone acquiring lock.
  • Release lock earlier.
  • Notify conf var after releasing lock.
  • Reduce unnecessary scan on future and future timeout maps.
  • Avoid repeatedly wake up if user set timeout = 0.

Differential Revision: D5516149

- Make processTimedOutFutures hold lock
- Reduce unnecessary scan on future and future timeout maps.
- Reduce the scope of lock at a spot

Differential Revision: [D5516149](https://our.internmc.facebook.com/intern/diff/D5516149/)

[ghstack-poisoned]
- Make processTimedOutFutures hold lock.
- Reduce unnecessary scan on future and future timeout maps.
- Reduce the scope of lock at a spot.
- Avoid repeatedly wake up if user set timeout = 0.

Differential Revision: [D5516149](https://our.internmc.facebook.com/intern/diff/D5516149/)

[ghstack-poisoned]
@xush6528 xush6528 changed the title Tweak pollTimedOutRPCs thread syncrinization Tweak pollTimedOutRPCs thread synchronization Nov 23, 2019
- Make processTimedOutFutures hold lock.
- Reduce unnecessary scan on future and future timeout maps.
- Reduce the scope of lock at a spot.
- Avoid repeatedly wake up if user set timeout = 0.

Differential Revision: [D5516149](https://our.internmc.facebook.com/intern/diff/D5516149/)

[ghstack-poisoned]
- Make processTimedOutFutures hold lock.
- Reduce unnecessary scan on future and future timeout maps.
- Reduce the scope of lock at a spot.
- Avoid repeatedly wake up if user set timeout = 0.

Differential Revision: [D5516149](https://our.internmc.facebook.com/intern/diff/D5516149/)

[ghstack-poisoned]
xush6528 added a commit that referenced this pull request Nov 23, 2019
Pull Request resolved: #30355

- Make processTimedOutFutures hold lock.
- Reduce unnecessary scan on future and future timeout maps.
- Reduce the scope of lock at a spot.
- Avoid repeatedly wake up if user set timeout = 0.

ghstack-source-id: 94476372

Differential Revision: [D5516149](https://our.internmc.facebook.com/intern/diff/D5516149/)
- Make processTimedOutFutures hold lock.
- Reduce unnecessary scan on future and future timeout maps.
- Reduce the scope of lock at a spot.
- Avoid repeatedly wake up if user set timeout = 0.

Differential Revision: [D5516149](https://our.internmc.facebook.com/intern/diff/D5516149/)

[ghstack-poisoned]
- Make processTimedOutFutures hold lock.
- Reduce unnecessary scan on future and future timeout maps.
- Reduce the scope of lock at a spot.
- Avoid repeatedly wake up if user set timeout = 0.

Differential Revision: [D5516149](https://our.internmc.facebook.com/intern/diff/D5516149/)

[ghstack-poisoned]
xush6528 added a commit that referenced this pull request Nov 24, 2019
Pull Request resolved: #30355

- Make processTimedOutFutures hold lock.
- Reduce unnecessary scan on future and future timeout maps.
- Reduce the scope of lock at a spot.
- Avoid repeatedly wake up if user set timeout = 0.

ghstack-source-id: 94487097

Differential Revision: [D5516149](https://our.internmc.facebook.com/intern/diff/D5516149/)
@xush6528 xush6528 added the module: rpc Related to RPC, distributed autograd, RRef, and distributed optimizer label Dec 3, 2019
Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM. Do we know why we hit failures in MacOS test?

- Make processTimedOutFutures hold lock.
- Reduce unnecessary scan on future and future timeout maps.
- Reduce the scope of lock at a spot.
- Avoid repeatedly wake up if user set timeout = 0.

Differential Revision: [D5516149](https://our.internmc.facebook.com/intern/diff/D5516149/)

[ghstack-poisoned]
- Make processTimedOutFutures hold lock.
- Reduce unnecessary scan on future and future timeout maps.
- Reduce the scope of lock at a spot.
- Avoid repeatedly wake up if user set timeout = 0.

Differential Revision: [D5516149](https://our.internmc.facebook.com/intern/diff/D5516149/)

[ghstack-poisoned]
@xush6528 xush6528 changed the title Tweak pollTimedOutRPCs thread synchronization [WIP] Tweak pollTimedOutRPCs thread synchronization Dec 10, 2019
- Make processTimedOutFutures hold lock.
- Reduce unnecessary scan on future and future timeout maps.
- Reduce the scope of lock at a spot.
- Avoid repeatedly wake up if user set timeout = 0.

Differential Revision: [D5516149](https://our.internmc.facebook.com/intern/diff/D5516149/)

[ghstack-poisoned]
xush6528 added a commit that referenced this pull request Dec 11, 2019
Pull Request resolved: #30355

- Make processTimedOutFutures hold lock.
- Reduce unnecessary scan on future and future timeout maps.
- Reduce the scope of lock at a spot.
- Avoid repeatedly wake up if user set timeout = 0.

ghstack-source-id: 95326061

Differential Revision: [D5516149](https://our.internmc.facebook.com/intern/diff/D5516149/)
@xush6528 xush6528 changed the title [WIP] Tweak pollTimedOutRPCs thread synchronization Tweak pollTimedOutRPCs thread synchronization Dec 11, 2019
- Make processTimedOutFutures hold lock.
- Reduce unnecessary scan on future and future timeout maps.
- Reduce the scope of lock at a spot.
- Avoid repeatedly wake up if user set timeout = 0.

Differential Revision: [D5516149](https://our.internmc.facebook.com/intern/diff/D5516149/)

[ghstack-poisoned]
- Make processTimedOutFutures hold lock.
- Reduce unnecessary scan on future and future timeout maps.
- Reduce the scope of lock at a spot.
- Avoid repeatedly wake up if user set timeout = 0.

Differential Revision: [D5516149](https://our.internmc.facebook.com/intern/diff/D5516149/)

[ghstack-poisoned]
Copy link

@jjlilley jjlilley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

- Make processTimedOutFutures hold lock.
- Reduce unnecessary scan on future and future timeout maps.
- Reduce the scope of lock at a spot.
- Avoid repeatedly wake up if user set timeout = 0.

Differential Revision: [D5516149](https://our.internmc.facebook.com/intern/diff/D5516149/)

[ghstack-poisoned]
@kostmo
Copy link
Member

kostmo commented Dec 12, 2019

CircleCI build failures summary

As of commit e29a7c8:

  • 7/7 failures introduced in this PR

Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

6 new failures recognized by patterns

The following build failures don't appear to be due to upstream breakage:

See CircleCI build pytorch_linux_xenial_py3_clang5_asan_build (1/6)

Step: "Build" (details)

Dec 12 00:32:23   for (int i = 1; i < tensors.size(); ++i) {
Dec 12 00:32:23                   ~ ^ ~~~~~~~~~~~~~~
Dec 12 00:32:23 1 warning generated.
Dec 12 00:32:23 [100%] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/process_group_agent.cpp.o
Dec 12 00:32:23 [100%] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/py_rref.cpp.o
Dec 12 00:32:23 [100%] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/python_functions.cpp.o
Dec 12 00:32:24 [100%] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/python_rpc_handler.cpp.o
Dec 12 00:32:24 [100%] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/request_callback_impl.cpp.o
Dec 12 00:32:24 [100%] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/rref_context.cpp.o
Dec 12 00:32:24 [100%] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/rref.cpp.o
Dec 12 00:32:31 /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/process_group_agent.cpp:32:24: error: non-static data member defined out-of-line
Dec 12 00:32:31     ProcessGroupAgent::kInfiniteTimeoutTimePoint =
Dec 12 00:32:31     ~~~~~~~~~~~~~~~~~~~^
Dec 12 00:32:31 1 error generated.
Dec 12 00:32:31 caffe2/torch/CMakeFiles/torch_python.dir/build.make:2678: recipe for target 'caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/process_group_agent.cpp.o' failed
Dec 12 00:32:31 make[2]: *** [caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/process_group_agent.cpp.o] Error 1
Dec 12 00:32:31 CMakeFiles/Makefile2:11458: recipe for target 'caffe2/torch/CMakeFiles/torch_python.dir/all' failed
Dec 12 00:32:31 make[1]: *** [caffe2/torch/CMakeFiles/torch_python.dir/all] Error 2
Dec 12 00:32:31 make: *** [all] Error 2
Dec 12 00:32:31 Makefile:138: recipe for target 'all' failed
Dec 12 00:32:31 -- Building version 1.4.0a0+e29a7c8

See CircleCI build pytorch_xla_linux_xenial_py3_6_clang7_build (2/6)

Step: "Build" (details)

Dec 12 00:33:21   for (int i = 1; i < tensors.size(); ++i) {
Dec 12 00:33:21                   ~ ^ ~~~~~~~~~~~~~~
Dec 12 00:33:21 1 warning generated.
Dec 12 00:33:21 [100%] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/process_group_agent.cpp.o
Dec 12 00:33:21 [100%] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/py_rref.cpp.o
Dec 12 00:33:22 [100%] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/python_functions.cpp.o
Dec 12 00:33:22 [100%] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/python_rpc_handler.cpp.o
Dec 12 00:33:22 [100%] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/request_callback_impl.cpp.o
Dec 12 00:33:22 [100%] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/rref_context.cpp.o
Dec 12 00:33:23 [100%] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/rref.cpp.o
Dec 12 00:33:29 /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/process_group_agent.cpp:32:24: error: non-static data member defined out-of-line
Dec 12 00:33:29     ProcessGroupAgent::kInfiniteTimeoutTimePoint =
Dec 12 00:33:29     ~~~~~~~~~~~~~~~~~~~^
Dec 12 00:33:29 1 error generated.
Dec 12 00:33:29 caffe2/torch/CMakeFiles/torch_python.dir/build.make:2678: recipe for target 'caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/process_group_agent.cpp.o' failed
Dec 12 00:33:29 make[2]: *** [caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/process_group_agent.cpp.o] Error 1
Dec 12 00:33:29 CMakeFiles/Makefile2:11690: recipe for target 'caffe2/torch/CMakeFiles/torch_python.dir/all' failed
Dec 12 00:33:29 make[1]: *** [caffe2/torch/CMakeFiles/torch_python.dir/all] Error 2
Dec 12 00:33:29 make: *** [all] Error 2
Dec 12 00:33:29 Makefile:138: recipe for target 'all' failed
Dec 12 00:33:29 -- Building version 1.4.0a0+e29a7c8

See CircleCI build pytorch_macos_10_13_py3_build (3/6)

Step: "Build" (details)

Dec 11 16:33:33 [2960/2964] Building CXX object modules/observers/CMakeFiles/caffe2_observers.dir/perf_observer.cc.o
Dec 11 16:33:33 ../modules/observers/perf_observer.cc:190:17: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
Dec 11 16:33:33   if (skipIters <= numRuns_ && sampleRate > 0 && rand() % sampleRate == 0) {
Dec 11 16:33:33       ~~~~~~~~~ ^  ~~~~~~~~
Dec 11 16:33:33 ../modules/observers/perf_observer.cc:232:27: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<caffe2::OperatorBase *, std::__1::allocator<caffe2::OperatorBase *> >::size_type' (aka 'unsigned long') [-Wsign-compare]
Dec 11 16:33:33     for (int idx = 0; idx < operators.size(); ++idx) {
Dec 11 16:33:33                       ~~~ ^ ~~~~~~~~~~~~~~~~
Dec 11 16:33:33 2 warnings generated.
Dec 11 16:33:33 [2961/2964] Linking CXX shared library lib/libcaffe2_observers.dylib
Dec 11 16:33:36 [2962/2964] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/process_group_agent.cpp.o
Dec 11 16:33:36 FAILED: caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/process_group_agent.cpp.o 
Dec 11 16:33:36 /Users/distiller/workspace/clang++  -DAT_PARALLEL_OPENMP=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DIDEEP_USE_MKL -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTHP_BUILD_MAIN_LIB -DTH_BLAS_MKL -DUSE_C10D -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_NUMPY -D_FILE_OFFSET_BITS=64 -Dtorch_python_EXPORTS -Iaten/src -I../aten/src -I. -I../ -I../cmake/../third_party/benchmark/include -Icaffe2/contrib/aten -I../third_party/onnx -Ithird_party/onnx -I../third_party/foxi -Ithird_party/foxi -I../torch/.. -I../torch/../aten/src -I../torch/../aten/src/TH -Icaffe2/aten/src -Ithird_party -I../torch/../third_party/gloo -I../torch/../third_party/onnx -I../torch/csrc -I../torch/csrc/api/include -I../torch/lib -I../torch/lib/libshm -I../caffe2/../torch/csrc/api -I../caffe2/../torch/csrc/api/include -I../c10/.. -Ithird_party/ideep/mkl-dnn/include -I../third_party/ideep/mkl-dnn/src/../include -I../torch/lib/libshm/../../../torch/lib -I../torch/lib/c10d/.. -I../torch/lib/c10d/../../.. -isystem third_party/gloo -isystem ../cmake/../third_party/gloo -isystem ../cmake/../third_party/googletest/googlemock/include -isystem ../cmake/../third_party/googletest/googletest/include -isystem ../third_party/protobuf/src -isystem /Users/distiller/workspace/miniconda3/include -isystem ../third_party/gemmlowp -isystem ../third_party/neon2sse -isystem ../third_party -isystem ../cmake/../third_party/eigen -isystem /Users/distiller/workspace/miniconda3/include/python3.7m -isystem /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/numpy/core/include -isystem ../cmake/../third_party/pybind11/include -isystem /opt/rocm/hip/include -isystem /include -isystem ../third_party/ideep/mkl-dnn/include -isystem ../third_party/ideep/include -Wno-deprecated -fvisibility-inlines-hidden -Wno-deprecated-declarations -Xpreprocessor -fopenmp -I/usr/local/include -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-typedef-redefinition -Wno-unknown-warning-option -Wno-unused-private-field -Wno-inconsistent-missing-override -Wno-aligned-allocation-unavailable -Wno-c++14-extensions -Wno-constexpr-not-const -Wno-missing-braces -Qunused-arguments -fcolor-diagnostics -faligned-new -fno-math-errno -fno-trapping-math -Wno-unused-private-field -Wno-missing-braces -Wno-c++14-extensions -Wno-constexpr-not-const -DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3  -isysroot /Applications/Xcode-9.4.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.13.sdk -mmacosx-version-min=10.9 -fPIC   -DCAFFE2_USE_GLOO -DHAVE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD -Wall -Wextra -Wno-unused-parameter -Wno-missing-field-initializers -Wno-write-strings -Wno-unknown-pragmas -Wno-missing-braces "-Xpreprocessor -fopenmp -I/usr/local/include" -std=gnu++14 -MD -MT caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/process_group_agent.cpp.o -MF caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/process_group_agent.cpp.o.d -o caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/process_group_agent.cpp.o -c ../torch/csrc/distributed/rpc/process_group_agent.cpp
Dec 11 16:33:36 ../torch/csrc/distributed/rpc/process_group_agent.cpp:32:24: error: non-static data member defined out-of-line
Dec 11 16:33:36     ProcessGroupAgent::kInfiniteTimeoutTimePoint =
Dec 11 16:33:36     ~~~~~~~~~~~~~~~~~~~^
Dec 11 16:33:36 1 error generated.
Dec 11 16:33:36 ninja: build stopped: subcommand failed.
Dec 11 16:33:36 Traceback (most recent call last):
Dec 11 16:33:36   File "setup.py", line 755, in <module>
Dec 11 16:33:36     build_deps()
Dec 11 16:33:36   File "setup.py", line 316, in build_deps

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_build (4/6)

Step: "Build" (details)

Dec 12 00:34:40                  from /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/message.h:3,
Dec 12 00:34:40                  from /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/future_message.h:3,
Dec 12 00:34:40                  from /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/python_functions.h:3,
Dec 12 00:34:40                  from /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/python_functions.cpp:1:
Dec 12 00:34:40 /var/lib/jenkins/workspace/c10/util/Exception.h:330:13: note: declared here
Dec 12 00:34:40  inline void deprecated_AT_CHECK() {}
Dec 12 00:34:40              ^
Dec 12 00:34:40 [100%] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/request_callback_impl.cpp.o
Dec 12 00:34:40 [100%] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/rref_context.cpp.o
Dec 12 00:34:40 [100%] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/rref.cpp.o
Dec 12 00:34:50 /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/process_group_agent.cpp:32:24: error: 'const steady_clock_time_point torch::distributed::rpc::ProcessGroupAgent::kInfiniteTimeoutTimePoint' is not a static data member of 'class torch::distributed::rpc::ProcessGroupAgent'
Dec 12 00:34:50      ProcessGroupAgent::kInfiniteTimeoutTimePoint =
Dec 12 00:34:50                         ^
Dec 12 00:34:50 caffe2/torch/CMakeFiles/torch_python.dir/build.make:2678: recipe for target 'caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/process_group_agent.cpp.o' failed
Dec 12 00:34:50 make[2]: *** [caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/process_group_agent.cpp.o] Error 1
Dec 12 00:34:50 CMakeFiles/Makefile2:11393: recipe for target 'caffe2/torch/CMakeFiles/torch_python.dir/all' failed
Dec 12 00:34:50 make[1]: *** [caffe2/torch/CMakeFiles/torch_python.dir/all] Error 2
Dec 12 00:34:50 Makefile:138: recipe for target 'all' failed
Dec 12 00:34:50 make: *** [all] Error 2
Dec 12 00:34:50 -- Building version 1.4.0a0+11bb6e5
Dec 12 00:34:50 cmake -DBUILD_ENVIRONMENT=pytorch-linux-xenial-py3.6-gcc5.4-build -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/var/lib/jenkins/workspace/torch -DCMAKE_PREFIX_PATH=/opt/conda/lib/python3.6/site-packages -DNUMPY_INCLUDE_DIR=/opt/conda/lib/python3.6/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/opt/conda/bin/python -DPYTHON_INCLUDE_DIR=/opt/conda/include/python3.6m -DPYTHON_LIBRARY=/opt/conda/lib/libpython3.6m.so.1.0 -DTORCH_BUILD_VERSION=1.4.0a0+11bb6e5 -DUSE_NUMPY=True -DWERROR=1 /var/lib/jenkins/workspace

See CircleCI build pytorch_macos_10_13_cuda10_0_cudnn7_py3_build (5/6)

Step: "Build" (details)

Dec 11 16:37:13 [2960/2964] Building CXX object modules/observers/CMakeFiles/caffe2_observers.dir/perf_observer.cc.o
Dec 11 16:37:13 ../modules/observers/perf_observer.cc:190:17: warning: comparison of integers of different signs: 'int' and 'unsigned int' [-Wsign-compare]
Dec 11 16:37:13   if (skipIters <= numRuns_ && sampleRate > 0 && rand() % sampleRate == 0) {
Dec 11 16:37:13       ~~~~~~~~~ ^  ~~~~~~~~
Dec 11 16:37:13 ../modules/observers/perf_observer.cc:232:27: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<caffe2::OperatorBase *, std::__1::allocator<caffe2::OperatorBase *> >::size_type' (aka 'unsigned long') [-Wsign-compare]
Dec 11 16:37:13     for (int idx = 0; idx < operators.size(); ++idx) {
Dec 11 16:37:13                       ~~~ ^ ~~~~~~~~~~~~~~~~
Dec 11 16:37:13 2 warnings generated.
Dec 11 16:37:13 [2961/2964] Linking CXX shared library lib/libcaffe2_observers.dylib
Dec 11 16:37:16 [2962/2964] Building CXX object caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/process_group_agent.cpp.o
Dec 11 16:37:16 FAILED: caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/process_group_agent.cpp.o 
Dec 11 16:37:16 /Users/distiller/workspace/clang++  -DAT_PARALLEL_OPENMP=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DIDEEP_USE_MKL -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTHP_BUILD_MAIN_LIB -DTH_BLAS_MKL -DUSE_C10D -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_NUMPY -D_FILE_OFFSET_BITS=64 -Dtorch_python_EXPORTS -Iaten/src -I../aten/src -I. -I../ -I../cmake/../third_party/benchmark/include -Icaffe2/contrib/aten -I../third_party/onnx -Ithird_party/onnx -I../third_party/foxi -Ithird_party/foxi -I../torch/.. -I../torch/../aten/src -I../torch/../aten/src/TH -Icaffe2/aten/src -Ithird_party -I../torch/../third_party/gloo -I../torch/../third_party/onnx -I../torch/csrc -I../torch/csrc/api/include -I../torch/lib -I../torch/lib/libshm -I../caffe2/../torch/csrc/api -I../caffe2/../torch/csrc/api/include -I../c10/.. -Ithird_party/ideep/mkl-dnn/include -I../third_party/ideep/mkl-dnn/src/../include -I../torch/lib/libshm/../../../torch/lib -I../torch/lib/c10d/.. -I../torch/lib/c10d/../../.. -isystem third_party/gloo -isystem ../cmake/../third_party/gloo -isystem ../cmake/../third_party/googletest/googlemock/include -isystem ../cmake/../third_party/googletest/googletest/include -isystem ../third_party/protobuf/src -isystem /Users/distiller/workspace/miniconda3/include -isystem ../third_party/gemmlowp -isystem ../third_party/neon2sse -isystem ../third_party -isystem ../cmake/../third_party/eigen -isystem /Users/distiller/workspace/miniconda3/include/python3.7m -isystem /Users/distiller/workspace/miniconda3/lib/python3.7/site-packages/numpy/core/include -isystem ../cmake/../third_party/pybind11/include -isystem /opt/rocm/hip/include -isystem /include -isystem ../third_party/ideep/mkl-dnn/include -isystem ../third_party/ideep/include -Wno-deprecated -fvisibility-inlines-hidden -Wno-deprecated-declarations -Xpreprocessor -fopenmp -I/usr/local/include -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-typedef-redefinition -Wno-unknown-warning-option -Wno-unused-private-field -Wno-inconsistent-missing-override -Wno-aligned-allocation-unavailable -Wno-c++14-extensions -Wno-constexpr-not-const -Wno-missing-braces -Qunused-arguments -fcolor-diagnostics -faligned-new -fno-math-errno -fno-trapping-math -Wno-unused-private-field -Wno-missing-braces -Wno-c++14-extensions -Wno-constexpr-not-const -DHAVE_AVX_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3  -isysroot /Applications/Xcode-9.4.1.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.13.sdk -mmacosx-version-min=10.9 -fPIC   -DCAFFE2_USE_GLOO -DCUDA_HAS_FP16=1 -DHAVE_GCC_GET_CPUID -DUSE_AVX -DUSE_AVX2 -DTH_HAVE_THREAD -Wall -Wextra -Wno-unused-parameter -Wno-missing-field-initializers -Wno-write-strings -Wno-unknown-pragmas -Wno-missing-braces "-Xpreprocessor -fopenmp -I/usr/local/include" -std=gnu++14 -MD -MT caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/process_group_agent.cpp.o -MF caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/process_group_agent.cpp.o.d -o caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/process_group_agent.cpp.o -c ../torch/csrc/distributed/rpc/process_group_agent.cpp
Dec 11 16:37:16 ../torch/csrc/distributed/rpc/process_group_agent.cpp:32:24: error: non-static data member defined out-of-line
Dec 11 16:37:16     ProcessGroupAgent::kInfiniteTimeoutTimePoint =
Dec 11 16:37:16     ~~~~~~~~~~~~~~~~~~~^
Dec 11 16:37:16 1 error generated.
Dec 11 16:37:16 ninja: build stopped: subcommand failed.
Dec 11 16:37:16 Traceback (most recent call last):
Dec 11 16:37:16   File "setup.py", line 755, in <module>
Dec 11 16:37:16     build_deps()
Dec 11 16:37:16   File "setup.py", line 316, in build_deps

See CircleCI build pytorch_linux_xenial_cuda9_cudnn7_py3_build (6/6)

Step: "Build" (details)

Dec 12 00:43:03                             ^
Dec 12 00:43:03 In file included from /var/lib/jenkins/workspace/aten/src/ATen/Tensor.h:11:0,
Dec 12 00:43:03                  from /var/lib/jenkins/workspace/aten/src/ATen/Context.h:4,
Dec 12 00:43:03                  from /var/lib/jenkins/workspace/aten/src/ATen/ATen.h:5,
Dec 12 00:43:03                  from /var/lib/jenkins/workspace/torch/lib/c10d/ProcessGroup.hpp:10,
Dec 12 00:43:03                  from /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ddp.h:3,
Dec 12 00:43:03                  from /var/lib/jenkins/workspace/torch/csrc/distributed/c10d/ddp.cpp:1:
Dec 12 00:43:03 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:240:30: note: declared here
Dec 12 00:43:03    DeprecatedTypeProperties & type() const {
Dec 12 00:43:03                               ^
Dec 12 00:43:09 /var/lib/jenkins/workspace/torch/csrc/distributed/rpc/process_group_agent.cpp:32:24: error: 'const steady_clock_time_point torch::distributed::rpc::ProcessGroupAgent::kInfiniteTimeoutTimePoint' is not a static data member of 'class torch::distributed::rpc::ProcessGroupAgent'
Dec 12 00:43:09      ProcessGroupAgent::kInfiniteTimeoutTimePoint =
Dec 12 00:43:09                         ^
Dec 12 00:43:09 make[2]: *** [caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/process_group_agent.cpp.o] Error 1
Dec 12 00:43:09 caffe2/torch/CMakeFiles/torch_python.dir/build.make:1570: recipe for target 'caffe2/torch/CMakeFiles/torch_python.dir/csrc/distributed/rpc/process_group_agent.cpp.o' failed
Dec 12 00:43:09 make[1]: *** [caffe2/torch/CMakeFiles/torch_python.dir/all] Error 2
Dec 12 00:43:09 CMakeFiles/Makefile2:13546: recipe for target 'caffe2/torch/CMakeFiles/torch_python.dir/all' failed
Dec 12 00:43:09 make: *** [all] Error 2
Dec 12 00:43:09 Makefile:140: recipe for target 'all' failed
Dec 12 00:43:09 -- Building version 1.4.0a0+e29a7c8
Dec 12 00:43:09 cmake -DBUILD_ENVIRONMENT=pytorch-linux-xenial-cuda9-cudnn7-py3-build -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/var/lib/jenkins/workspace/torch -DCMAKE_PREFIX_PATH=/opt/conda/lib/python3.6/site-packages -DCUDA_NVCC_EXECUTABLE=/opt/cache/lib/nvcc -DNUMPY_INCLUDE_DIR=/opt/conda/lib/python3.6/site-packages/numpy/core/include -DPYTHON_EXECUTABLE=/opt/conda/bin/python -DPYTHON_INCLUDE_DIR=/opt/conda/include/python3.6m -DPYTHON_LIBRARY=/opt/conda/lib/libpython3.6m.so.1.0 -DTORCH_BUILD_VERSION=1.4.0a0+e29a7c8 -DUSE_NUMPY=True -DWERROR=1 /var/lib/jenkins/workspace

1 failure not recognized by patterns:

Job Step
CircleCI caffe2_onnx_py3_6_clang7_ubuntu16_04_build Build

This comment was automatically generated by Dr. CI.
Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker.

This comment has been revised 2 times.

- Make processTimedOutFutures hold lock.
- Reduce unnecessary scan on future and future timeout maps.
- Reduce the scope of lock at a spot.
- Avoid repeatedly wake up if user set timeout = 0.

Differential Revision: [D5516149](https://our.internmc.facebook.com/intern/diff/D5516149/)

[ghstack-poisoned]
xush6528 added a commit that referenced this pull request Dec 12, 2019
Pull Request resolved: #30355

- Make processTimedOutFutures hold lock.
- Reduce unnecessary scan on future and future timeout maps.
- Reduce the scope of lock at a spot.
- Avoid repeatedly wake up if user set timeout = 0.

ghstack-source-id: 95409528

Differential Revision: [D5516149](https://our.internmc.facebook.com/intern/diff/D5516149/)

void ProcessGroupAgent::pollTimedOutRPCs() {
while (true) {
while (rpcRunning_.load()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need to hold the lock here while checking rpcRunning? That's why there was the while(true) before

Copy link
Contributor Author

@xush6528 xush6528 Dec 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rohan-varma

The lock was to avoid ::shutdown() noitifying when the watchdog thread is not waiting on the cond var, thus avoid the watchdog thread missing the notification.

With the predicate, we can allow the watch dog missing the notification.
Even if the watch dog thread misses the notification, the predicate will evaluate to false (since rpcRunning_ has already been set to False), the thread will not wait on the cond var, and gets back to the loop condition and decides to exit the loop.

if (!rpcRunning_.load()) {
return;
if (shouldUpdateMinEndTime) {
continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this continue, I guess this means processing of the timed out futures is going to be skipped until we do another round of updating the minEndTime? If there is a timed out future, then there will be some delay in processing it, do we need to have this continue?

Copy link
Contributor Author

@xush6528 xush6528 Dec 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rohan-varma

Sorry, I thought we are closed on #30355 (comment).

Two conditions it will stop the thread from waiting on the cond var.

  1. The cond var is notified. It means there is an earlier endTime key inserted into the map.
    So it will go back to the beginning of the loop, update the thread-local minEndTime and

    1. if now >= the latest minEndTime, it will process the futures immediately without waiting on the cond var.
    2. if now < the latest minEndTime, it will wait until minEndTime
  2. Time is up. now >= the latest minEndTime. It will wake from the wait and and process the timed out futures.

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 2488231.

@facebook-github-bot facebook-github-bot deleted the gh/xush6528/41/head branch December 15, 2019 15:16
wuhuikx pushed a commit to wuhuikx/pytorch that referenced this pull request Jan 30, 2020
Summary:
Pull Request resolved: pytorch#30355

- Make processTimedOutFutures hold lock.
- Reduce unnecessary scan on future and future timeout maps.
- Reduce the scope of lock at a spot.
- Avoid repeatedly wake up if user set timeout = 0.

ghstack-source-id: 95409528

Test Plan:
# Unit tests

```
buck test mode/dev-nosan //caffe2/test:rpc_fork -- test_rpc_timeouts

buck-out/gen/caffe2/test/rpc_fork\#binary.par -r test_rpc_timeouts
```

```
buck test mode/dev-nosan //caffe2/test:rpc_fork_thrift -- test_rpc_timeouts

buck-out/gen/caffe2/test/rpc_fork_thrift\#binary.par -r test_rpc_timeouts
```

Differential Revision: D5516149

fbshipit-source-id: 4bb0bd59fa31d9bfaef9f07ac0126782da17f762
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Merged module: rpc Related to RPC, distributed autograd, RRef, and distributed optimizer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants