
Conversation

@mrshenli (Contributor)

No description provided.

@mrshenli changed the title from "[Test-Only][Don't Review] Revert to use NCCL 2.7.8-1" to "[Test-Only][Don't Review] Use test_distributed_spawn for multigpu test" on Nov 10, 2020
dr-ci bot commented Nov 10, 2020

💊 CI failures summary and remediations

As of commit 38b7748 (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_paralleltbb_linux_xenial_py3_6_gcc5_4_build (1/2)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

Nov 16 16:22:11 sccache: error: couldn't connect to server
Nov 16 16:22:11 +++ eval 'extract_trap_cmd ' 
Nov 16 16:22:11 ++++ extract_trap_cmd 
Nov 16 16:22:11 ++++ printf '%s\n' '' 
Nov 16 16:22:11 +++ printf '%s\n' cleanup 
Nov 16 16:22:11 ++ trap -- ' 
Nov 16 16:22:11 cleanup' EXIT 
Nov 16 16:22:11 ++ [[ pytorch-paralleltbb-linux-xenial-py3.6-gcc5.4-build != *pytorch-win-* ]] 
Nov 16 16:22:11 ++ which sccache 
Nov 16 16:22:11 ++ sccache --stop-server 
Nov 16 16:22:11 Stopping sccache server... 
Nov 16 16:22:11 sccache: error: couldn't connect to server 
Nov 16 16:22:11 sccache: caused by: Connection refused (os error 111) 
Nov 16 16:22:11 ++ true 
Nov 16 16:22:11 ++ rm /var/lib/jenkins/sccache_error.log 
Nov 16 16:22:11 rm: cannot remove '/var/lib/jenkins/sccache_error.log': No such file or directory 
Nov 16 16:22:11 ++ true 
Nov 16 16:22:11 ++ [[ pytorch-paralleltbb-linux-xenial-py3.6-gcc5.4-build == *rocm* ]] 
Nov 16 16:22:11 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 
Nov 16 16:22:11 ++ SCCACHE_IDLE_TIMEOUT=1200 
Nov 16 16:22:11 ++ RUST_LOG=sccache::server=error 
Nov 16 16:22:11 ++ sccache --start-server 

See CircleCI build pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_test (2/2)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Nov 16 21:49:54 [E request_callback_no_python.cpp:592] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future
Nov 16 21:49:54 At: 
Nov 16 21:49:54   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(98): serialize 
Nov 16 21:49:54   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(150): serialize 
Nov 16 21:49:54  
Nov 16 21:49:54 [E request_callback_no_python.cpp:592] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Nov 16 21:49:54  
Nov 16 21:49:54 At: 
Nov 16 21:49:54   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(98): serialize 
Nov 16 21:49:54   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(150): serialize 
Nov 16 21:49:54  
Nov 16 21:49:54 [E request_callback_no_python.cpp:592] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Nov 16 21:49:54  
Nov 16 21:49:54 At: 
Nov 16 21:49:54   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(98): serialize 
Nov 16 21:49:54   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(150): serialize 
Nov 16 21:49:54  
Nov 16 21:49:54 [W tensorpipe_agent.cpp:504] RPC agent for worker3 encountered error when reading incoming request from worker0: EOF: end of file (this is expected to happen during shutdown) 
Nov 16 21:49:54 [W tensorpipe_agent.cpp:504] RPC agent for worker2 encountered error when reading incoming request from worker1: EOF: end of file (this is expected to happen during shutdown) 
Nov 16 21:49:54 [W tensorpipe_agent.cpp:504] RPC agent for worker1 encountered error when reading incoming request from worker0: EOF: end of file (this is expected to happen during shutdown) 
Nov 16 21:49:54 ok (1.228s) 
Nov 16 21:49:55   test_return_future_remote (__main__.TensorPipeRpcTestWithSpawn) ... [W tensorpipe_agent.cpp:504] RPC agent for worker2 encountered error when reading incoming request from worker0: EOF: end of file (this is expected to happen during shutdown) 
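
The repeated "Can not pickle torch.futures.Future" errors above come from the RPC serializer in torch/distributed/rpc/internal.py rejecting a response payload that contains a torch.futures.Future. The following is a minimal, illustrative sketch of the call pattern that triggers this error class; the worker names, port, and the returns_future helper are assumptions for the example and are not taken from the failing test:

# Minimal sketch (illustrative only, not the actual test code) of an RPC call
# whose return value is a torch.futures.Future, which the RPC pickler refuses
# to serialize and reports as "Can not pickle torch.futures.Future".
import os

import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp


def returns_future():
    # Returning the Future object itself (rather than its result) forces the
    # RPC layer to pickle it when building the response.
    fut = torch.futures.Future()
    fut.set_result(torch.ones(2))
    return fut


def run(rank, world_size):
    # Hypothetical local rendezvous settings for a two-process example.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    if rank == 0:
        try:
            rpc.rpc_sync("worker1", returns_future)
        except RuntimeError as err:
            # The remote serialization failure propagates back to the caller.
            print("caught expected error:", err)
    rpc.shutdown()


if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2, join=True)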

This comment was automatically generated by Dr. CI (expand for details). Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 13 times.

@mrshenli changed the title from "[Test-Only][Don't Review] Use test_distributed_spawn for multigpu test" to "[Test-Only][Don't Review] Re-enable test_distributed_fork.py in multigpu test" on Nov 16, 2020
@agolynski self-requested a review on November 16, 2020, 16:49
@mrshenli closed this on Dec 2, 2020