
Conversation

@mrshenli (Contributor)

No description provided.

@mrshenli changed the title from "[Test-Only][Don't Review] Revert to use NCCL 2.7.8-1" to "[Test-Only][Don't Review] Use test_distributed_spawn for multigpu test" on Nov 10, 2020
dr-ci bot commented Nov 10, 2020

💊 CI failures summary and remediations

As of commit 38b7748 (more details on the Dr. CI page):


  • 2/2 failures introduced in this PR

🕵️ 2 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_paralleltbb_linux_xenial_py3_6_gcc5_4_build (1/2)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

Nov 16 16:22:11 sccache: error: couldn't connect to server
Nov 16 16:22:11 +++ eval 'extract_trap_cmd ' 
Nov 16 16:22:11 ++++ extract_trap_cmd 
Nov 16 16:22:11 ++++ printf '%s\n' '' 
Nov 16 16:22:11 +++ printf '%s\n' cleanup 
Nov 16 16:22:11 ++ trap -- ' 
Nov 16 16:22:11 cleanup' EXIT 
Nov 16 16:22:11 ++ [[ pytorch-paralleltbb-linux-xenial-py3.6-gcc5.4-build != *pytorch-win-* ]] 
Nov 16 16:22:11 ++ which sccache 
Nov 16 16:22:11 ++ sccache --stop-server 
Nov 16 16:22:11 Stopping sccache server... 
Nov 16 16:22:11 sccache: error: couldn't connect to server 
Nov 16 16:22:11 sccache: caused by: Connection refused (os error 111) 
Nov 16 16:22:11 ++ true 
Nov 16 16:22:11 ++ rm /var/lib/jenkins/sccache_error.log 
Nov 16 16:22:11 rm: cannot remove '/var/lib/jenkins/sccache_error.log': No such file or directory 
Nov 16 16:22:11 ++ true 
Nov 16 16:22:11 ++ [[ pytorch-paralleltbb-linux-xenial-py3.6-gcc5.4-build == *rocm* ]] 
Nov 16 16:22:11 ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 
Nov 16 16:22:11 ++ SCCACHE_IDLE_TIMEOUT=1200 
Nov 16 16:22:11 ++ RUST_LOG=sccache::server=error 
Nov 16 16:22:11 ++ sccache --start-server 

See CircleCI build pytorch_linux_xenial_cuda11_1_cudnn8_py3_gcc7_test (2/2)

Step: "Run tests" (full log | diagnosis details | 🔁 rerun)

Nov 16 21:49:54 [E request_callback_no_python.cpp:592] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future
Nov 16 21:49:54 At: 
Nov 16 21:49:54   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(98): serialize 
Nov 16 21:49:54   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(150): serialize 
Nov 16 21:49:54  
Nov 16 21:49:54 [E request_callback_no_python.cpp:592] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Nov 16 21:49:54  
Nov 16 21:49:54 At: 
Nov 16 21:49:54   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(98): serialize 
Nov 16 21:49:54   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(150): serialize 
Nov 16 21:49:54  
Nov 16 21:49:54 [E request_callback_no_python.cpp:592] Received error while processing request type 2: RuntimeError: Can not pickle torch.futures.Future 
Nov 16 21:49:54  
Nov 16 21:49:54 At: 
Nov 16 21:49:54   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(98): serialize 
Nov 16 21:49:54   /opt/conda/lib/python3.6/site-packages/torch/distributed/rpc/internal.py(150): serialize 
Nov 16 21:49:54  
Nov 16 21:49:54 [W tensorpipe_agent.cpp:504] RPC agent for worker3 encountered error when reading incoming request from worker0: EOF: end of file (this is expected to happen during shutdown) 
Nov 16 21:49:54 [W tensorpipe_agent.cpp:504] RPC agent for worker2 encountered error when reading incoming request from worker1: EOF: end of file (this is expected to happen during shutdown) 
Nov 16 21:49:54 [W tensorpipe_agent.cpp:504] RPC agent for worker1 encountered error when reading incoming request from worker0: EOF: end of file (this is expected to happen during shutdown) 
Nov 16 21:49:54 ok (1.228s) 
Nov 16 21:49:55   test_return_future_remote (__main__.TensorPipeRpcTestWithSpawn) ... [W tensorpipe_agent.cpp:504] RPC agent for worker2 encountered error when reading incoming request from worker0: EOF: end of file (this is expected to happen during shutdown) 
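
The repeated "Can not pickle torch.futures.Future" errors above come from the RPC serializer in torch/distributed/rpc/internal.py rejecting a response payload that contains a torch.futures.Future. The following is a minimal, illustrative sketch of the call pattern that triggers this error class; the worker names, port, and the returns_future helper are assumptions for the example and are not taken from the failing test:

# Minimal sketch (illustrative only, not the actual test code) of an RPC call
# whose return value is a torch.futures.Future, which the RPC pickler refuses
# to serialize and reports as "Can not pickle torch.futures.Future".
import os

import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp


def returns_future():
    # Returning the Future object itself (rather than its result) forces the
    # RPC layer to pickle it when building the response.
    fut = torch.futures.Future()
    fut.set_result(torch.ones(2))
    return fut


def run(rank, world_size):
    # Hypothetical local rendezvous settings for a two-process example.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    if rank == 0:
        try:
            rpc.rpc_sync("worker1", returns_future)
        except RuntimeError as err:
            # The remote serialization failure propagates back to the caller.
            print("caught expected error:", err)
    rpc.shutdown()


if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2, join=True)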

This comment was automatically generated by Dr. CI (expand for details). Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 13 times.

@mrshenli changed the title from "[Test-Only][Don't Review] Use test_distributed_spawn for multigpu test" to "[Test-Only][Don't Review] Re-enable test_distributed_fork.py in multigpu test" on Nov 16, 2020
@agolynski self-requested a review on November 16, 2020, 16:49
@mrshenli closed this on Dec 2, 2020