
Fix path handling on Win32 in rendezvous.py #57000

Closed

skyline75489 wants to merge 2 commits into pytorch:master from
skyline75489:chesterliu/dev/fix-rendezvous-path


Conversation

Contributor

skyline75489 commented Apr 27, 2021

Fixes test failure after #56598

Introduced by #45335. This PR makes _file_rendezvous_handler accept file URLs with Windows paths, for example file://C:\Users\me\temp and file://C:/Users/me/temp.
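The core difficulty is that urlparse splits a Windows drive letter into the URL's network-location component. A minimal sketch of the idea (an illustration only, not the actual PR code; `file_url_to_path` is a hypothetical helper name):

```python
from urllib.parse import urlparse

def file_url_to_path(url: str) -> str:
    """Extract a filesystem path from a file:// URL, tolerating
    Windows-style URLs such as file://C:/Users/me/temp.
    (A sketch of the idea, not the actual rendezvous.py code.)"""
    result = urlparse(url)
    path = result.path
    # On Windows-style URLs, urlparse treats the drive letter ("C:")
    # as the network location, so it must be re-joined with the path.
    if result.netloc:
        path = result.netloc + path
    return path

print(file_url_to_path("file://C:/Users/me/temp"))  # C:/Users/me/temp
print(file_url_to_path("file:///tmp/rendezvous"))   # /tmp/rendezvous
```

Without the netloc re-join, the first example would come back as `/Users/me/temp`, dropping the drive letter entirely.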

Contributor

facebook-github-bot commented Apr 27, 2021

💊 CI failures summary and remediations

As of commit 045dd70 (more details on the Dr. CI page):



🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_test1 (1/1)

Step: "Test"

======================================================================
FAIL [4.612s]: test_cudnn_multiple_threads_same_device (__main__.TestCuda)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_utils.py", line 439, in wrapper
    fn(*args, **kwargs)
  File "test_cuda.py", line 2505, in test_cudnn_multiple_threads_same_device
    (2048 - test_iters) * (2048 - test_iters))
  File "C:\Users\circleci\project\build\win_tmp\build\torch\testing\_internal\common_utils.py", line 1371, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Scalars failed to compare as equal! Comparing 1687401.0 and 1098304 gives a difference of 589097.0, but the allowed difference with rtol=1.3e-06 and atol=1e-05 is only 1.4278052!

----------------------------------------------------------------------
Ran 159 tests in 79.360s

FAILED (failures=1, skipped=67)

Generating XML reports...
Generated XML report: test-reports\dist-gloo\test_cuda\TEST-TestCuda-20210427110024.xml
Generated XML report: test-reports\dist-gloo\test_cuda\TEST-TestCudaComm-20210427110024.xml
Traceback (most recent call last):
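For context, the "allowed difference" in the log above is consistent with the usual combined-tolerance rule, allowed = atol + rtol * |expected|. A small sketch of that convention (not the test framework's actual code):

```python
# Combined absolute/relative tolerance, the convention used by
# isclose-style comparisons: allowed = atol + rtol * |expected|.
def allowed_difference(expected: float, rtol: float, atol: float) -> float:
    return atol + rtol * abs(expected)

# Values taken from the CI log above.
rtol, atol = 1.3e-06, 1e-05
expected, actual = 1098304.0, 1687401.0

allowed = allowed_difference(expected, rtol, atol)
diff = abs(actual - expected)
print(f"difference {diff:.1f}, allowed {allowed:.7f}")
# difference 589097.0, allowed 1.4278052
```

The actual difference (589097.0) dwarfs the allowed tolerance (about 1.43), which is why the comparison fails so decisively.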

❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See CircleCI build pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2 (1/1)

Step: "Run tests" ❄️

Apr 27 10:57:09 ======================================================================
Apr 27 10:57:09 ERROR [110.105s]: test_nccl_high_priority_stream (__main__.TestDistBackendWithSpawn)
Apr 27 10:57:09 ----------------------------------------------------------------------
Apr 27 10:57:09 Traceback (most recent call last):
Apr 27 10:57:09   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 374, in wrapper
Apr 27 10:57:09     self._join_processes(fn)
Apr 27 10:57:09   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 566, in _join_processes
Apr 27 10:57:09     self._check_return_codes(elapsed_time)
Apr 27 10:57:09   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 614, in _check_return_codes
Apr 27 10:57:09     raise RuntimeError('Process {} terminated or timed out after {} seconds'.format(i, elapsed_time))
Apr 27 10:57:09 RuntimeError: Process 0 terminated or timed out after 110.08238077163696 seconds
Apr 27 10:57:09 
Apr 27 10:57:09 ----------------------------------------------------------------------
Apr 27 10:57:09 Ran 196 tests in 657.286s
Apr 27 10:57:09 
Apr 27 10:57:09 FAILED (errors=4, skipped=117)
Apr 27 10:57:09 
Apr 27 10:57:09 Generating XML reports...
Apr 27 10:57:09 Generated XML report: test-reports/dist-nccl/distributed.test_distributed_spawn/TEST-TestDistBackendWithSpawn-20210427104612.xml
Apr 27 10:57:09 Traceback (most recent call last):
Apr 27 10:57:09   File "test/run_test.py", line 1156, in <module>

This comment was automatically generated by Dr. CI.

@gunandrose4u
Contributor

LGTM

@facebook-github-bot
Contributor

@seemethere has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@seemethere
Member

LGTM, awaiting an approval from someone from the distributed team

@facebook-github-bot
Contributor

@seemethere merged this pull request in e31265d.

krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
Summary:
Fixes test failure after pytorch#56598

Introduced by pytorch#45335.

Pull Request resolved: pytorch#57000

Reviewed By: zou3519

Differential Revision: D28030360

Pulled By: seemethere

fbshipit-source-id: 4871d51e6b80dceef8bf95c6c658441287575f63

Labels

cla signed, Merged, oncall: distributed, open source

5 participants