Skip to content

Split test_c10d.py to test_c10d_common.py, test_c10d_gloo.py, test_c10d_nccl.py#56598

Closed
pbelevich wants to merge 1 commit intopytorch:masterfrom
pbelevich:export-D27913170
Closed

Split test_c10d.py to test_c10d_common.py, test_c10d_gloo.py, test_c10d_nccl.py#56598
pbelevich wants to merge 1 commit intopytorch:masterfrom
pbelevich:export-D27913170

Conversation

@pbelevich
Copy link
Copy Markdown
Contributor

Test Plan: NA

Differential Revision: D27913170

@facebook-github-bot
Copy link
Copy Markdown
Contributor

facebook-github-bot commented Apr 21, 2021

💊 CI failures summary and remediations

As of commit 751bf5e (more details on the Dr. CI page):


  • 5/7 failures possibly* introduced in this PR
    • 1/5 non-scanned failure(s)
  • 2/7 broken upstream at merge base 76fbd75 since Apr 21

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_test1 (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

RuntimeError: Found no NVIDIA driver on your sy...ver from http://www.nvidia.com/Download/index.aspx
1554:16cc @ 00550265 - LdrpGetProcedureAddress - INFO: Locating procedure "FlsGetValue" by name
(1554.16cc): C++ EH exception - code e06d7363 (first chance)
(1554.16cc): C++ EH exception - code e06d7363 (first chance)
(1554.16cc): C++ EH exception - code e06d7363 (first chance)
(1554.16cc): C++ EH exception - code e06d7363 (first chance)
(1554.16cc): C++ EH exception - code e06d7363 (first chance)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\circleci\project\build\win_tmp\build\torch\cuda\__init__.py", line 170, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
1554:16cc @ 00550312 - LdrLoadDll - ENTER: DLL name: api-ms-win-appmodel-runtime-l1-1-2
1554:16cc @ 00550312 - LdrpPreprocessDllName - INFO: DLL api-ms-win-appmodel-runtime-l1-1-2 was redirected to C:\Windows\SYSTEM32\kernel.appcore.dll by API set
1554:16cc @ 00550312 - LdrpLoadDllInternal - ENTER: DLL name: C:\Windows\SYSTEM32\kernel.appcore.dll
1554:16cc @ 00550328 - LdrpFindKnownDll - ENTER: DLL name: kernel.appcore.dll
1554:16cc @ 00550328 - LdrpFindKnownDll - RETURN: Status: 0x00000000
1554:16cc @ 00550328 - LdrpMinimalMapModule - ENTER: DLL name: C:\Windows\System32\kernel.appcore.dll
ModLoad: 00007ffb`fd1a0000 00007ffb`fd1b1000   C:\Windows\System32\kernel.appcore.dll
1554:16cc @ 00550328 - LdrpMinimalMapModule - RETURN: Status: 0x00000000
1554:16cc @ 00550328 - LdrpFindDllActivationContext - INFO: Probing for the manifest of DLL "C:\Windows\System32\kernel.appcore.dll" failed with status 0xc000008a
1554:16cc @ 00550328 - LdrpPreprocessDllName - INFO: DLL api-ms-win-core-profile-l1-1-0.dll was redirected to C:\Windows\SYSTEM32\kernelbase.dll by API set

3 failures not recognized by patterns:

Job Step Action
CircleCI pytorch_linux_xenial_py3_clang5_asan_test2 Run tests 🔁 rerun
CircleCI pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test2 Run tests 🔁 rerun
CircleCI pytorch_linux_xenial_cuda10_2_cudnn7_py3_gcc7_test1 Run tests 🔁 rerun

🚧 1 ongoing upstream failure:

These were probably caused by upstream breakages that are not fixed yet.

🚧 1 fixed upstream failure:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch (expand for instructions)

If your commit is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request was exported from Phabricator. Differential Revision: D27913170

1 similar comment
@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request was exported from Phabricator. Differential Revision: D27913170

@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request was exported from Phabricator. Differential Revision: D27913170

@wayi1
Copy link
Copy Markdown
Contributor

wayi1 commented Apr 21, 2021

Thanks for the refactoring! This will definitely make the test codebase much more manageable.

On the other hand, I think this PR does a bit more than plain splitting. Therefore, it will be better to add some descriptions in the PR to highlight some changes, e.g., creating some abstract classes like AbstractCommTest, AbstractDistributedDataParallelTest and AbstractTimeoutTest.

Another suggestion is separating out many fundamental store-related unit tests into a separate file, as these tests and other distributed training tests target different layers. This can be done in a fllow-up PR if you'd like.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this test in gloo rather than common?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 181:

        c10d.init_process_group(backend="gloo", init_method="env://")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to add @requires_gloo() here.

Copy link
Copy Markdown
Contributor

@wayi1 wayi1 Apr 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indentation here and a few "ValueError" lines and "RuntimeError" lines are changed. Previously it's aligned with the end of "with". Is this intentional?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation changed.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request was exported from Phabricator. Differential Revision: D27913170

@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request was exported from Phabricator. Differential Revision: D27913170

…0d_nccl.py (pytorch#56598)

Summary: Pull Request resolved: pytorch#56598

Test Plan: NA

Reviewed By: SciPioneer

Differential Revision: D27913170

fbshipit-source-id: 4f2c09037b94b37b87545fd2b48ed6502c47433b
@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request was exported from Phabricator. Differential Revision: D27913170

@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request has been merged in 5cc75e4.

wayi1 pushed a commit that referenced this pull request Apr 22, 2021
The process group is created on a gloo backend.

Context: #56598 (comment)

Differential Revision: [D27936805](https://our.internmc.facebook.com/intern/diff/D27936805/)

ghstack-source-id: 127150910
Pull Request resolved: #56682
wayi1 pushed a commit that referenced this pull request Apr 22, 2021
facebook-github-bot pushed a commit that referenced this pull request Apr 22, 2021
Summary:
Pull Request resolved: #56682

The process group is created on a gloo backend.

Context: #56598 (comment)
ghstack-source-id: 127150910

Test Plan: buck test caffe2/test/distributed:c10d -- test_logging_init

Reviewed By: pbelevich

Differential Revision: D27936805

fbshipit-source-id: 932efc638f94bdf78ddbae291e3720a20e43f2af
facebook-github-bot pushed a commit that referenced this pull request Apr 29, 2021
Summary:
Fixes test failure after #56598

Introduced by #45335.

Pull Request resolved: #57000

Reviewed By: zou3519

Differential Revision: D28030360

Pulled By: seemethere

fbshipit-source-id: 4871d51e6b80dceef8bf95c6c658441287575f63
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
…0d_nccl.py (pytorch#56598)

Summary: Pull Request resolved: pytorch#56598

Test Plan: NA

Reviewed By: SciPioneer

Differential Revision: D27913170

fbshipit-source-id: 3439d18141131b02d55f2ca399a4c795cba2b04b
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
Summary:
Pull Request resolved: pytorch#56682

The process group is created on a gloo backend.

Context: pytorch#56598 (comment)
ghstack-source-id: 127150910

Test Plan: buck test caffe2/test/distributed:c10d -- test_logging_init

Reviewed By: pbelevich

Differential Revision: D27936805

fbshipit-source-id: 932efc638f94bdf78ddbae291e3720a20e43f2af
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
Summary:
Fixes test failure after pytorch#56598

Introduced by pytorch#45335.

Pull Request resolved: pytorch#57000

Reviewed By: zou3519

Differential Revision: D28030360

Pulled By: seemethere

fbshipit-source-id: 4871d51e6b80dceef8bf95c6c658441287575f63
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed fb-exported Merged oncall: distributed Add this issue/PR to distributed oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants