Skip to content

Conversation

@goldsborough
Copy link
Contributor

@goldsborough goldsborough commented Jul 21, 2018

Before I can rewrite portions of the c10d DDP in C++ I need proper tests in place to make sure I am not breaking anything as I port code. There were no tests for the c10d DDP in place so I wrote some.

I refactored the c10d tests to derive some tests cases from a general MultiGPUTestCase and followed lots of patterns from test_distributed.py w.r.t. how tests are skipped (such that the main process doesn't initialize CUDA, which I found is a super important detail!!!).

I am largely unfamiliar with this code so feel free to scrutinize. The DDP test code itself is also largely taken from test_distributed.py but more inlined which I find easier to read.

Test Plan:

  1. Ran gloo backend tests successfully on my devgpu
  2. Ran NCCL (and gloo) backend tests on devfair successfully.

@apaszke @pietern @teng-li

This comment was marked as off-topic.

This comment was marked as off-topic.

@soumith
Copy link
Contributor

soumith commented Jul 21, 2018

@pytorchbot retest this please

@ezyang
Copy link
Contributor

ezyang commented Jul 23, 2018

test failures look real

05:04:25 ======================================================================
05:04:25 FAIL: test_nccl_backend (__main__.DistributedDataParallelTest)
05:04:25 ----------------------------------------------------------------------
05:04:25 Traceback (most recent call last):
05:04:25   File "test_c10d.py", line 190, in wrapper
05:04:25     self._join_processes(fn)
05:04:25   File "test_c10d.py", line 235, in _join_processes
05:04:25     self._check_return_codes()
05:04:25   File "test_c10d.py", line 244, in _check_return_codes
05:04:25     self.assertEqual(p.exitcode, first_process.exitcode)
05:04:25   File "/var/lib/jenkins/workspace/test/common.py", line 364, in assertEqual
05:04:25     super(TestCase, self).assertEqual(x, y, message)
05:04:25 AssertionError: None != 1 : 

@goldsborough
Copy link
Contributor Author

@teng-li @apaszke looking for quick feedback so that I can post more PRs that need this test

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@goldsborough has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

jramseyer pushed a commit to jramseyer/pytorch that referenced this pull request Jul 30, 2018
Summary:
Before I can rewrite portions of the c10d DDP in C++ I need proper tests in place to make sure I am not breaking anything as I port code. There were no tests for the c10d DDP in place so I wrote some.

I refactored the c10d tests to derive some tests cases from a general `MultiGPUTestCase` and followed lots of patterns from `test_distributed.py` w.r.t. how tests are skipped (such that the main process doesn't initialize CUDA, which I found is a super important detail!!!).

I am largely unfamiliar with this code so feel free to scrutinize. The DDP test code itself is also largely taken from `test_distributed.py` but more inlined which I find easier to read.
Pull Request resolved: pytorch#9670

Differential Revision: D8977724

Pulled By: goldsborough

fbshipit-source-id: 186eab38a72384d7992a2ec5c89f304ad42d5944
facebook-github-bot pushed a commit that referenced this pull request Aug 1, 2018
Summary:
This PR depends on the tests added in #9670. It moves the first, tiny function from the c10d DDP to C++: `dist_broadcast_coalesced`. Let me know if ` torch/csrc/distributed/c10d/ddp.h` will be a good place to put these rewritten functions.

pietern The controller you requested could not be found. apaszke
Pull Request resolved: #9729

Differential Revision: D8985308

Pulled By: goldsborough

fbshipit-source-id: dc459fe9040273714044152063585e746974752f
goodlux pushed a commit to goodlux/pytorch that referenced this pull request Aug 15, 2018
Summary:
Before I can rewrite portions of the c10d DDP in C++ I need proper tests in place to make sure I am not breaking anything as I port code. There were no tests for the c10d DDP in place so I wrote some.

I refactored the c10d tests to derive some tests cases from a general `MultiGPUTestCase` and followed lots of patterns from `test_distributed.py` w.r.t. how tests are skipped (such that the main process doesn't initialize CUDA, which I found is a super important detail!!!).

I am largely unfamiliar with this code so feel free to scrutinize. The DDP test code itself is also largely taken from `test_distributed.py` but more inlined which I find easier to read.
Pull Request resolved: pytorch#9670

Differential Revision: D8977724

Pulled By: goldsborough

fbshipit-source-id: 186eab38a72384d7992a2ec5c89f304ad42d5944
goodlux pushed a commit to goodlux/pytorch that referenced this pull request Aug 15, 2018
Summary:
This PR depends on the tests added in pytorch#9670. It moves the first, tiny function from the c10d DDP to C++: `dist_broadcast_coalesced`. Let me know if ` torch/csrc/distributed/c10d/ddp.h` will be a good place to put these rewritten functions.

pietern The controller you requested could not be found. apaszke
Pull Request resolved: pytorch#9729

Differential Revision: D8985308

Pulled By: goldsborough

fbshipit-source-id: dc459fe9040273714044152063585e746974752f
@ezyang ezyang added the merged label Jun 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants