[c10d] Reduce test time by reusing ProcessGroup #125648
Conversation
## Problem this PR resolves
Today, most distributed tests are arranged like this:
```
def test_allreduce(self):
    pg = self._create_process_group_nccl(store, self.opts())
    pg.allreduce(tensor)
    ...
```
Thus, we are paying PG creation time **per test**. That's bad. But why were we doing that? Is there a constraint?
If we look deeper, we find that most of our test cases inherit from `torch.testing._internal.common_distributed.MultiProcessTestCase`. From the name, nothing seems wrong, and it probably fits distributed testing well. But a "problem" exists in its `setUp()` and `tearDown()` methods, which basically do the following:
```
def setUp(self):
    self._spawn_processes()

def tearDown(self):
    for p in self.processes:
        p.terminate()
```
Since `setUp` and `tearDown` are **"test-scope fixtures"**, meaning they are called per test, each test gets brand-new processes. Of course we'd have to recreate the ProcessGroup every time.
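To make the fixture-scope difference concrete, here is a minimal, self-contained `unittest` sketch (no distributed pieces involved; the class and messages are purely illustrative): `setUp` runs before every test, while `setUpClass` runs once per class.
```
import unittest


class FixtureScopeDemo(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Class-scope fixture: runs once for the whole class.
        print("setUpClass: create the expensive resource once")

    def setUp(self):
        # Test-scope fixture: runs again before every single test.
        print("setUp: runs per test")

    def test_a(self):
        self.assertTrue(True)

    def test_b(self):
        self.assertTrue(True)


if __name__ == "__main__":
    unittest.main()
```
Running this prints the `setUpClass` message once and the `setUp` message twice, which is exactly why per-test PG creation follows from `MultiProcessTestCase`'s choice of fixture scope.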
## How we are fixing it
First, obviously, we need to move a PG's lifetime into a longer scope. Python `unittest` provides exactly such a helper, called **"class-scope fixtures."** They are embodied by a `setUpClass` method and a `tearDownClass` method (note the name difference), which are called only once for all tests in the same test class. Therefore, we would do:
```
@classmethod
def setUpClass(cls):
    dist.init_process_group(...)

@classmethod
def tearDownClass(cls):
    dist.destroy_process_group()
```
**In this PR, we create a new test template for distributed: `MultiProcContinousTest`, to hold this class-scope fixture.**
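For illustration only (this description does not spell out the `MultiProcContinousTest` API, so the class and helper names below are assumptions), a class-scope PG fixture written with plain `unittest` plus `torch.distributed` could look roughly like this, assuming the launcher has already set `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT`:
```
import unittest

import torch
import torch.distributed as dist


class ReusedPGSketch(unittest.TestCase):
    """Hypothetical sketch; the real MultiProcContinousTest template may differ."""

    @classmethod
    def setUpClass(cls):
        # One PG for every test in this class; env:// reads the launcher's env vars.
        dist.init_process_group(backend="nccl", init_method="env://")
        cls.rank = dist.get_rank()
        cls.device = torch.device("cuda", cls.rank % torch.cuda.device_count())

    @classmethod
    def tearDownClass(cls):
        dist.destroy_process_group()

    def test_allreduce(self):
        t = torch.ones(8, device=self.device)
        dist.all_reduce(t)  # default reduce op is SUM
        self.assertTrue(torch.equal(t, torch.full_like(t, dist.get_world_size())))

    def test_broadcast(self):
        t = torch.full((8,), float(self.rank), device=self.device)
        dist.broadcast(t, src=0)  # every rank ends up with rank 0's zeros
        self.assertTrue(torch.equal(t, torch.zeros_like(t)))
```
Both tests reuse the PG created in `setUpClass`, so neither pays the communicator-initialization cost again.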
Second, we'd need to avoid per-test process spawn and terminate. That's easy; we can either:
1. launch the whole test file with `torchrun --nproc-per-node=...`, or
2. use `mp.spawn()` under `if __name__ == "__main__":` (see the sketch after this list).
The point is to launch the processes only once.
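Here is an illustrative `__main__` block covering both launch modes; the helper name `_run_suite` and the hard-coded world size of 2 are assumptions, not part of this PR:
```
import os
import sys
import unittest

import torch.multiprocessing as mp


def _run_suite(rank: int, world_size: int) -> None:
    # Each spawned worker sets the env vars that init_method="env://" expects,
    # then runs the test suite defined in this module exactly once.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    suite = unittest.defaultTestLoader.loadTestsFromModule(sys.modules[__name__])
    unittest.TextTestRunner(verbosity=2).run(suite)


if __name__ == "__main__":
    if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        # Launched via `torchrun --nproc-per-node=N this_file.py`: torchrun has
        # already created one process per rank and set the rendezvous env vars.
        unittest.main()
    else:
        # Launched via plain `python this_file.py`: spawn the workers ourselves,
        # once per file rather than once per test.
        nprocs = 2
        mp.spawn(_run_suite, args=(nprocs,), nprocs=nprocs)
```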
## Result
We moved the "positive tests" from `test_c10d_nccl.py` to `test_c10d_ops_nccl.py`.
Before this PR:
```
$ python test_c10d_nccl.py -k ProcessGroupNCCLTest
Ran 24 tests in 174.457s
```
After this PR:
```
$ torchrun --nproc-per-node 2 test_c10d_ops_nccl.py
or
$ python test_c10d_ops_nccl.py
Ran 24 tests in 16.247s
```
10X speedup.
## Limitation
For tests intended to exercise destroy or abort of PGs, we'd need to go back to the old style. So it would make sense to divide our tests into two classes: one for positive tests, where we reuse the PGs, and the other for abort/destroy and negative tests like watchdog timeout.
## Next step
Migrate the distributed tests that fit this test style!
cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k
@seemethere @albanD can you please unblock this PR? The majority of the line changes are due to moving tests from one file to another.
How does this work when a test errors on all ranks? Or only on some ranks? In particular, I am curious: if we do not spawn new processes per test, can we recover from one test erroring and continue through to the others in the same class?
Sounds good to skip the PR sanity check as this is code movement!
Not sure if I understand. If the user launches the test with
Good question. We discussed this in yesterday's weekly meeting.
I want to check whether this new UX differs from the existing one. Today, I can run a test file and get a summary at the end of which tests passed and which failed. If I understand correctly, with the new approach from this PR, once the first test fails, all subsequent tests in the same class now fail. It seems there may be some loss of test signal? (E.g. the CI "keep-going" label no longer applies in the same way for these tests.) If so, I would argue that this signal is important during debugging, and it is worth considering the implication for developer experience.
That's a fair concern. As described in the PR, we are moving a bunch of "positive" tests out. Though there may still be a non-zero chance that a "positive" test corrupts the PG, the percentage is relatively low across the CI activity of all PRs. Taking one step back: if an op indeed corrupts the PG, there is a high chance that other ops suffer the same issue. In that case, I am not sure --keep-going is a good fit. For example, in the old case, the test run may need to go through 300 timeouts before a failure signal is returned, whereas now our CI returns that signal as soon as the first timeout occurs.
One possible middle ground is to offer a way to switch the behavior back to the old one. Would it be useful if, after noticing a cascading failure on CI, a user could use an env var or flag to locally make the tests run the slower, per-test-isolated way? I prefer each test to be isolated and create its own PG, and if we could make PG init fast enough I'd rather go that route. But a 10x reduction in test time is too good to ignore, and it would help all of us most of the time.
@kwen2501 That makes sense for this set of tests! I was just thinking about distributed tests in general, where we may not make the same assumption that one test failing is indicative of other tests failing for the same reason (since the PR description says to migrate all distributed tests). To check my understanding: is the only issue PG corruption (and what causes PG corruption)? What if one test fails a numeric assertion? Would the rest of the tests no longer run, or is there a way for the process to continue? What if the assertion only fails on one rank?
A numeric assertion failure is "user behavior," so it does not affect the PG in any manner. The test class should therefore be able to continue with the rest of the tests healthy, by default.
PG corruption happens when, for example, the test deliberately tests comm abort, comm destroy, etc.
I agree. This PR does not remove the old style; I am just adding a new test template and moving the "positive" tests over to it.
wconstab left a comment:
I think we should ship this. It will be a process anyway: folks will migrate tests, and we'll see if any real usability issues pop up.
I would appreciate a way (a flag) to go back to the old behavior even with the new test class; hopefully that's easy to implement. Then we at least have an escape hatch if we get stuck down the line.
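One way such an escape hatch could look, as a hedged sketch only: the env var name `TORCH_TEST_FRESH_PG`, the class, and the port-bumping trick below are all made up for illustration, they assume the launcher has already set `RANK`/`WORLD_SIZE`/`MASTER_ADDR`/`MASTER_PORT`, and this only restores per-test PG re-creation, not per-test process spawn.
```
import os
import unittest

import torch
import torch.distributed as dist

# Hypothetical flag: when set to "1", fall back to a fresh PG per test.
FRESH_PG_PER_TEST = os.environ.get("TORCH_TEST_FRESH_PG", "0") == "1"


class EscapeHatchDemo(unittest.TestCase):
    _pg_counter = 0  # all ranks run the same tests in order, so they agree on this

    @classmethod
    def setUpClass(cls):
        if not FRESH_PG_PER_TEST:
            # Default (new) behavior: one PG shared by every test in the class.
            dist.init_process_group(backend="nccl", init_method="env://")

    @classmethod
    def tearDownClass(cls):
        if not FRESH_PG_PER_TEST and dist.is_initialized():
            dist.destroy_process_group()

    def setUp(self):
        if FRESH_PG_PER_TEST:
            # Escape-hatch (old) behavior: brand-new PG per test. Bump the port
            # so each rendezvous is independent of the previous one.
            port = int(os.environ["MASTER_PORT"]) + type(self)._pg_counter
            type(self)._pg_counter += 1
            dist.init_process_group(
                backend="nccl",
                init_method=f"tcp://{os.environ['MASTER_ADDR']}:{port}",
                rank=int(os.environ["RANK"]),
                world_size=int(os.environ["WORLD_SIZE"]),
            )

    def tearDown(self):
        if FRESH_PG_PER_TEST and dist.is_initialized():
            dist.destroy_process_group()

    def test_allreduce(self):
        device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
        t = torch.ones(4, device=device)
        dist.all_reduce(t)
        self.assertTrue(torch.equal(t, torch.full_like(t, dist.get_world_size())))
```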
@wconstab thanks for the suggestion!
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.