[c10d] Reduce test time by reusing ProcessGroup #125648
Conversation
## Problem this PR resolves
Today, most distributed tests are arranged like this:
```
def test_allreduce(self):
    pg = self._create_process_group_nccl(store, self.opts())
    pg.allreduce(tensor)
    ...
```
Thus, we are paying PG creation time **per test**. That's bad. But why were we doing that? Is there a constraint?
If we look deeper, we find that most of our test cases inherit from `torch.testing._internal.common_distributed.MultiProcessTestCase`. From the name, nothing seems wrong, and it probably fits distributed testing well. But a "problem" exists in its `setUp()` and `tearDown()` methods, which basically do the following:
```
def setUp(self):
    self._spawn_processes()

def tearDown(self):
    for p in self.processes:
        p.terminate()
```
Since `setUp` and `tearDown` are **"test-scope fixtures"**, meaning they are called per test, each test gets brand-new processes. Of course we'd have to recreate the ProcessGroup every time.
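To make the fixture-scope difference concrete, here is a minimal, self-contained `unittest` sketch (no distributed pieces involved; the class and messages are purely illustrative): `setUp` runs before every test, while `setUpClass` runs once per class.
```
import unittest


class FixtureScopeDemo(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Class-scope fixture: runs once for the whole class.
        print("setUpClass: create the expensive resource once")

    def setUp(self):
        # Test-scope fixture: runs again before every single test.
        print("setUp: runs per test")

    def test_a(self):
        self.assertTrue(True)

    def test_b(self):
        self.assertTrue(True)


if __name__ == "__main__":
    unittest.main()
```
Running this prints the `setUpClass` message once and the `setUp` message twice, which is exactly why per-test PG creation follows from `MultiProcessTestCase`'s choice of fixture scope.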
## How we are fixing it
First, obviously, we need to move a PG's lifetime into a longer scope. Python `unittest` provides exactly such a helper, called **"class-scope fixtures."** They are embodied by a `setUpClass` method and a `tearDownClass` method (note the name difference), which are called only once for all tests in the same test class. Therefore, we would do:
```
@classmethod
def setUpClass(cls):
    dist.init_process_group(...)

@classmethod
def tearDownClass(cls):
    dist.destroy_process_group()
```
**In this PR, we create a new test template for distributed: `MultiProcContinousTest`, to hold this class-scope fixture.**
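For illustration only (this description does not spell out the `MultiProcContinousTest` API, so the class and helper names below are assumptions), a class-scope PG fixture written with plain `unittest` plus `torch.distributed` could look roughly like this, assuming the launcher has already set `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT`:
```
import unittest

import torch
import torch.distributed as dist


class ReusedPGSketch(unittest.TestCase):
    """Hypothetical sketch; the real MultiProcContinousTest template may differ."""

    @classmethod
    def setUpClass(cls):
        # One PG for every test in this class; env:// reads the launcher's env vars.
        dist.init_process_group(backend="nccl", init_method="env://")
        cls.rank = dist.get_rank()
        cls.device = torch.device("cuda", cls.rank % torch.cuda.device_count())

    @classmethod
    def tearDownClass(cls):
        dist.destroy_process_group()

    def test_allreduce(self):
        t = torch.ones(8, device=self.device)
        dist.all_reduce(t)  # default reduce op is SUM
        self.assertTrue(torch.equal(t, torch.full_like(t, dist.get_world_size())))

    def test_broadcast(self):
        t = torch.full((8,), float(self.rank), device=self.device)
        dist.broadcast(t, src=0)  # every rank ends up with rank 0's zeros
        self.assertTrue(torch.equal(t, torch.zeros_like(t)))
```
Both tests reuse the PG created in `setUpClass`, so neither pays the communicator-initialization cost again.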
Second, we'd need to avoid per-test process spawn and terminate. That's easy; we can either:
1. launch the whole test file with `torchrun --nproc-per-node=...`, or
2. use `mp.spawn()` under `if __name__ == "__main__":` (see the sketch after this list).
The point is to launch the processes only once.
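Here is an illustrative `__main__` block covering both launch modes; the helper name `_run_suite` and the hard-coded world size of 2 are assumptions, not part of this PR:
```
import os
import sys
import unittest

import torch.multiprocessing as mp


def _run_suite(rank: int, world_size: int) -> None:
    # Each spawned worker sets the env vars that init_method="env://" expects,
    # then runs the test suite defined in this module exactly once.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    suite = unittest.defaultTestLoader.loadTestsFromModule(sys.modules[__name__])
    unittest.TextTestRunner(verbosity=2).run(suite)


if __name__ == "__main__":
    if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
        # Launched via `torchrun --nproc-per-node=N this_file.py`: torchrun has
        # already created one process per rank and set the rendezvous env vars.
        unittest.main()
    else:
        # Launched via plain `python this_file.py`: spawn the workers ourselves,
        # once per file rather than once per test.
        nprocs = 2
        mp.spawn(_run_suite, args=(nprocs,), nprocs=nprocs)
```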
## Result
We moved the "positive tests" from `test_c10d_nccl.py` to `test_c10d_ops_nccl.py`.
Before this PR:
```
$ python test_c10d_nccl.py -k ProcessGroupNCCLTest
Ran 24 tests in 174.457s
```
After this PR:
```
$ torchrun --nproc-per-node 2 test_c10d_ops_nccl.py
or
$ python test_c10d_ops_nccl.py
Ran 24 tests in 16.247s
```
10X speedup.
## Limitation
For tests intended to exercise destroy or abort of PGs, we'd need to go back to the old style. So it would make sense to divide our tests into two classes: one for positive tests, where we reuse the PGs, and the other for abort/destroy and negative tests like watchdog timeout.
## Next step
Migrate the distributed tests that fit this test style!
cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k
@seemethere @albanD can you please unblock this PR? The majority of the line changes are due to moving tests from one file to another.
How does this work when a test errors on all ranks? Or only on some ranks? In particular, I am curious: if we do not spawn new processes per test, can we recover from one test erroring and continue through to the others in the same class?
Sounds good to skip the PR sanity check as this is code movement!
Not sure if I understand. If the user launches the test with
Good question. We discussed this in yesterday's weekly meeting.
I want to check whether this new UX differs from the existing one. Today, I can run a test file and get a summary at the end of which tests passed and which failed. If I understand correctly, with the new approach from this PR, once the first test fails, all subsequent tests in the same class now fail. It seems there may be some loss of test signal? (E.g. the CI "keep-going" label no longer applies in the same way for these tests.) If so, I would argue that this signal is important during debugging, and it is worth considering the implication for developer experience.
That's a fair concern. As described in the PR, we are moving a bunch of "positive" tests out. Though there may still be a non-zero chance that a "positive" test corrupts the PG, the percentage is relatively low across the CI activity of all PRs. Taking one step back: if an op indeed corrupts the PG, there is a high chance that other ops suffer the same issue. In that case, I am not sure --keep-going is a good fit. For example, in the old case, the test run may need to go through 300 timeouts before a failure signal is returned, whereas now our CI returns that signal as soon as the first timeout occurs.
One possible middle ground is to offer a way to switch the behavior back to the old one. Would it be useful if, after noticing a cascading failure on CI, a user could use an env var or flag to locally make the tests run the slower, per-test-isolated way? I prefer each test to be isolated and create its own PG, and if we could make PG init fast enough I'd rather go that route. But a 10x reduction in test time is too good to ignore, and it would help all of us most of the time.
@kwen2501 That makes sense for this set of tests! I was just thinking about distributed tests in general, where we may not make the same assumption that one test failing is indicative of other tests failing for the same reason (since the PR description says to migrate all distributed tests). To check my understanding: is the only issue PG corruption (and what causes PG corruption)? What if one test fails a numeric assertion? Would the rest of the tests no longer run, or is there a way for the process to continue? What if the assertion only fails on one rank?
A numeric assertion failure is "user behavior," so it does not affect the PG in any manner. The test class should therefore be able to continue with the rest of the tests healthy, by default.
PG corruption happens when, for example, the test deliberately tests comm abort, comm destroy, etc.
I agree. This PR does not remove the old style; I am just adding a new test template and moving the "positive" tests over to it.
wconstab left a comment:
I think we should ship this. It will be a process anyway: folks will migrate tests, and we'll see if any real usability issues pop up.
I would appreciate a way (a flag) to go back to the old behavior even with the new test class; hopefully that's easy to implement. Then we at least have an escape hatch if we get stuck down the line.
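One way such an escape hatch could look, as a hedged sketch only: the env var name `TORCH_TEST_FRESH_PG`, the class, and the port-bumping trick below are all made up for illustration, they assume the launcher has already set `RANK`/`WORLD_SIZE`/`MASTER_ADDR`/`MASTER_PORT`, and this only restores per-test PG re-creation, not per-test process spawn.
```
import os
import unittest

import torch
import torch.distributed as dist

# Hypothetical flag: when set to "1", fall back to a fresh PG per test.
FRESH_PG_PER_TEST = os.environ.get("TORCH_TEST_FRESH_PG", "0") == "1"


class EscapeHatchDemo(unittest.TestCase):
    _pg_counter = 0  # all ranks run the same tests in order, so they agree on this

    @classmethod
    def setUpClass(cls):
        if not FRESH_PG_PER_TEST:
            # Default (new) behavior: one PG shared by every test in the class.
            dist.init_process_group(backend="nccl", init_method="env://")

    @classmethod
    def tearDownClass(cls):
        if not FRESH_PG_PER_TEST and dist.is_initialized():
            dist.destroy_process_group()

    def setUp(self):
        if FRESH_PG_PER_TEST:
            # Escape-hatch (old) behavior: brand-new PG per test. Bump the port
            # so each rendezvous is independent of the previous one.
            port = int(os.environ["MASTER_PORT"]) + type(self)._pg_counter
            type(self)._pg_counter += 1
            dist.init_process_group(
                backend="nccl",
                init_method=f"tcp://{os.environ['MASTER_ADDR']}:{port}",
                rank=int(os.environ["RANK"]),
                world_size=int(os.environ["WORLD_SIZE"]),
            )

    def tearDown(self):
        if FRESH_PG_PER_TEST and dist.is_initialized():
            dist.destroy_process_group()

    def test_allreduce(self):
        device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
        t = torch.ones(4, device=device)
        dist.all_reduce(t)
        self.assertTrue(torch.equal(t, torch.full_like(t, dist.get_world_size())))
```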
@wconstab thanks for the suggestion!
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.