Generalization of distributed test cases for non CUDA devices #136988

Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136988
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 3ebea74 with merge base 8dddd45.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
kwen2501 left a comment:
Thanks for the PR.
I didn't look closely, but it seems there are a lot of replacements from cuda to hpu. I wonder if this PR is a demo and not for landing?
| "importerror": TestSkip(88, "Test skipped due to missing import"), | ||
| "no_hpu": TestSkip(88, "HPU is not available."), |
Any reason for using the same skip code?
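For illustration, the HPU skip could get its own exit code rather than reusing 88. The sketch below is a hedged example; the field layout of `TestSkip` and the specific codes are assumptions, not the values in `common_distributed.py`.

```python
# Sketch only: give the HPU skip a distinct exit code so CI can tell an
# import failure apart from a missing HPU. Code 89 is an illustrative choice.
from typing import NamedTuple

class TestSkip(NamedTuple):
    exit_code: int
    message: str

TEST_SKIPS = {
    "importerror": TestSkip(88, "Test skipped due to missing import"),
    "no_hpu": TestSkip(89, "HPU is not available."),  # hypothetical code
}
```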
| backend_feature["hpu"] = {"hccl"} | ||
| backend_feature["plugin"] = set() | ||
| if TEST_HPU: | ||
| backend_feature["hpu"] = {"hccl"} |
Duplicated setting.
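A minimal sketch of the de-duplicated block, assuming the intent is to register the HPU entry only when the availability flag is set; `backend_feature` and `TEST_HPU` stand in for the module-level names in the surrounding test utility.

```python
# Sketch of the de-duplicated version: set the HPU entry once, behind the guard.
TEST_HPU = False  # placeholder; the real flag reflects HPU availability
backend_feature = {"plugin": set()}
if TEST_HPU:
    backend_feature["hpu"] = {"hccl"}
```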
Also cc @wesbland @pavanbalaji on the name.
)
import operator

import habana_frameworks.torch as ht
I am not so sure about this direct import. Is there a better way to do it?
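One common alternative, shown here only as a sketch rather than the approach the PR actually takes, is to guard the import and expose an availability flag so machines without `habana_frameworks` can still import the module:

```python
# Sketch: make the HPU import optional instead of unconditional.
try:
    import habana_frameworks.torch as ht
    TEST_HPU = ht.hpu.is_available()
except ImportError:
    ht = None
    TEST_HPU = False
```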
if not torch.cuda.is_available():
    sys.exit(TEST_SKIPS["no_cuda"].exit_code)
if not ht.hpu.is_available():
    sys.exit(TEST_SKIPS["no_hpu"].exit_code)
Why removing the cuda lines?
if torch.cuda.is_available() and torch.cuda.device_count() >= x:
if ht.hpu.is_available() and ht.hpu.device_count() >= x:
Same.
return torch.cuda.device_count()
return ht.hpu.device_count()
Same
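To address the three comments above without dropping the CUDA path, a device-agnostic helper could check whichever accelerator is present. This is a rough sketch and `acc_device_count` is a hypothetical name, not a helper from the PR.

```python
# Sketch: keep the CUDA path and fall back to HPU when CUDA is not available.
import torch

try:
    import habana_frameworks.torch as ht
except ImportError:
    ht = None

def acc_device_count() -> int:
    """Hypothetical helper: count devices on whichever accelerator is present."""
    if torch.cuda.is_available():
        return torch.cuda.device_count()
    if ht is not None and ht.hpu.is_available():
        return ht.hpu.device_count()
    return 0
```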
…ch#136906) Summary: Add a device type C shim interface to support test_open_device_registration in the ABI-compatible mode. Pull Request resolved: pytorch#136906 Approved by: https://github.com/chenyang78
Summary: Switch test_cpu_cpp_wrapper and test_cuda_cpp_wrapper to test the ABI-compatible mode only. Fixed a missing Py_NewRef issue for python 3.9. Pull Request resolved: pytorch#136904 Approved by: https://github.com/Yoggie9477, https://github.com/chenyang78
Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#134545 Approved by: https://github.com/ezyang
…pytorch#136670) Fixes pytorch#136640 Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1. In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately. In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard: ``` Eq((64//((2048//(s3*((s2//s3))))))), 1) ``` I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True. I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues: (1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions (2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though. Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)` Pull Request resolved: pytorch#136670 Approved by: https://github.com/ezyang
…ses) (pytorch#136759) this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher: (1) inference vs training code paths (2) "subclasses" vs "no subclasses" codepaths Also see pytorch#136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely) I ran locally, and got these numbers on the 4 paths: ``` collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu compile time instruction count for iteration 0 is 11692348671 compile time instruction count for iteration 1 is 3026287204 compile time instruction count for iteration 2 is 3011467318 compile time instruction count for iteration 3 is 3004485935 compile time instruction count for iteration 4 is 3003087410 collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu compile time instruction count for iteration 0 is 6068003223 compile time instruction count for iteration 1 is 5585418102 compile time instruction count for iteration 2 is 5581856618 compile time instruction count for iteration 3 is 5581651794 compile time instruction count for iteration 4 is 5578742619 collecting compile time instruction count for aotdispatcher_inference_subclass_cpu compile time instruction count for iteration 0 is 8634984264 compile time instruction count for iteration 1 is 8633467573 compile time instruction count for iteration 2 is 8632182092 compile time instruction count for iteration 3 is 8632056925 compile time instruction count for iteration 4 is 8632543871 collecting compile time instruction count for aotdispatcher_training_subclass_cpu compile time instruction count for iteration 0 is 14737239311 compile time instruction count for iteration 1 is 14734346427 compile time instruction count for iteration 2 is 14736493730 compile time instruction count for iteration 3 is 14734121272 compile time instruction count for iteration 4 is 14733852882 ``` Pull Request resolved: pytorch#136759 Approved by: https://github.com/laithsakka ghstack dependencies: pytorch#136670
compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because: (1) it consists of a single input + many weights that are used sequentially (2) contains a mix of recompute vs non-recomputed ops (matmul + sin) (3) it is relatively simple from running locally: ``` collecting compile time instruction count for aotdispatcher_partitioner_cpu compile time instruction count for iteration 0 is 21764219181 compile time instruction count for iteration 1 is 12475020009 compile time instruction count for iteration 2 is 12463710140 compile time instruction count for iteration 3 is 12455676489 compile time instruction count for iteration 4 is 12451344330 ``` Pull Request resolved: pytorch#136760 Approved by: https://github.com/ezyang ghstack dependencies: pytorch#136670, pytorch#136759
Summary: sam_fast changes from timeout to fail_to_run after pytorch#136591, which "regressed" in a good way. Update the expected result file and continue investigating. Pull Request resolved: pytorch#136996 Approved by: https://github.com/ezyang
Before this change, test failed with unable to compile errors, as `bfloat16` requires explicit cast. Tested in pytorch#136987 Pull Request resolved: pytorch#136981 Approved by: https://github.com/Skylion007
…rch#136982) Just adds instantiation of the kernels and sometimes explicit cast. Tested in pytorch#136987 Pull Request resolved: pytorch#136982 Approved by: https://github.com/Skylion007 ghstack dependencies: pytorch#136981
By simply adding explicit instantiation Tested in pytorch#136987 Pull Request resolved: pytorch#136983 Approved by: https://github.com/Skylion007 ghstack dependencies: pytorch#136981, pytorch#136982
…pytorch#123514)" This reverts commit 6931c16. Reverted pytorch#123514 on behalf of https://github.com/huydhn due to Sorry for reverting your change but its test_cpu_repro test is failing in trunk https://hud.pytorch.org/pytorch/pytorch/commit/6931c1644afdba53e63ce5671455e4e1b7265dd9 ([comment](pytorch#123514 (comment)))
By adding explicit instantiation. Tested in pytorch#136987 Pull Request resolved: pytorch#136984 Approved by: https://github.com/Skylion007 ghstack dependencies: pytorch#136981, pytorch#136982, pytorch#136983
Pull Request resolved: pytorch#136985 Approved by: https://github.com/Skylion007 ghstack dependencies: pytorch#136981, pytorch#136982, pytorch#136983, pytorch#136984
Pull Request resolved: pytorch#136986 Approved by: https://github.com/Skylion007 ghstack dependencies: pytorch#136981, pytorch#136982, pytorch#136983, pytorch#136984, pytorch#136985
Pull Request resolved: pytorch#136901 Approved by: https://github.com/albanD
…ch#136972) Signed-off-by: Edward Z. Yang <[email protected]> Pull Request resolved: pytorch#136972 Approved by: https://github.com/Skylion007 ghstack dependencies: pytorch#136917, pytorch#136934, pytorch#136935
Summary: Some AOTI tensor constants may be model buffers that never needs to be updated. Differential Revision: D62777502 Pull Request resolved: pytorch#136770 Approved by: https://github.com/muchulee8
By migrating some of the workflows to Python-3.9 as 3.8 has been deprecated by pytorch#132138 Pull Request resolved: pytorch#137023 Approved by: https://github.com/ZainRizvi, https://github.com/janeyx99, https://github.com/seemethere, https://github.com/kit1980, https://github.com/Skylion007
…orch#134247) This PR is for supporting calling `parallelize_module` from within a model definition, making the model a parallel one. Calling `parallelize_module` is an alternative to maintaining a set of `ColumnWiseLinear`, `RowWiseLinear`, etc, while still being able to directly author a parallel model. (The motivation for authoring a parallel model is that there may be other distributed operations, which may not be easily captured by any module, see the forward function below. Alternatively speaking, the purpose is to exploit the expressiveness of DTensor -- we need to first create DTensors before calling ops on them. Having parallelized modules in model is one way of creating DTensors.) For example:
```
class FeedForward(nn.Module):
    def __init__(self, config: TransformerArgs) -> None:
        super().__init__()
        w1 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        w2 = nn.Linear(config.hidden_dim, config.dim, bias=False)
        w3 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        self.w1 = parallelize_module(w1, Colwise)
        self.w2 = parallelize_module(w2, Rowwise)
        self.w3 = parallelize_module(w3, Colwise)

    def forward(self, x: Tensor) -> Tensor:
        y: DTensor = self.w2(F.silu(self.w1(x)) * self.w3(x))
        # y is a DTensor with Partial placement; we can return it as is.
        return y
        # Or we can convert it to Replicate -- there is modeling flexibility here.
        return y.redistribute(Replicate())

with device_mesh:
    model = FeedForward(config)  # Now model is a model parallelized onto device_mesh

y = model(x)
```
The `device_mesh` actually used for `parallelize_module` would be retrieved from the ambient context. Calling `parallelize_module` from within model hierarchy also saves the use of *FQNs* as in the out-of-model annotation case. Pull Request resolved: pytorch#134247 Approved by: https://github.com/tianyu-l
Pull Request resolved: pytorch#136396 Approved by: https://github.com/amjames, https://github.com/eellison ghstack dependencies: pytorch#136055
…#137032) Summary: # Why The arguments are filtered out as they are just const in the compiled graph, but the logger still expects a non-None type # What When passing a filtered out arg (None) to the debug logger, just log that it's a filtered out argument, instead of throwing a Type error # Background pytorch#131594 Test Plan: - execute repro from pytorch#135584 (comment) with and without the edits Differential Revision: D63652564 Pull Request resolved: pytorch#137032 Approved by: https://github.com/angelayi
…cases (pytorch#137056) Fixes pytorch#132950 This fixes an issue in `torch/distributed/elastic/rendezvous/etcd_store.py` where the [get method](https://github.com/pytorch/pytorch/blob/v2.4.0/torch/distributed/elastic/rendezvous/etcd_store.py#L60) does not wait as expected when no keys have been written under the store prefix yet (and therefore the store prefix key does not exist). This was because the `_try_wait_get` method would error out immediately [here](https://github.com/alenawang/pytorch/blob/main/torch/distributed/elastic/rendezvous/etcd_store.py#L179) if the prefix was not found instead of continuing to the etcd watch. This was causing upstream issues where distributed jobs using etcd-v2 could not get past the initial rendezvous at all (details in issue pytorch#132950). We added a test demonstrating this issue and the fix. Without the fix the test fails with `etcd.EtcdKeyNotFound: Key not found : /torch/elastic/store` instead of waiting for the first key to be written; with the fix the test waits properly. Co-authored-by: tarat44 <[email protected]> Pull Request resolved: pytorch#137056 Approved by: https://github.com/fduwjj Co-authored-by: tarat44 <[email protected]>
Summary: Fixes pytorch#122429 Pull Request resolved: pytorch#137085 Approved by: https://github.com/eellison
All jobs have switched to Python 3.9. These 3.8 builds no longer necessary Pull Request resolved: pytorch#137141 Approved by: https://github.com/albanD
) Pull Request resolved: pytorch#137065 Approved by: https://github.com/drisspg, https://github.com/Skylion007 ghstack dependencies: pytorch#136826, pytorch#137043, pytorch#137049
As compiler optimizes it away anyway Pull Request resolved: pytorch#137122 Approved by: https://github.com/kit1980
For Traceable FSDP2, the most common use case is to have `fullgraph=False` for forward pass (to allow user-level graph breaks), and `fullgraph=True` for compiled autograd backward pass (required for queue_callback support). With `torch._dynamo.compiled_autograd=True`, previously we are not able to set different `fullgraph` config value for forward vs. backward pass, since `rebuild_ctx` just reuses the forward compile config as-is. This PR adds `torch._dynamo.config.compiled_autograd_kwargs_override` config to allow forcing `fullgraph=True` for CA Dynamo tracing. With this PR, we can remove standalone compiled autograd ctx manager usage in Traceable FSDP2 unit tests, and consolidate on using `torch._dynamo.compiled_autograd=True`. Test commands: - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_True` Pull Request resolved: pytorch#136967 Approved by: https://github.com/xmfan
…136943) Refactor distributed test code: - Fix TODO: Remove unused variable - Fix doc typo - Migrate deprecated method call `load_state_dict` and `save_state_dict` Pull Request resolved: pytorch#136943 Approved by: https://github.com/H-Huang
…h#137073) Summary: We skip the save_gpu_kernel if the kernel is being saved already. This gives us a more accurate Triton profiling result. The following trace shows before/after the change for a benchmarking of a trivial addmm: Before: [screenshot] After: [screenshot] We can see that before the change, the benchmarking includes two parts: (1) the overhead of our triton_heuristic call, which includes the save/get and the (expensive) hash computation, and (2) the exact computation of the Triton kernel. We see that (1) accounts for >50% of the time, which makes kernel selection for profiling choose aten kernels over Triton kernels. Test Plan: Existing OSS CI python test/inductor/test_cuda_cpp_wrapper.py Pull Request resolved: pytorch#137073 Approved by: https://github.com/desertfire
Accidental push with multiple commits. Closing this PR.
# Motivation

This PR is an extension of #131758. As described in #131758, these changes are looking to make distributed UTs more accessible to users of all device types. It is a demonstration of a few changes discussed by @kwen2501 and @jgong5 in the discussion for #131758 (#131758 (comment)).

This PR contains two types of changes. The first is to the common distributed folder, where we have added a new class derived from MultiProcessTestCase which helps abstract out the process group creation/deletion and other functionality for a given device. New generalized content can be added by deriving from this base class. It also includes other misc changes for Gaudi support.

The second changed file is test_functional_api.py, a test file in common distributed. This file is a POC for how we can use this new class to write more device-agnostic distributed test cases. The following changes have been made to test_functional_api.py:
- Functionality has been added to test for non-CUDA devices, using Intel HPU as an example.
- Multiple setup steps previously required by MultiProcessTestCase have been abstracted out.
- Misc adaptations to allow a general call to accelerators, adding test skips instead of explicitly skipping for multiple GPUs.
- Skip-if-HPU flags have been added to enable skipping a few multithreaded test cases which are not yet supported on HPUs.

NOTE: Within test_functional_api, there are tests which require the use of some multithreading functions which are not yet supported on HPUs. These have been skipped for HPU using the skipHPU decorator. I will be raising a separate PR to improve usability of said decorators in a device-agnostic setting in the manner suggested by @kwen2501 in a comment on this PR.

This PR is a cleaned-up version of a previous PR (#136988) which I closed due to human error. I have addressed some of the comments made by @kwen2501 in this as well.

Pull Request resolved: #138216
Approved by: https://github.com/kwen2501, https://github.com/guangyey
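For readers unfamiliar with the pattern, the sketch below illustrates roughly what such a device-agnostic base class could look like. It is a hedged approximation: the backend mapping and the method names (`create_pg`, `destroy_pg`) are assumptions for illustration, not the exact code added by the PR.

```python
# Sketch: a device-agnostic base class that derives the process-group backend
# from the device type instead of hard-coding CUDA + NCCL.
import torch
import torch.distributed as dist
from torch.testing._internal.common_distributed import MultiProcessTestCase


class DistributedTestBase(MultiProcessTestCase):
    # Illustrative mapping; the real implementation may compute this differently.
    DEVICE_TO_BACKEND = {"cpu": "gloo", "cuda": "nccl", "hpu": "hccl"}

    def setUp(self) -> None:
        super().setUp()
        self._spawn_processes()

    def create_pg(self, device: str) -> None:
        # Pick the backend from the device type and initialize the group.
        backend = self.DEVICE_TO_BACKEND.get(torch.device(device).type, "gloo")
        dist.init_process_group(
            backend=backend,
            rank=self.rank,
            world_size=self.world_size,
            store=dist.FileStore(self.file_name, self.world_size),
        )

    def destroy_pg(self) -> None:
        if dist.is_initialized():
            dist.destroy_process_group()
```

Tests deriving from such a class can then create and tear down the process group for whatever device they are instantiated on, without CUDA-specific calls.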
Motivation
This PR is an extension of #131758. As described in #131758, these changes are looking to make distributed UTs more accessible to users of all device types.
The UTs are currently specific to CUDA devices because explicit CUDA API calls are used.
To allow easier changes for non-CUDA devices, we introduce a new class, DistributedTestBase, derived from MultiProcessTestCase, which helps abstract out the process group creation/deletion and other functionality for a given device.
The tests can be instantiated per device using existing utilities such as instantiate_device_type_tests, which pass the device as an argument and accordingly create the process group with the right backend.
New generalized content can be added by deriving from this base class.
One such change has also been raised in this PR, specifically in the test_functional_api.py file. It has allowed for clean and easy adaptation for non-CUDA devices, as demonstrated.
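As a hedged illustration of the instantiation flow described above, a device-generic test might look roughly like the following. The test class, the chosen collective, and the base-class methods build on the DistributedTestBase sketch shown earlier and are assumptions, not code from the PR.

```python
# Sketch: one generic test class, instantiated once per available device type.
import torch
import torch.distributed as dist
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import run_tests


class TestCollectivesGeneric(DistributedTestBase):  # base class sketched above
    def test_all_reduce(self, device):
        self.create_pg(device)
        t = torch.ones(2, device=device)
        dist.all_reduce(t)  # default op is SUM, so every element becomes world_size
        self.assertEqual(t, torch.full_like(t, self.world_size))
        self.destroy_pg()


# Generates per-device classes (e.g. ...CPU, ...CUDA, ...HPU) and passes `device` in.
instantiate_device_type_tests(TestCollectivesGeneric, globals())

if __name__ == "__main__":
    run_tests()
```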
CC: @kwen2501 @wconstab @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @d4l3k @c-p-i-o @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @rec @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn