
Conversation

@AnantGulati (Contributor) commented Sep 30, 2024

Motivation

This PR is an extension of #131758. As described there, these changes aim to make distributed UTs more accessible to users of all device types.

The UTs are currently specific to CUDA devices because they make explicit CUDA API calls.

To make it easier to adapt the tests for non-CUDA devices, we introduce a new class, DistributedTestBase, derived from MultiProcessTestCase, which abstracts out process group creation/deletion and other device-specific functionality.

The tests can be instantiated per device using existing utilities such as instantiate_device_type_tests, which passes the device as an argument so that the process group can be created with the appropriate backend.

New, device-generic tests can be added by deriving from this base class.

One such change is included in this PR: test_functional_api.py has been adapted, demonstrating a clean and straightforward migration path for non-CUDA devices.
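As a rough illustration of the intended usage (this is a sketch, not code from this PR: the import path for DistributedTestBase, the `create_pg` helper name, and the backend selection are assumptions), a device-agnostic test could look like:

```
# Hypothetical sketch only; DistributedTestBase's location and its create_pg
# helper are assumed here for illustration.
import torch
import torch.distributed as dist

from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_distributed import DistributedTestBase  # assumed path
from torch.testing._internal.common_utils import run_tests


class TestCollectives(DistributedTestBase):
    def test_all_reduce(self, device):
        # The base class is expected to pick the right backend for `device`
        # (e.g. nccl for cuda, gloo for cpu) and create the process group.
        self.create_pg(device)
        t = torch.ones(4, device=device)
        dist.all_reduce(t)
        self.assertEqual(t, torch.full((4,), float(self.world_size), device=device))
        dist.destroy_process_group()


# instantiate_device_type_tests generates a per-device variant of each test and
# passes the device string in as an argument.
instantiate_device_type_tests(TestCollectives, globals(), only_for=("cpu", "cuda"))

if __name__ == "__main__":
    run_tests()
```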

CC: @kwen2501 @wconstab @XilunWu @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @d4l3k @c-p-i-o @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @rec @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn

pytorch-bot bot commented Sep 30, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136988

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 3ebea74 with merge base 8dddd45:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla bot commented Sep 30, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

pytorch-bot bot added the "oncall: distributed" label Sep 30, 2024
@AnantGulati (Contributor, Author) commented:

@pytorchbot label "topic: not user facing"

pytorch-bot bot added the "topic: not user facing" label Sep 30, 2024
@drisspg drisspg requested a review from wconstab October 1, 2024 17:16
@kwen2501 (Collaborator) left a comment:


Thanks for the PR.
I didn't look closely, but it seems there are a lot of replacements from cuda to hpu. I wonder if this PR is a demo and not for landing?

Comment on lines 83 to 84
"importerror": TestSkip(88, "Test skipped due to missing import"),
"no_hpu": TestSkip(88, "HPU is not available."),

Any reason for using the same skip code?
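One way this could be resolved (purely illustrative; the specific number below is not from the PR) is to give each skip reason its own exit code:

```
"importerror": TestSkip(88, "Test skipped due to missing import"),
"no_hpu": TestSkip(89, "HPU is not available."),  # 89 is only an example of a distinct code
```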

Comment on lines 103 to 106
backend_feature["hpu"] = {"hccl"}
backend_feature["plugin"] = set()
if TEST_HPU:
backend_feature["hpu"] = {"hccl"}

Duplicated setting.
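A de-duplicated version of the quoted lines could simply be (sketch only):

```
backend_feature["plugin"] = set()
if TEST_HPU:
    backend_feature["hpu"] = {"hccl"}
```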

Collaborator commented:

Also cc @wesbland @pavanbalaji on the name.

```
)
import operator

import habana_frameworks.torch as ht
```

I am not so sure about this direct import. Is there a better way to do it?
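One common alternative is a guarded optional import, so the module still loads on machines without the Habana stack. A sketch (the TEST_HPU flag mirrors the one used elsewhere in this diff):

```
try:
    import habana_frameworks.torch as ht
    TEST_HPU = ht.hpu.is_available()
except ImportError:
    ht = None
    TEST_HPU = False
```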

Comment on lines 109 to 116
```
if not torch.cuda.is_available():
    sys.exit(TEST_SKIPS["no_cuda"].exit_code)
if not ht.hpu.is_available():
    sys.exit(TEST_SKIPS["no_hpu"].exit_code)
```

Why removing the cuda lines?
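If the intent is to support HPU in addition to CUDA rather than instead of it, a sketch that keeps both checks could be (here `device_type` is a hypothetical variable naming the device under test, and `ht` follows the guarded-import pattern sketched above):

```
if device_type == "cuda" and not torch.cuda.is_available():
    sys.exit(TEST_SKIPS["no_cuda"].exit_code)
if device_type == "hpu" and (ht is None or not ht.hpu.is_available()):
    sys.exit(TEST_SKIPS["no_hpu"].exit_code)
```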

Comment on lines 183 to 187
```
if torch.cuda.is_available() and torch.cuda.device_count() >= x:
if ht.hpu.is_available() and ht.hpu.device_count() >= x:
```

Same.

Comment on lines 1325 to 1368
```
return torch.cuda.device_count()
return ht.hpu.device_count()
```

Same
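In the same spirit, a device-agnostic helper could dispatch on the device type instead of hard-coding either backend (sketch only; `device_type` is a hypothetical argument and `ht` comes from the guarded import sketched earlier):

```
def device_count(device_type: str) -> int:
    if device_type == "cuda":
        return torch.cuda.device_count()
    if device_type == "hpu" and ht is not None:
        return ht.hpu.device_count()
    return 0
```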

desertfire and others added 20 commits October 2, 2024 14:27
…ch#136906)

Summary: Add a device type C shim interface to support test_open_device_registration in the ABI-compatible mode.

Pull Request resolved: pytorch#136906
Approved by: https://github.com/chenyang78
Summary: Switch test_cpu_cpp_wrapper and test_cuda_cpp_wrapper to test the ABI-compatible mode only. Fixed a missing Py_NewRef issue for python 3.9.

Pull Request resolved: pytorch#136904
Approved by: https://github.com/Yoggie9477, https://github.com/chenyang78
…pytorch#136670)

Fixes pytorch#136640

Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1.

We should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded appropriately on whether any ops performed broadcasting.

In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard:
```
Eq((64//((2048//(s3*((s2//s3))))))), 1)
```

I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True.

I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues:

(1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions

(2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing  `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though.

Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)`
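For readers picturing the approach, here is a minimal sketch of the guard scan described above (not the actual inductor code; a `shape_env.guards` list of guard objects carrying an `expr` attribute is assumed here):

```
import sympy


def statically_known_one(shape_env, size_expr) -> bool:
    # True if size_expr is literally 1, or is pinned to 1 by an Eq(LHS, 1) guard.
    if size_expr == 1:
        return True
    for guard in shape_env.guards:
        expr = guard.expr
        # Only guards of the form Eq(LHS, 1) whose LHS matches our size matter here.
        if isinstance(expr, sympy.Eq) and expr.rhs == 1 and expr.lhs == size_expr:
            return True
    return False
```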

Pull Request resolved: pytorch#136670
Approved by: https://github.com/ezyang
…ses) (pytorch#136759)

this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher:

(1) inference vs training code paths
(2) "subclasses" vs "no subclasses" codepaths

Also see pytorch#136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely)

I ran locally, and got these numbers on the 4 paths:
```
collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu
compile time instruction count for iteration 0 is 11692348671
compile time instruction count for iteration 1 is 3026287204
compile time instruction count for iteration 2 is 3011467318
compile time instruction count for iteration 3 is 3004485935
compile time instruction count for iteration 4 is 3003087410
collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu
compile time instruction count for iteration 0 is 6068003223
compile time instruction count for iteration 1 is 5585418102
compile time instruction count for iteration 2 is 5581856618
compile time instruction count for iteration 3 is 5581651794
compile time instruction count for iteration 4 is 5578742619
collecting compile time instruction count for aotdispatcher_inference_subclass_cpu
compile time instruction count for iteration 0 is 8634984264
compile time instruction count for iteration 1 is 8633467573
compile time instruction count for iteration 2 is 8632182092
compile time instruction count for iteration 3 is 8632056925
compile time instruction count for iteration 4 is 8632543871
collecting compile time instruction count for aotdispatcher_training_subclass_cpu
compile time instruction count for iteration 0 is 14737239311
compile time instruction count for iteration 1 is 14734346427
compile time instruction count for iteration 2 is 14736493730
compile time instruction count for iteration 3 is 14734121272
compile time instruction count for iteration 4 is 14733852882
```

Pull Request resolved: pytorch#136759
Approved by: https://github.com/laithsakka
ghstack dependencies: pytorch#136670
compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because:

(1) it consists of a single input + many weights that are used sequentially
(2) contains a mix of recompute vs non-recomputed ops (matmul + sin)
(3) it is relatively simple

from running locally:
```
collecting compile time instruction count for aotdispatcher_partitioner_cpu
compile time instruction count for iteration 0 is 21764219181
compile time instruction count for iteration 1 is 12475020009
compile time instruction count for iteration 2 is 12463710140
compile time instruction count for iteration 3 is 12455676489
compile time instruction count for iteration 4 is 12451344330
```

Pull Request resolved: pytorch#136760
Approved by: https://github.com/ezyang
ghstack dependencies: pytorch#136670, pytorch#136759
Summary: sam_fast changes from timeout to fail_to_run after pytorch#136591, which "regressed" in a good way. Update the expected result file and continue investigating.

Pull Request resolved: pytorch#136996
Approved by: https://github.com/ezyang
Before this change, the test failed with unable-to-compile errors, as `bfloat16` requires an explicit cast.
Tested in pytorch#136987
Pull Request resolved: pytorch#136981
Approved by: https://github.com/Skylion007
…rch#136982)

Just adds instantiation of the kernels and, where needed, explicit casts.
Tested in pytorch#136987
Pull Request resolved: pytorch#136982
Approved by: https://github.com/Skylion007
ghstack dependencies: pytorch#136981
Summary: Some AOTI tensor constants may be model buffers that never need to be updated.

Differential Revision: D62777502

Pull Request resolved: pytorch#136770
Approved by: https://github.com/muchulee8
…orch#134247)

This PR supports calling `parallelize_module` from within a model definition, making the model a parallel one.

Calling `parallelize_module` is an alternative to maintaining a set of `ColumnWiseLinear`, `RowWiseLinear`, etc, while still being able to directly author a parallel model.

(The motivation for authoring a parallel model is that there may be other distributed operations, which may not be easily captured by any module, see the forward function below. Alternatively speaking, the purpose is to exploit the expressiveness of DTensor -- we need to first create DTensors before calling ops on them. Having parallelized modules in model is one way of creating DTensors.)

For example:
```
class FeedForward(nn.Module):
    def __init__(self, config: TransformerArgs) -> None:
        super().__init__()
        w1 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        w2 = nn.Linear(config.hidden_dim, config.dim, bias=False)
        w3 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        self.w1 = parallelize_module(w1, Colwise)
        self.w2 = parallelize_module(w2, Rowwise)
        self.w3 = parallelize_module(w3, Colwise)

    def forward(self, x: Tensor) -> Tensor:
        y: DTensor = self.w2(F.silu(self.w1(x)) * self.w3(x))
        # y is a DTensor with Partial placement; we can return it as is.
        return y
        # Or we can convert it to Replicate -- there is modeling flexibility here.
        return y.redistribute(Replicate())

with device_mesh:
    model = FeedForward(config)
    # Now model is a model parallelized onto device_mesh

y = model(x)

```

The `device_mesh` actually used for `parallelize_module` would be retrieved from the ambient context.

Calling `parallelize_module` from within the model hierarchy also avoids the need for *FQNs*, which are required in the out-of-model annotation case.

Pull Request resolved: pytorch#134247
Approved by: https://github.com/tianyu-l
benjaminglass1 and others added 13 commits October 2, 2024 14:29
…#137032)

Summary:
# Why

The arguments are filtered out because they are constants in the compiled graph, but the logger still expects a non-None type.

# What

When a filtered-out arg (None) is passed to the debug logger, log that it is a filtered-out argument instead of raising a TypeError.
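A minimal sketch of that guard (names here are illustrative, not the actual logger code):

```
def debug_log_arg(log, name, value):
    # Constant args are filtered out of the compiled graph and arrive as None;
    # note that instead of trying to introspect them.
    if value is None:
        log.debug("%s: <filtered out constant arg>", name)
        return
    log.debug("%s: shape=%s dtype=%s", name, tuple(value.shape), value.dtype)
```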

# Background

pytorch#131594

Test Plan: - execute repro from pytorch#135584 (comment) with and without the edits

Differential Revision: D63652564

Pull Request resolved: pytorch#137032
Approved by: https://github.com/angelayi
…cases (pytorch#137056)

Fixes pytorch#132950

This fixes an issue in `torch/distributed/elastic/rendezvous/etcd_store.py` where the [get method](https://github.com/pytorch/pytorch/blob/v2.4.0/torch/distributed/elastic/rendezvous/etcd_store.py#L60) does not wait as expected when no keys have been written under the store prefix yet (and therefore the store prefix key does not exist). This was because the `_try_wait_get` method would error out immediately [here](https://github.com/alenawang/pytorch/blob/main/torch/distributed/elastic/rendezvous/etcd_store.py#L179) if the prefix was not found instead of continuing to the etcd watch.

This was causing upstream issues where distributed jobs using etcd-v2 could not get past the initial rendezvous at all (details in issue pytorch#132950).

We added a test demonstrating this issue and the fix. Without the fix the test fails with `etcd.EtcdKeyNotFound: Key not found : /torch/elastic/store` instead of waiting for the first key to be written; with the fix the test waits properly.
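To make the intended behavior concrete, here is a conceptual sketch of the wait loop (not the actual etcd_store.py code; the python-etcd calls are illustrative):

```
import time

import etcd  # python-etcd, the client used by EtcdStore


def wait_for_keys(client, prefix, keys, deadline):
    # Wait until all `keys` exist under `prefix`, tolerating a missing prefix.
    while True:
        try:
            node = client.read(prefix, recursive=True)
            found = {child.key for child in node.children}
            if all(f"{prefix}/{key}" in found for key in keys):
                return
        except etcd.EtcdKeyNotFound:
            # Nothing has been written under the prefix yet -- keep waiting
            # instead of erroring out immediately.
            pass
        if time.time() > deadline:
            raise LookupError(f"Timed out waiting for keys under {prefix}")
        time.sleep(0.1)
```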

Co-authored-by: tarat44 <[email protected]>

Pull Request resolved: pytorch#137056
Approved by: https://github.com/fduwjj

Co-authored-by: tarat44 <[email protected]>
All jobs have switched to Python 3.9. These 3.8 builds are no longer necessary.

Pull Request resolved: pytorch#137141
Approved by: https://github.com/albanD
For Traceable FSDP2, the most common use case is to have `fullgraph=False` for forward pass (to allow user-level graph breaks), and `fullgraph=True` for compiled autograd backward pass (required for queue_callback support).

With `torch._dynamo.compiled_autograd=True`, previously we were not able to set a different `fullgraph` config value for the forward vs. backward pass, since `rebuild_ctx` just reuses the forward compile config as-is. This PR adds the `torch._dynamo.config.compiled_autograd_kwargs_override` config to allow forcing `fullgraph=True` for CA Dynamo tracing.

With this PR, we can remove standalone compiled autograd ctx manager usage in Traceable FSDP2 unit tests, and consolidate on using `torch._dynamo.compiled_autograd=True`.
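As a usage sketch (the flag spellings follow the description above and should be treated as assumptions):

```
import torch
import torch.nn as nn

model = nn.Linear(8, 8)

# Forward pass: allow user-level graph breaks.
compiled_model = torch.compile(model, fullgraph=False)

# Backward pass: force fullgraph=True for compiled autograd's Dynamo tracing.
torch._dynamo.config.compiled_autograd = True
torch._dynamo.config.compiled_autograd_kwargs_override = {"fullgraph": True}

out = compiled_model(torch.randn(4, 8))
out.sum().backward()
```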

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_True`

Pull Request resolved: pytorch#136967
Approved by: https://github.com/xmfan
…136943)

Refactor distributed test code:
- Fix TODO: Remove unused variable
- Fix doc typo
- Migrate deprecated method call `load_state_dict` and `save_state_dict`

Pull Request resolved: pytorch#136943
Approved by: https://github.com/H-Huang
…h#137073)

Summary:
We skip save_gpu_kernel if the kernel has already been saved. This gives us a more accurate Triton profiling result. The following traces show before/after the change for a benchmark of a trivial addmm:

Before:
<img width="1255" alt="Screenshot 2024-09-23 at 10 26 53 AM" src="https://github.com/user-attachments/assets/5aea05ef-6ef0-464c-8da9-17b31c97b43a">

After:
<img width="910" alt="Screenshot 2024-09-23 at 10 27 03 AM" src="https://github.com/user-attachments/assets/488b7d4f-268f-41cf-8553-cb16ceeae118">

We can see that before the change, the benchmarking includes two parts:
(1) the overhead of our triton_heuristic call, which includes the save/get and the (expensive) hash computation, and
(2) the actual computation of the Triton kernel.

We see that (1) accounts for >50% of the time, which biases kernel selection during profiling toward aten kernels over Triton kernels.

Test Plan:
Existing OSS CI
python test/inductor/test_cuda_cpp_wrapper.py


Pull Request resolved: pytorch#137073
Approved by: https://github.com/desertfire
@AnantGulati AnantGulati marked this pull request as draft October 2, 2024 11:34
@AnantGulati (Contributor, Author) commented:

Accidental push with multiple commits. Closing this PR.
I will add the changes again in a clean new PR.

@AnantGulati AnantGulati closed this Oct 2, 2024
@AnantGulati AnantGulati deleted the AnantGulati_distributed_test_case branch October 2, 2024 12:13
pytorchmergebot pushed a commit that referenced this pull request Nov 18, 2024
# Motivation
This PR is an extension of #131758. As described there, these changes aim to make distributed UTs more accessible to users of all device types.

It demonstrates a few of the changes discussed by @kwen2501 and @jgong5 in the discussion on #131758 (#131758 (comment)).

This PR contains two types of changes. The first is to the common distributed folder, where we have added a new class derived from MultiProcessTestCase that abstracts out process group creation/deletion and other device-specific functionality.

New, device-generic tests can be added by deriving from this base class.
The PR also includes other miscellaneous changes for Gaudi support.

The second changed file is test_functional_api.py, a test file in common distributed. This file is a proof of concept for how the new class can be used to write more device-agnostic distributed test cases.

The following changes have been made to test_functional_api.py:
- Functionality has been added to test non-CUDA devices, using Intel HPU as an example.
- Multiple setup steps previously required by MultiProcessTestCase have been abstracted out.
- Miscellaneous adaptations allow accelerators to be called generically, adding test skips instead of explicitly skipping when multiple GPUs are required.
- Skipifhpu flags have been added to skip a few multithreaded test cases that are not yet supported on HPUs.

NOTE: Within test_functional_api.py, some tests require multithreading functions that are not yet supported on HPUs. These have been skipped for HPU using the skipHPU decorator.

I will be raising a separate PR to improve the usability of said decorators in a device-agnostic setting, in the manner suggested by @kwen2501 in a comment on this PR.

This PR is a cleaned-up version of a previous PR (#136988), which I closed due to human error. I have addressed some of the comments made by @kwen2501 here as well.

Pull Request resolved: #138216
Approved by: https://github.com/kwen2501, https://github.com/guangyey
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024

Labels

- module: cpu (CPU specific problem, e.g., perf, algorithm)
- module: dynamo
- module: inductor
- oncall: distributed (add this issue/PR to distributed oncall triage queue)
- open source
- release notes: quantization (release notes category)
- topic: not user facing (topic category)
