
Conversation

@IvanKobzarev (Contributor) commented Oct 28, 2024

Stack from ghstack (oldest at bottom):

Reason:
Currently we perform multiple traversals over the tangents at runtime:

  • To check that their types and structure match what we guessed at tracing time
  • To coerce metadata
  • To coerce memory_format
  • To unwrap tensor subclasses (unwrap_tensor_subclass)

All of these traverse the tree of subclasses via __tensor_flatten__ calls.

Change:
Do everything in a single runtime traversal (including flattening).

Implementation details:

Add memory_format information to SubclassCreationMeta; for plain tensors, keep not only the (int) unwrapped_index but the memory_format as well.

Preparing memory_format is optional (controlled by with_memory_format=True).

Additionally, remove the unused subclass_utils.create_metadata_for_subclass, which has no usages inside torch and would otherwise need to be updated for the new logic.
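For illustration, a minimal sketch of the fused single-pass idea (PlainTensorMeta and SubclassMeta below are simplified stand-ins, not the exact AOTAutograd structures, and the coercion is reduced to a memory_format check):

from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union

import torch


@dataclass
class PlainTensorMeta:
    unwrapped_idx: int
    memory_format: Optional[torch.memory_format] = None


@dataclass
class SubclassMeta:
    expected_type: type
    inner_metas: List[Union[PlainTensorMeta, "SubclassMeta"]] = field(default_factory=list)


def process_runtime_tangent_sketch(x, meta) -> Tuple[object, List[torch.Tensor]]:
    # Single pass: verify the type guessed at trace time, coerce memory_format,
    # and flatten, all in one traversal.
    if not isinstance(x, torch.Tensor):
        return x, []
    if isinstance(meta, PlainTensorMeta):
        if meta.memory_format is not None and not x.is_contiguous(
            memory_format=meta.memory_format
        ):
            x = x.contiguous(memory_format=meta.memory_format)
        return x, [x]
    assert type(x) is meta.expected_type, (
        f"runtime tangent type {type(x)} differs from traced {meta.expected_type}"
    )
    # Recurse into the inner tensors returned by __tensor_flatten__.
    attrs, _ctx = x.__tensor_flatten__()
    leaves: List[torch.Tensor] = []
    for attr, inner_meta in zip(attrs, meta.inner_metas):
        _, inner_leaves = process_runtime_tangent_sketch(getattr(x, attr), inner_meta)
        leaves.extend(inner_leaves)
    # The real implementation rebuilds the subclass if any inner tensor changed;
    # this sketch just returns the original wrapper plus its flattened leaves.
    return x, leaves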

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

@pytorch-bot bot commented Oct 28, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139068

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (6 Unrelated Failures)

As of commit f3db59e with merge base 87f1990:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

IvanKobzarev added a commit that referenced this pull request Oct 28, 2024
ghstack-source-id: 1fcf78c
Pull Request resolved: #139068
@IvanKobzarev IvanKobzarev added the topic: not user facing and ciflow/periodic labels Oct 28, 2024

@bdhirsh (Contributor) commented Oct 28, 2024

I see some test failures?


@IvanKobzarev (Contributor Author) commented Oct 29, 2024

I see some test failures?

Yes, checking. Just missing parentheses for skipIfTorchDynamo() :)
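(For context: skipIfTorchDynamo in torch.testing._internal.common_utils is a decorator factory, so it has to be called; the test name below is made up for illustration.)

from torch.testing._internal.common_utils import skipIfTorchDynamo

@skipIfTorchDynamo()  # correct: call the factory, which returns the actual decorator
def test_runtime_tangents_sketch(self):
    ...

# @skipIfTorchDynamo  # wrong: the test function itself becomes the factory's msg argument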


# Not checking equality of ref and x as Exception is expected

# Partially addresses https://github.com/pytorch/pytorch/issues/106457
@skipIfTorchDynamo()
Contributor:

it sounds like prior to this PR, this test would work properly under dynamo, but now it does not. Why?

Contributor:

Hmm, if the answer is that dynamo blows up when trying to run directly on the new custom schema objects that we branch on at runtime, then I agree a skip here seems fine (it is unnecessary to get dynamo working on that). But I'd like a comment next to this @skip explaining exactly what we are not supporting in dynamo.

Contributor Author:

There is an error in symbolic shapes guard verbose printing that appeared after the tangents processing change:

https://gist.github.com/IvanKobzarev/339f6b0b1465de56731cb6d6d14f2a9f

@unittest.skipIf(
not torch.distributed.is_available(), "test requires torch distributed"
)
@skipIfTorchDynamo()
Contributor:

nit: test_dtensor_compile.py is probably a better fit for this test:

(1) it's testing AsyncCollectiveTensor, which is more of a distributed concept

(2) then we won't need to worry about the skipIfTorchDynamo logic, since the tests in that file won't involve dynamo running on the AOTAutograd code.

Contributor Author:

Agree, moved to test_dtensor_compile

*,
count_symints: bool = True,
with_memory_format: bool = False,
) -> List[Union[int, SubclassCreationMeta]]:
Contributor:

Can you help me understand why we want to sometimes not include memory_format when creating subclass meta? If there is a good reason for doing it sometimes and not others, a comment explaining exactly when it is / is not necessary would be nice.

Contributor Author:

My main goal was to avoid the overhead of deducing memory_format.
This could be especially painful when called during tracing on FakeTensors with symbolic shapes: in my experience, memory format checks produce hairy symbolic shape guards on strides (divisibility, equal to 1, remainder equals 0, etc.).

We use create_subclass_meta for inputs and outputs (in collect_metadata_analysis), and I have not seen any usage of memory_format for inputs.

If we need memory_format for inputs and outputs too, we can make it non-optional.

Contributor:

oh that's fair - we don't need the memory format info for inputs. Can you just mention that in a comment?
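To make the trade-off concrete, here is a hypothetical sketch of an opt-in memory_format capture with the kind of explanatory comment requested above (not the actual create_subclass_meta code):

from typing import Optional

import torch


def maybe_suggest_memory_format(
    t: torch.Tensor, with_memory_format: bool
) -> Optional[torch.memory_format]:
    # memory_format is only needed for tangents; it is not used for inputs, and
    # deducing it on FakeTensors with symbolic shapes can introduce hairy stride
    # guards (divisibility, equal to 1, remainder equals 0), so it is opt-in.
    if not with_memory_format:
        return None
    if t.dim() == 4 and t.is_contiguous(memory_format=torch.channels_last):
        return torch.channels_last
    return torch.contiguous_format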

(
AOTDispatchAutograd.coerce_runtime_tangent(
flat_processed_tangents = list(
itertools.chain.from_iterable(
@bdhirsh (Contributor) commented Oct 30, 2024:

have you had a chance to benchmark if the runtime overhead here nets out to being faster/slower than the original code? (I'd imagine that merging the looping over tangents into a single loop would be faster, although I'm also not sure how fast itertools.chain.from_iterable is).

Contributor Author:

I measured itertools.chain.from_iterable vs. sequential list.extend(); itertools.chain.from_iterable was insignificantly faster (< 1%).
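(A rough illustration of that comparison, not the exact benchmark; absolute numbers will vary with machine and list sizes.)

import itertools
import timeit

nested = [[i, i + 1, i + 2] for i in range(1000)]

def flatten_chain():
    return list(itertools.chain.from_iterable(nested))

def flatten_extend():
    out = []
    for leaves in nested:
        out.extend(leaves)
    return out

print("chain.from_iterable:", timeit.timeit(flatten_chain, number=10_000))
print("list.extend:        ", timeit.timeit(flatten_extend, number=10_000))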

Contributor Author:

Using an updated version of profiling PR #136478:

Processing runtime tangents for a recursive TwoTensor did not change (the difference is within measurement std):

average before: 76610ns
average after: 76800ns

This of course depends on how expensive the __tensor_flatten__ call is for the subclass; for TwoTensor it is cheap :)

def process_runtime_tangent(x, meta: Union[PlainTensorMeta, SubclassCreationMeta]):
if not isinstance(x, torch.Tensor):
return x
return x, [x]
Contributor:

I'm still trying to understand what the purpose of the second return argument of this function is. What do we need it for? (it looks like it's dropped in the outer-most call to process_runtime_tangents)

Contributor Author:

The current logic on tangents is:

tangents = all_args[TB, TE]
traverse_tangents_tree_to_check_type(tangents)
all_args = [traverse_subclass_tangents_coerce_metadata(all_args[i]) where i in [TB, TE]]
all_args = [traverse_subclass_tangents_coerce_memory_format(all_args[i]) where i in [TB, TE]]
all_args = traverse_subclass_unwrap(all_args)

We fuse all of the traversals that check/update into process_runtime_tangents, and we also fuse traverse_subclass_unwrap into it, doing the flattening at the same time as the checks/updates. The second element of the returned tuple holds the updated, flattened leaves for each tangent.

As a result we end up with only one subclass-tree traversal over the runtime tangents, using the second element of the tuple as the result of the unwrap:

processed_tangents = process_runtime_tangents(all_args[TB, TE])
processed_tangents_leaves = list(itertools.chain.from_iterable(pt[1] for pt in processed_tangents))
all_args = traverse_subclass_unwrap(all_args[:TB]) + processed_tangents_leaves + traverse_subclass_unwrap(all_args[TE+1:])

@bdhirsh (Contributor) left a comment:

looks mostly good - left a few comments


@pytorch-bot pytorch-bot bot added the oncall: distributed label Oct 31, 2024

return x
if is_traceable_wrapper_subclass(x):
runtime_meta = x.__tensor_flatten__()[1]
Contributor:

nit: I see we're calling __tensor_flatten__() twice, to get the metadata here and the inner keys later. If you think we can easily get away with a single call that seems better, but if not that's ok

Contributor Author:

Thanks. Yes, originally I thought we should call __tensor_flatten__ one more time after a potential coercion (e.g. a subclass type change). I will add a check: if x is unchanged, we do not need the extra __tensor_flatten__; only if a coercion happened do we call __tensor_flatten__ again.
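(A hedged sketch of that single-call idea; coerce below is a stand-in for the actual coercion helper, not the real AOTAutograd signature.)

import torch

def flatten_with_optional_recoerce(x: torch.Tensor, expected_meta, coerce):
    # Call __tensor_flatten__ once up front; only re-flatten if the coercion
    # replaced x with a different subclass instance.
    attrs, runtime_meta = x.__tensor_flatten__()
    coerced = coerce(x, runtime_meta, expected_meta)
    if coerced is not x:
        attrs, runtime_meta = coerced.__tensor_flatten__()
    return coerced, [getattr(coerced, attr) for attr in attrs]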


@IvanKobzarev (Contributor Author) commented:
@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Oct 31, 2024
@pytorchmergebot (Collaborator) commented:
Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
Pull Request resolved: pytorch#139068
Approved by: https://github.com/bdhirsh
@github-actions github-actions bot deleted the gh/IvanKobzarev/80/head branch December 1, 2024 02:21
Esquains pushed a commit to Esquains/study1 that referenced this pull request Dec 15, 2024

Labels

ciflow/inductor
ciflow/periodic (Trigger jobs ran periodically on master (periodic.yml) on the PR)
ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
release notes: AO frontend
topic: not user facing (topic category)


4 participants