DTensor dispatch fast path: C++ unwrap_to_op_info, OpSchema creation, & cached sharding prop#166371

Closed
swolchok wants to merge 10 commits into gh/swolchok/863/base from gh/swolchok/863/head

Conversation


@swolchok swolchok commented Oct 28, 2025

Stack from ghstack (oldest at bottom):

Incremental addition of C++ fast path, taking advantage of the DTensor
dispatch key to let us work with IValues without Python conversion for
the fast path. I tried to just port unwrap_to_op_info to C++, but that
didn't get much of a win; the nice thing here seems to be the fusion
of unwrap_to_op_info with recompute_comparison_key.

This + the following PR appear to reduce DTensor dispatch time for
detach() from 36-40 usec to 24-26 usec (using a benchmark similar to the one on #160580, on a Skylake machine).

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci
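The fusion the description credits with the win can be illustrated with a toy, self-contained C++ sketch. Everything here (`FakeDTensor`, `FakeSpec`, `OpInfo`, `unwrap_and_hash`) is a hypothetical stand-in, not PyTorch's actual DTensor internals; the point is only the shape of the optimization: one traversal that unwraps each argument to its local value *and* folds its sharding metadata into the sharding-prop cache key, rather than one pass for `unwrap_to_op_info` and a second for `recompute_comparison_key`.

```cpp
// Illustrative sketch only -- names are invented, not PyTorch's API.
#include <cstddef>
#include <functional>
#include <vector>

// Stand-ins for a wrapped tensor and its placement metadata.
struct FakeSpec { int placement_id; };
struct FakeDTensor { int local_value; FakeSpec spec; };

// Boost-style hash mixer, as used by hash_combine in the quoted hunk.
inline size_t hash_combine(size_t seed, size_t v) {
  return seed ^ (v + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

struct OpInfo {
  std::vector<int> local_args;  // unwrapped "local tensor" values
  size_t comparison_key_hash;   // cache key for sharding propagation
};

// Single pass: unwrap each arg AND accumulate the comparison key,
// instead of walking the argument list twice.
OpInfo unwrap_and_hash(const std::vector<FakeDTensor>& args) {
  OpInfo info{{}, 0};
  for (const auto& a : args) {
    info.local_args.push_back(a.local_value);
    info.comparison_key_hash = hash_combine(
        info.comparison_key_hash, std::hash<int>()(a.spec.placement_id));
  }
  // Mirror the PR's behavior change: fold the arg count into the key too.
  info.comparison_key_hash =
      hash_combine(info.comparison_key_hash, args.size());
  return info;
}
```

With IValues available under the DTensor dispatch key, this whole walk can stay in C++, which is where the description says the savings come from.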

@pytorch-bot
pytorch-bot bot commented Oct 28, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166371

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 3 New Failures, 2 Unrelated Failures

As of commit a25a8f8 with merge base 84776e1 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@swolchok swolchok added the release notes: distributed (dtensor) release notes category label Oct 31, 2025
@swolchok
Contributor Author

Just updated performance numbers in the description. I assume somebody landed something that improved performance on main, since both the baseline and the result are down. Raw reported benchmark timings in usec:

baseline (#166368): 38.55, 38.27, 39.98, 39.95, 36.39
this PR + following PR #166372: 24.28, 26.33, 25.30, 25.38, 25.92

swolchok added a commit that referenced this pull request Nov 3, 2025
…ur own call into Python for DTensor dispatch"

Another incremental step: don't deal with custom ops on the critical
path, let the dispatcher do that. Take control of calling into Python
on the critical path as well.

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta msaroufim dcci

[ghstack-poisoned]
swolchok added a commit that referenced this pull request Nov 3, 2025
… extend C++ fast path to local operator dispatch"

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta msaroufim dcci

[ghstack-poisoned]
swolchok added a commit that referenced this pull request Nov 3, 2025
…d creating Python OpSchema in the DTensor dispatch fast path"

All we need to do is move a few checks around.

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta msaroufim dcci

[ghstack-poisoned]
@swolchok swolchok requested review from XilunWu and wconstab November 3, 2025 23:29
hash_combine(
std::hash<c10::OperatorHandle>()(op),
comparison_key_hash),
args_schema_len)),
Contributor Author
note that including args_schema_len in the hash is a behavior change, but I believe it's an excellent one: args_schema_len was in the OpSchema.__eq__ implementation all along, so not including it in the hash would just have been leading to unnecessary collisions.
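To see why leaving `args_schema_len` out of the hash invites collisions, here is a minimal sketch (`hash_combine` follows the usual boost-style mixer; `schema_hash` and its parameters are illustrative, not the actual `OpSchema` code). Suppose the comparison key covers only the tensor args while `args_schema_len` also counts trailing non-tensor args: two schemas can then share an operator and comparison key yet differ in length, so a hash that omits the length puts keys that `__eq__` would distinguish into the same bucket.

```cpp
// Illustrative only -- not the real OpSchema hashing code.
#include <cstddef>

// Boost-style hash mixer.
inline size_t hash_combine(size_t seed, size_t v) {
  return seed ^ (v + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

// Hash a schema key from the operator hash and comparison key,
// optionally folding in the argument count as the PR now does.
size_t schema_hash(size_t op_hash, size_t comparison_key_hash,
                   size_t args_schema_len, bool include_len) {
  size_t h = hash_combine(op_hash, comparison_key_hash);
  return include_len ? hash_combine(h, args_schema_len) : h;
}
```

Without the length folded in, a 2-arg and a 3-arg call that agree on everything else hash identically and collide in the sharding-prop cache; with it, they hash apart.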

Khanaksahu pushed a commit to Khanaksahu/pytorch that referenced this pull request Nov 17, 2025

ghstack-source-id: 73573d5
Pull Request resolved: pytorch/pytorch#166371
@github-actions

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Jan 17, 2026
@github-actions github-actions bot closed this Feb 16, 2026

Labels

ciflow/inductor · oncall: distributed · open source · release notes: distributed (dtensor) · Stale
