DTensor dispatch fast path: C++ unwrap_to_op_info, OpSchema creation, & cached sharding prop#166371

Closed
swolchok wants to merge 10 commits into gh/swolchok/863/base from gh/swolchok/863/head

Conversation


@swolchok swolchok commented Oct 28, 2025

Stack from ghstack (oldest at bottom):

Incremental addition of C++ fast path, taking advantage of the DTensor
dispatch key to let us work with IValues without Python conversion for
the fast path. I tried to just port unwrap_to_op_info to C++, but that
didn't get much of a win; the nice thing here seems to be the fusion
of unwrap_to_op_info with recompute_comparison_key.

This + the following PR appear to reduce DTensor dispatch time for
detach() from 36-40 usec to 24-26 usec (using a benchmark similar to the one on #160580, on a Skylake machine).

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci
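The fusion the description credits with the win can be illustrated with a toy, self-contained C++ sketch. Everything here (`FakeDTensor`, `FakeSpec`, `OpInfo`, `unwrap_and_hash`) is a hypothetical stand-in, not PyTorch's actual DTensor internals; the point is only the shape of the optimization: one traversal that unwraps each argument to its local value *and* folds its sharding metadata into the sharding-prop cache key, rather than one pass for `unwrap_to_op_info` and a second for `recompute_comparison_key`.

```cpp
// Illustrative sketch only -- names are invented, not PyTorch's API.
#include <cstddef>
#include <functional>
#include <vector>

// Stand-ins for a wrapped tensor and its placement metadata.
struct FakeSpec { int placement_id; };
struct FakeDTensor { int local_value; FakeSpec spec; };

// Boost-style hash mixer, as used by hash_combine in the quoted hunk.
inline size_t hash_combine(size_t seed, size_t v) {
  return seed ^ (v + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

struct OpInfo {
  std::vector<int> local_args;  // unwrapped "local tensor" values
  size_t comparison_key_hash;   // cache key for sharding propagation
};

// Single pass: unwrap each arg AND accumulate the comparison key,
// instead of walking the argument list twice.
OpInfo unwrap_and_hash(const std::vector<FakeDTensor>& args) {
  OpInfo info{{}, 0};
  for (const auto& a : args) {
    info.local_args.push_back(a.local_value);
    info.comparison_key_hash = hash_combine(
        info.comparison_key_hash, std::hash<int>()(a.spec.placement_id));
  }
  // Mirror the PR's behavior change: fold the arg count into the key too.
  info.comparison_key_hash =
      hash_combine(info.comparison_key_hash, args.size());
  return info;
}
```

With IValues available under the DTensor dispatch key, this whole walk can stay in C++, which is where the description says the savings come from.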

@pytorch-bot
pytorch-bot bot commented Oct 28, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166371

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 3 New Failures, 2 Unrelated Failures

As of commit a25a8f8 with merge base 84776e1 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@swolchok swolchok added the release notes: distributed (dtensor) release notes category label Oct 31, 2025
@swolchok
Contributor Author

Just updated performance numbers in the description. I assume somebody landed something that improved performance on main, since both the baseline and the result are down. Raw reported benchmark timings in usec:

baseline (#166368): 38.55, 38.27, 39.98, 39.95, 36.39
this PR + following PR #166372: 24.28, 26.33, 25.30, 25.38, 25.92

swolchok added a commit that referenced this pull request Nov 3, 2025
…ur own call into Python for DTensor dispatch"

Another incremental step: don't deal with custom ops on the critical
path, let the dispatcher do that. Take control of calling into Python
on the critical path as well.

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta msaroufim dcci

[ghstack-poisoned]
swolchok added a commit that referenced this pull request Nov 3, 2025
… extend C++ fast path to local operator dispatch"

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta msaroufim dcci

[ghstack-poisoned]
swolchok added a commit that referenced this pull request Nov 3, 2025
…d creating Python OpSchema in the DTensor dispatch fast path"

All we need to do is move a few checks around.

cc H-Huang awgu wanchaol fegin fduwjj wz337 wconstab d4l3k pragupta msaroufim dcci

[ghstack-poisoned]
@swolchok swolchok requested review from XilunWu and wconstab November 3, 2025 23:29
hash_combine(
std::hash<c10::OperatorHandle>()(op),
comparison_key_hash),
args_schema_len)),
Contributor Author
note that including args_schema_len in the hash is a behavior change, but I believe it's an excellent one: args_schema_len was in the OpSchema.__eq__ implementation all along, so not including it in the hash would just have been leading to unnecessary collisions.
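To see why leaving `args_schema_len` out of the hash invites collisions, here is a minimal sketch (`hash_combine` follows the usual boost-style mixer; `schema_hash` and its parameters are illustrative, not the actual `OpSchema` code). Suppose the comparison key covers only the tensor args while `args_schema_len` also counts trailing non-tensor args: two schemas can then share an operator and comparison key yet differ in length, so a hash that omits the length puts keys that `__eq__` would distinguish into the same bucket.

```cpp
// Illustrative only -- not the real OpSchema hashing code.
#include <cstddef>

// Boost-style hash mixer.
inline size_t hash_combine(size_t seed, size_t v) {
  return seed ^ (v + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

// Hash a schema key from the operator hash and comparison key,
// optionally folding in the argument count as the PR now does.
size_t schema_hash(size_t op_hash, size_t comparison_key_hash,
                   size_t args_schema_len, bool include_len) {
  size_t h = hash_combine(op_hash, comparison_key_hash);
  return include_len ? hash_combine(h, args_schema_len) : h;
}
```

Without the length folded in, a 2-arg and a 3-arg call that agree on everything else hash identically and collide in the sharding-prop cache; with it, they hash apart.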

Khanaksahu pushed a commit to Khanaksahu/pytorch that referenced this pull request Nov 17, 2025

ghstack-source-id: 73573d5
Pull Request resolved: pytorch/pytorch#166371
@github-actions

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Jan 17, 2026
@github-actions github-actions bot closed this Feb 16, 2026

Labels

ciflow/inductor · oncall: distributed · open source · release notes: distributed (dtensor) · Stale
