DTensor dispatch fast path: C++ unwrap_to_op_info, OpSchema creation, & cached sharding prop #166371
swolchok wants to merge 10 commits into gh/swolchok/863/base
Conversation
… & cached sharding prop

Incremental addition of C++ fast path, taking advantage of the DTensor dispatch key to let us work with IValues without Python conversion for the fast path. I tried to just port unwrap_to_op_info to C++, but that didn't get much of a win; the real benefit seems to be fusing unwrap_to_op_info with recompute_comparison_key. This plus the following PR appear to reduce DTensor dispatch time for detach() from 43-46 usec (possibly 40-43 usec; no firm number written down due to noise) to roughly 33-36 usec (using a benchmark similar to the one on #160580). [ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166371
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV — there is 1 currently active SEV; if your PR is affected, please view it below.
❌ 3 New Failures, 2 Unrelated Failures as of commit a25a8f8 with merge base 84776e1.
NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Just updated performance numbers in the description. I assume somebody landed something that improved performance on main, since both the baseline and the result are down. Raw reported benchmark timings in usec — baseline (#166368): 38.55, 38.27, 39.98, 39.95, 36.39
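Raw per-trial timings like the ones quoted above typically come from a repeat-style microbenchmark. A minimal sketch of that harness (hypothetical — the actual benchmark from #160580 is not shown here, and the lambda below is a stand-in for dispatching detach() on a DTensor):

```python
import statistics
import timeit

def bench(fn, repeats=5, number=10_000):
    # Run fn `number` times per trial and report per-call microseconds
    # for each of `repeats` trials, mirroring the raw-timings style above.
    trials = timeit.repeat(fn, repeat=repeats, number=number)
    per_call_usec = [t / number * 1e6 for t in trials]
    return per_call_usec, statistics.median(per_call_usec)

# Stand-in workload; the real benchmark calls detach() on a DTensor.
timings, median_usec = bench(lambda: None)
```

Reporting all raw trials (rather than a single mean) makes run-to-run noise visible, which is why the description hedges its numbers as ranges.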
…ur own call into Python for DTensor dispatch

Another incremental step: don't handle custom ops on the critical path; let the dispatcher do that. Take control of calling into Python on the critical path as well. [ghstack-poisoned]
… extend C++ fast path to local operator dispatch

[ghstack-poisoned]
…d creating Python OpSchema in the DTensor dispatch fast path

All we need to do is move a few checks around. [ghstack-poisoned]
    hash_combine(
        std::hash<c10::OperatorHandle>()(op),
        comparison_key_hash),
    args_schema_len)),
Note that including args_schema_len in the hash is a behavior change, but I believe it's an excellent one: args_schema_len was in the OpSchema.__eq__ implementation all along, so leaving it out of the hash merely caused unnecessary collisions.
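The invariant at play is that objects comparing equal must hash equal; the converse is not required, so omitting an `__eq__` field from the hash is legal but guarantees collisions between unequal keys. A toy illustration (not the real OpSchema, just a minimal class showing the effect):

```python
class ToySchema:
    """Toy stand-in for a schema whose equality includes argument count."""

    def __init__(self, op, args_schema_len):
        self.op = op
        self.args_schema_len = args_schema_len

    def __eq__(self, other):
        # args_schema_len participates in equality...
        return (self.op, self.args_schema_len) == (other.op, other.args_schema_len)

    def __hash__(self):
        # ...but not in the hash: still a valid hash (equal objects hash
        # equal), yet schemas differing only in arity always collide.
        return hash(self.op)

a = ToySchema("aten::detach", 1)
b = ToySchema("aten::detach", 2)
assert a != b              # __eq__ distinguishes them...
assert hash(a) == hash(b)  # ...but the hash cannot: a guaranteed collision
```

In a hash-table-backed cache, such collisions force a full `__eq__` comparison on every colliding probe, which is exactly the overhead folding args_schema_len into the hash avoids.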
ghstack-source-id: 73573d5
Pull Request resolved: pytorch/pytorch#166371
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as stale.
Stack from ghstack (oldest at bottom):
Incremental addition of C++ fast path, taking advantage of the DTensor
dispatch key to let us work with IValues without Python conversion for
the fast path. I tried to just port unwrap_to_op_info to C++, but that
didn't get much of a win; the real benefit seems to be fusing
unwrap_to_op_info with recompute_comparison_key.
This plus the following PR appear to reduce DTensor dispatch time for
detach() from 36-40 usec to 24-26 usec (using a benchmark similar to the one on #160580, on a Skylake machine).
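In spirit, the cached sharding propagation fused into the fast path amounts to memoizing the propagation result on a key derived from the operator and the flattened argument schema, computed in one pass over the arguments. A hedged Python sketch (names like `ShardingPropCache` and the key construction are illustrative, not the real API):

```python
class ShardingPropCache:
    """Memoize sharding propagation on (op, comparison key, arg count).

    The C++ fast path builds an equivalent key directly from IValues,
    fusing the work of unwrap_to_op_info with recompute_comparison_key
    so the arguments are walked once instead of twice.
    """

    def __init__(self, propagate):
        self._propagate = propagate  # fallback: full sharding propagation
        self._cache = {}

    def lookup(self, op_name, args):
        # Toy comparison key: argument type names plus the argument count
        # (the real key also folds in placements, mesh, dtype, etc.).
        key = (op_name, tuple(type(a).__name__ for a in args), len(args))
        hit = self._cache.get(key)
        if hit is None:
            hit = self._cache[key] = self._propagate(op_name, args)
        return hit

calls = []
cache = ShardingPropCache(
    lambda op, args: calls.append(op) or f"sharding({op})"
)
cache.lookup("aten::detach", [object()])
cache.lookup("aten::detach", [object()])
assert len(calls) == 1  # second lookup hits the cache
```

The win measured above comes from this lookup running entirely in C++ on the hot path, with Python only entering the picture on a cache miss.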
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci