[DTensor] Fix deadlock after fast cache clear #168069
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/168069
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit ff57ecf with merge base e5eb89e:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
```
git revert --no-commit 567dcdb 200156e 3d801a4 2034ca9 480b4ff f570e58
```

And Revert "[DTensor] Document fast-path dispatch (#168192)"

And Revert "[DTensor] Fix deadlock after fast cache clear (#168069)"

Signed-off-by: Edward Z. Yang <[email protected]>
ghstack-source-id: 25bd260
Pull-Request: #168264
```
git revert --no-commit 567dcdb 200156e 3d801a4 2034ca9 480b4ff f570e58
```

And Revert "[DTensor] Document fast-path dispatch (#168192)"

And Revert "[DTensor] Fix deadlock after fast cache clear (#168069)"

Reverts:
* #167860
* #167588
* #167475
* #166808
* #166372
* #168192
* #168069

Signed-off-by: Edward Z. Yang <[email protected]>
Pull Request resolved: #168264
Approved by: https://github.com/seemethere, https://github.com/malfet
This is the necessary fix for meta-pytorch/autoparallel#256.
Issue:
When we call `_clear_fast_path_sharding_prop_cache()` and then `get_thread_local_native_sharding_propagator_cache()`, the code hangs due to a deadlock.
Cause:
When you assign to a Python dict key that already exists:
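As a general illustration of that Python behavior (not the DTensor code, and not necessarily the exact mechanism behind this bug): in CPython, rebinding an existing dict key drops the dict's reference to the old value, so the old value's finalizer can run inside the assignment itself. The `CacheEntry` class and the `lock` below are hypothetical stand-ins, and the sketch uses a non-blocking acquire so it reports the would-be deadlock instead of hanging.

```python
import threading

events = []
# Hypothetical non-reentrant lock standing in for whatever guards a cache.
lock = threading.Lock()


class CacheEntry:
    """Hypothetical cache value whose finalizer wants the same lock."""

    def __init__(self, name):
        self.name = name

    def __del__(self):
        # A blocking acquire() here would hang if this thread already
        # holds the lock; a non-blocking attempt lets us observe that.
        got = lock.acquire(blocking=False)
        events.append((self.name, "lock free" if got else "lock already held"))
        if got:
            lock.release()


cache = {"key": CacheEntry("old")}

with lock:
    # The key already exists, so the dict releases its reference to the
    # old entry. With CPython reference counting, the old entry is
    # finalized right here, while the lock is still held.
    cache["key"] = CacheEntry("new")

print(events)  # in CPython: [('old', 'lock already held')]
```

If `__del__` used a plain blocking `lock.acquire()`, the assignment would never return, which is the general shape of a destructor-triggered deadlock.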
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci