[DTensor] Fix deadlock after fast cache clear#168069

Closed
zpcore wants to merge 3 commits into main from piz/prop_cache_clean

zpcore (Member) commented Nov 18, 2025

This is the necessary fix for meta-pytorch/autoparallel#256.

Issue:

Calling _clear_fast_path_sharding_prop_cache() and then get_thread_local_native_sharding_propagator_cache() hangs because of a deadlock.

Cause:

Assigning to a Python dict key that already exists drops the dict's reference to the old value. If that was the last reference, the old value's destructor runs as part of the assignment:

thread_dict["__DTensor_fastpath_thread_cache_cleanup"] = old_capsule  // capsule #1 stored
...
clear_DTensor_sharding_propagator_cache()  // call to clear the cache
...
get_thread_local_native_sharding_propagator_cache() {
  std::lock_guard<std::mutex> lock(
        native_sharding_propagator_cache_cleanup_mutex);  // FIRST acquisition of the lock
  if (!native_sharding_propagator_cache_DO_NOT_USE.has_value()) {  // taken again because the cache was cleared
    ...
    // Replacing the entry releases old_capsule, which runs its destructor.
    // That destructor tries to acquire `native_sharding_propagator_cache_cleanup_mutex`
    // again while FIRST still holds it. std::mutex is not recursive, so the thread hangs.
    thread_dict["__DTensor_fastpath_thread_cache_cleanup"] = new_capsule  // SECOND acquisition attempt, before FIRST is released
  }
}
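
The hazard described above can be reproduced in miniature in pure Python. This is only an illustrative sketch: `Capsule`, `cleanup_lock`, `KEY`, and both helper functions are hypothetical stand-ins, not names from the PyTorch code, and it relies on CPython's reference counting to run `__del__` as soon as the dict entry is replaced. The destructor probes the lock with a non-blocking acquire so the script records the problem instead of hanging. The second helper shows one generic way to avoid the re-entrant acquisition; it is included for contrast and is not necessarily the approach this PR takes.

```python
import threading

# Hypothetical stand-in for native_sharding_propagator_cache_cleanup_mutex.
# threading.Lock is non-reentrant, like std::mutex.
cleanup_lock = threading.Lock()

# Records what each destructor observed.
events = []

KEY = "cleanup"  # illustrative key, not the real thread-dict key


class Capsule:
    """Hypothetical stand-in for a capsule whose destructor takes the lock."""

    def __init__(self, name):
        self.name = name

    def __del__(self):
        # A blocking acquire would hang forever if the current thread
        # already holds cleanup_lock, so probe without blocking instead.
        if cleanup_lock.acquire(blocking=False):
            events.append((self.name, "destructor acquired the lock"))
            cleanup_lock.release()
        else:
            events.append((self.name, "destructor would deadlock"))


def replace_while_locked(thread_dict):
    # Mirrors the problematic order: the old value is released, and its
    # destructor runs, while the lock is still held.
    with cleanup_lock:
        thread_dict[KEY] = Capsule("new")


def replace_outside_lock(thread_dict):
    # Generic remedy (an assumption, not taken from the PR): keep the old
    # value alive until after the lock has been released.
    with cleanup_lock:
        old = thread_dict.get(KEY)
        thread_dict[KEY] = Capsule("new")
    del old  # last reference dropped here, with the lock already released
```

Calling `replace_while_locked` on a dict that already holds a `Capsule` under `KEY` records a "would deadlock" event for the old capsule, while `replace_outside_lock` lets the old capsule's destructor take the lock normally.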

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @pragupta @msaroufim @dcci

pytorch-bot added the "oncall: distributed" label (Add this issue/PR to distributed oncall triage queue) Nov 18, 2025

pytorch-bot bot commented Nov 18, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/168069

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit ff57ecf with merge base e5eb89e:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot removed the "oncall: distributed" label (Add this issue/PR to distributed oncall triage queue) Nov 18, 2025

pytorch-bot bot commented Nov 18, 2025

The label oncall: distributed is only applicable to issues and has been removed. Please only use this label on issues.

zpcore requested a review from ezyang November 18, 2025 06:44
zpcore added the "topic: not user facing" label (topic category) Nov 18, 2025
zpcore requested review from wconstab and weifengpy November 18, 2025 17:13
albanD removed their request for review November 18, 2025 20:41
zpcore (Member, Author) commented Nov 19, 2025

@pytorchbot merge

pytorch-bot added the "ciflow/trunk" label (Trigger trunk jobs on your pull request) Nov 19, 2025

pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

ezyang added a commit that referenced this pull request Nov 20, 2025
```
git revert --no-commit 567dcdb 200156e 3d801a4 2034ca9 480b4ff f570e58
```

And Revert "[DTensor] Document fast-path dispatch (#168192)"
And Revert "[DTensor] Fix deadlock after fast cache clear (#168069)"

Signed-off-by: Edward Z. Yang <[email protected]>
ghstack-source-id: 25bd260
Pull-Request: #168264
pytorchmergebot pushed a commit that referenced this pull request Nov 21, 2025
```
git revert --no-commit 567dcdb 200156e 3d801a4 2034ca9 480b4ff f570e58
```

    And Revert "[DTensor] Document fast-path dispatch (#168192)"
    And Revert "[DTensor] Fix deadlock after fast cache clear (#168069)"

Reverts:
* #167860
* #167588
* #167475
* #166808
* #166372
* #168192
* #168069

Signed-off-by: Edward Z. Yang <[email protected]>

Pull Request resolved: #168264
Approved by: https://github.com/seemethere, https://github.com/malfet
JacobSzwejbka pushed a commit that referenced this pull request Dec 8, 2025
github-actions deleted the piz/prop_cache_clean branch December 20, 2025 02:16
Labels

ciflow/trunk (Trigger trunk jobs on your pull request), Merged, topic: not user facing (topic category)
