[cuda] introduce trace tracker callback in cache allocator #112238
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/112238
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (4 unrelated failures) As of commit f289e51 with merge base 9d09d29: UNSTABLE - the following jobs failed, but they were likely due to flakiness present on trunk and have been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D50726971
Force-pushed 4e2a581 to 3e1dcd0
Force-pushed 3e1dcd0 to 631637b
zdevito left a comment:
This looks good; I only have minor comments. It will need a test from within PyTorch to make sure the events get called.
Force-pushed 631637b to c8855f0
Force-pushed c8855f0 to 4826432
Addressed comments and added a unit test.
@pytorchbot label "topic: not user facing"
Force-pushed 4826432 to c5e8a61
Force-pushed c5e8a61 to 74ea813
Force-pushed 74ea813 to 0cc4515
Force-pushed 0cc4515 to 213be0b
Force-pushed 213be0b to b4e6ada
Force-pushed b4e6ada to d3a4d68
zdevito left a comment:
Looks good! Thanks for getting that test to run
[cuda] introduce trace tracker callback in cache allocator (pytorch#112238)

Summary: This patch prototypes a trace tracker callback mechanism based on existing TraceEntry records.
- It allows code external to the cache allocator to attach trace tracker callbacks.
- When a TraceEntry is recorded, it triggers all attached callbacks. Callbacks can selectively behave based on the trace action.
- **RISK**: The attached callback is called within an allocator call stack (e.g., a free during an allocate call). A deadlock may occur if the callback takes other locks that have an interdependency with the device allocator lock. It is the callback developer's responsibility to avoid any potential deadlock.
- **ADVICE**: The callback mechanism is designed **only for PyTorch internal use**. We should not expose it to the Python layer, because the Python GIL could cause a deadlock.

See the example in D50726970, which attaches NCCL register/deregister hooks via the trace tracker callback so that all CUDA segments allocated by the allocator can be registered to NCCL communicators before any NCCL communication happens. This enables fast zero-copy algorithms in NCCL.

Reviewed By: zdevito

Differential Revision: D50726971
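For illustration, here is a minimal self-contained sketch of the mechanism the summary describes: the allocator keeps a list of attached trackers and invokes them, while its lock is held, whenever a TraceEntry is recorded. All names below (TraceEntry, AllocatorTraceTracker, attachAllocatorTraceTracker, CachingAllocatorSketch, allocSegment, recordTrace) are stand-ins chosen for this sketch, not the exact declarations added by the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <mutex>
#include <vector>

// Illustrative stand-in for the allocator's existing trace record.
struct TraceEntry {
  enum Action { ALLOC, FREE_COMPLETED, SEGMENT_ALLOC, SEGMENT_FREE };
  Action action;
  int device;
  uintptr_t addr; // segment or block base address
  size_t size;
};

using AllocatorTraceTracker = std::function<void(const TraceEntry&)>;

class CachingAllocatorSketch {
 public:
  // Attach a callback that fires every time a TraceEntry is recorded.
  void attachAllocatorTraceTracker(AllocatorTraceTracker tracker) {
    std::lock_guard<std::mutex> guard(mutex_);
    trackers_.push_back(std::move(tracker));
  }

  // Simulates the allocator creating a new segment and recording a trace.
  void allocSegment(int device, uintptr_t addr, size_t size) {
    std::lock_guard<std::mutex> guard(mutex_); // the device allocator lock
    // ... cudaMalloc and segment bookkeeping would happen here ...
    recordTrace({TraceEntry::SEGMENT_ALLOC, device, addr, size});
  }

 private:
  // Called while the allocator lock is held: this is why a callback must not
  // take locks that can, directly or indirectly, wait on this one.
  void recordTrace(const TraceEntry& entry) {
    for (const auto& tracker : trackers_) {
      tracker(entry);
    }
  }

  std::mutex mutex_;
  std::vector<AllocatorTraceTracker> trackers_;
};

int main() {
  CachingAllocatorSketch allocator;
  allocator.attachAllocatorTraceTracker([](const TraceEntry& e) {
    if (e.action == TraceEntry::SEGMENT_ALLOC) {
      std::printf("segment allocated: %zu bytes on device %d\n", e.size, e.device);
    }
  });
  allocator.allocSegment(/*device=*/0, /*addr=*/0x1000, /*size=*/1 << 20); // fake address
  return 0;
}
```

The sketch also shows why the RISK bullet matters: the trackers run under the allocator lock, so a callback that re-enters the allocator (for example, by attaching another tracker) would self-deadlock.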
Force-pushed d3a4d68 to f289e51
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
[cuda] introduce trace tracker callback in cache allocator (pytorch#112238)

Summary: This patch prototypes a trace tracker callback mechanism based on existing TraceEntry records.
- It allows code external to the cache allocator to attach trace tracker callbacks.
- When a TraceEntry is recorded, it triggers all attached callbacks. Callbacks can selectively behave based on the trace action.
- **RISK**: The attached callback is called within an allocator call stack (e.g., a free during an allocate call). A deadlock may occur if the callback takes other locks that have an interdependency with the device allocator lock. It is the callback developer's responsibility to avoid any potential deadlock.
- **ADVICE**: The callback mechanism is designed **only for PyTorch internal use**. We should not expose it to the Python layer, because the Python GIL could cause a deadlock.

See the example in D50726970, which attaches NCCL register/deregister hooks via the trace tracker callback so that all CUDA segments allocated by the allocator can be registered to NCCL communicators before any NCCL communication happens. This enables fast zero-copy algorithms in NCCL.

Differential Revision: D50726971
Pull Request resolved: pytorch#112238
Approved by: https://github.com/zdevito
Summary:
This patch prototypes a trace tracker callback mechanism based on existing TraceEntry records.
See the example in D50726970, which attaches NCCL register/deregister hooks via the trace tracker callback so that all CUDA segments allocated by the allocator can be registered to NCCL communicators before any NCCL communication happens. This enables fast zero-copy algorithms in NCCL.
Differential Revision: D50726971
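To make the "callbacks can selectively behave based on the trace action" point concrete, here is a hedged consumer-side sketch in the spirit of the NCCL register/deregister hooks from D50726970. It reuses the illustrative TraceEntry and CachingAllocatorSketch types from the sketch earlier in this thread; the actual NCCL registration calls are only hinted at in comments, since they live in D50726970 rather than in this PR.

```cpp
#include <cstdio>
#include <unordered_map>

// A tracker that reacts only to segment-level actions; block-level ALLOC/FREE
// actions inside a segment are ignored.
void installSegmentHooks(CachingAllocatorSketch& allocator) {
  // Handle table keyed by segment base address. In D50726970 the value would
  // be whatever the NCCL registration call returns for that segment.
  static std::unordered_map<uintptr_t, int> registered;

  allocator.attachAllocatorTraceTracker([](const TraceEntry& e) {
    switch (e.action) {
      case TraceEntry::SEGMENT_ALLOC:
        // A new cudaMalloc'd segment: register it with the NCCL communicators
        // here so later collectives can use zero-copy paths.
        registered[e.addr] = /*placeholder handle*/ 1;
        std::printf("register segment %#zx (%zu bytes)\n", (size_t)e.addr, e.size);
        break;
      case TraceEntry::SEGMENT_FREE:
        // The segment is about to be returned to CUDA: undo the registration.
        registered.erase(e.addr);
        std::printf("deregister segment %#zx\n", (size_t)e.addr);
        break;
      default:
        break; // other actions: nothing to do
    }
  });
}
```

Because this hook runs inside the allocator call stack, the sketch keeps its own state in a plain map and takes no additional locks, in line with the RISK and ADVICE notes above.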