-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Avoid making node a successor/predecessor of itself #161205
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[ghstack-poisoned]
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161205
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 7299c21 with merge base 9668210 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben [ghstack-poisoned]
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
| for index, node in enumerate(nodes): | ||
| size_alloc = sum(buffer.mpi_buffer.size_alloc for buffer in node.get_outputs()) | ||
| succ_nodes = node_to_succ_nodes[node] | ||
| pred_nodes = node_to_pred_nodes[node] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, even though this fixes the issue, I wonder why there are cyclic dependency in the first place.
I wonder if there are some errors in how the pred_nodes and succ_nodes are constructed, or if there are some issues in the dependency tracking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i will send you the repro. the input GraphModule is not cyclic, so it must be something as a result of dependencies. this was blocking some of the training runs so i want to land this but agreed more investigation after is good.
This fixes an assertion we were running into in the memory planning about not having an acyclic graph. The repro is very long so hard to make local test of, but fixes repro I am looking at. Pull Request resolved: pytorch#161205 Approved by: https://github.com/IvanKobzarev, https://github.com/bdhirsh
This fixes an assertion we were running into in the memory planning about not having an acyclic graph. The repro is very long so hard to make local test of, but fixes repro I am looking at. Pull Request resolved: pytorch#161205 Approved by: https://github.com/IvanKobzarev, https://github.com/bdhirsh
Stack from ghstack (oldest at bottom):
This fixes an assertion we were running into in the memory planning about not having an acyclic graph. The repro is very long so hard to make local test of, but fixes repro I am looking at.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben