Skip to content

Conversation

@eellison
Copy link
Contributor

@eellison eellison commented Aug 21, 2025

Stack from ghstack (oldest at bottom):

This fixes an assertion we were running into in the memory planning about not having an acyclic graph. The repro is very long so hard to make local test of, but fixes repro I am looking at.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

@pytorch-bot
Copy link

pytorch-bot bot commented Aug 21, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/161205

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7299c21 with merge base 9668210 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

eellison added a commit that referenced this pull request Aug 21, 2025
ghstack-source-id: 27da1cd
Pull Request resolved: #161205
@eellison eellison requested review from wconstab and xuanzhang816 and removed request for wconstab August 21, 2025 21:26
cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben

[ghstack-poisoned]
eellison added a commit that referenced this pull request Aug 21, 2025
ghstack-source-id: ed09810
Pull Request resolved: #161205
@eellison eellison requested a review from IvanKobzarev August 21, 2025 21:31
@eellison eellison added topic: not user facing topic category ciflow/trunk Trigger trunk jobs on your pull request labels Aug 21, 2025
@eellison
Copy link
Contributor Author

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

for index, node in enumerate(nodes):
size_alloc = sum(buffer.mpi_buffer.size_alloc for buffer in node.get_outputs())
succ_nodes = node_to_succ_nodes[node]
pred_nodes = node_to_pred_nodes[node]
Copy link
Contributor

@xuanzhang816 xuanzhang816 Aug 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, even though this fixes the issue, I wonder why there are cyclic dependency in the first place.

I wonder if there are some errors in how the pred_nodes and succ_nodes are constructed, or if there are some issues in the dependency tracking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will send you the repro. the input GraphModule is not cyclic, so it must be something as a result of dependencies. this was blocking some of the training runs so i want to land this but agreed more investigation after is good.

wincent8 pushed a commit to wincent8/pytorch that referenced this pull request Aug 22, 2025
This fixes an assertion we were running into in the memory planning about not having an acyclic graph. The repro is very long so hard to make local test of, but fixes repro I am looking at.

Pull Request resolved: pytorch#161205
Approved by: https://github.com/IvanKobzarev, https://github.com/bdhirsh
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
This fixes an assertion we were running into in the memory planning about not having an acyclic graph. The repro is very long so hard to make local test of, but fixes repro I am looking at.

Pull Request resolved: pytorch#161205
Approved by: https://github.com/IvanKobzarev, https://github.com/bdhirsh
@github-actions github-actions bot deleted the gh/eellison/821/head branch September 21, 2025 02:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants