Conversation

@huydhn
Contributor

@huydhn huydhn commented Oct 11, 2024

This test is currently failing in trunk when the memory leak check is enabled, for example https://github.com/pytorch/pytorch/actions/runs/11296206361/job/31422348823#step:22:1970. When testing locally, calling backward on a masked tensor always causes a memory leak until I clean up the data and the mask manually. This is probably related to this warning from masked tensor: "UserWarning: It is not recommended to create a MaskedTensor with a tensor that requires_grad. To avoid this, you can use data.clone().detach()". However, I don't know enough about the internal details here to go further, so let's just fix the test first.

Testing

PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 python test/test_maskedtensor.py TestBasicsCUDA.test_stack_cuda

passes and no longer warns about a memory leak.

The test itself came from #125262 (comment)
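
For context, here is a minimal sketch of the construction pattern the warning refers to, assuming the prototype torch.masked.masked_tensor factory (the names below are illustrative, not the exact code of the test):

    import torch
    from torch.masked import masked_tensor

    data = torch.randn(4, requires_grad=True)        # constructing directly from this emits the UserWarning
    mask = torch.tensor([True, False, True, False])

    # Workaround suggested by the warning text: detach the data before wrapping it,
    # so the MaskedTensor does not hold a tensor attached to the autograd graph.
    mt = masked_tensor(data.clone().detach(), mask)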

@pytorch-bot

pytorch-bot bot commented Oct 11, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137815

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 11e67f5 with merge base 7c1d939:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing label Oct 11, 2024
@huydhn huydhn added the module: masked operators label Oct 11, 2024
@kit1980
Contributor

kit1980 commented Oct 11, 2024

The warning about requires_grad may be unrelated and can be easily fixed by removing requires_grad=True.
But requires_grad=True seems to have a purpose there; I don't understand what exactly.

@huydhn huydhn requested a review from cpuhrsch October 11, 2024 22:32
@huydhn
Contributor Author

huydhn commented Oct 11, 2024

@cpuhrsch I see that you are the reviewer of #125262 (comment) and would appreciate any insights you could share here.

@cpuhrsch
Contributor

cpuhrsch commented Oct 12, 2024

@huydhn - I'm not sure why this is happening. If the fix in this PR resolves the leak in the test I suggest we land it.

@huydhn
Contributor Author

huydhn commented Oct 12, 2024

@pytorchbot merge -f 'All slow tests have passed'

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@cpuhrsch
Contributor

cpuhrsch commented Oct 14, 2024

@huydhn - Thanks for landing the hotfix. I'll dig into this more tomorrow just in case it's a symptom of something worse. Thank you. Also cc @nowtryz

@nowtryz
Contributor

nowtryz commented Oct 14, 2024

Hi, thank you for the fix!

If requires_grad caused the memory leak in the test, I can look into it and manually require the grad on the masked tensor.

@cpuhrsch I did not consider it, but I was using masked tensors just for parts of my code, so gradients were passing through the masked part. Do you know where the leak may come from? Otherwise, we could manually delete the tensors in MaskedTensor.__del__?

Edit: never mind; as per #137890, __del__ may raise other issues.
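
For reference, a rough sketch of the "require the grad on the masked tensor" alternative mentioned above, assuming the masked_tensor factory accepts a requires_grad flag (illustrative only, not the fix that landed in #137890):

    import torch
    from torch.masked import masked_tensor

    data = torch.randn(4)                            # plain tensor, no requires_grad
    mask = torch.tensor([True, False, True, False])

    # Request grad on the MaskedTensor itself instead of on the underlying data tensor.
    mt = masked_tensor(data, mask, requires_grad=True)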

@huydhn
Contributor Author

huydhn commented Oct 14, 2024

Yeah, #137890 has the proper fix. The issue has always been there, I guess.

pytorchmergebot pushed a commit that referenced this pull request Oct 15, 2024
Note that this reverts the change from #137815 as well, which is not needed anymore!

Without this, you create an unbreakable reference cycle. It is unbreakable because part of the cycle goes through the autograd graph, which we cannot traverse.
Pull Request resolved: #137890
Approved by: https://github.com/atalman, https://github.com/huydhn, https://github.com/Skylion007
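
For anyone debugging a similar cycle, a generic way to check that a tensor is actually freed is a weakref probe. This is a simplified stand-in for illustration, not the PYTORCH_TEST_CUDA_MEM_LEAK_CHECK machinery used in CI:

    import gc
    import weakref

    import torch

    data = torch.randn(4, requires_grad=True)
    ref = weakref.ref(data)

    # ... use `data` (e.g. wrap it in a MaskedTensor, run backward), then drop all references ...
    del data
    gc.collect()

    # If something, such as an unbreakable reference cycle, still holds the tensor, ref() stays alive.
    assert ref() is None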
@github-actions github-actions bot deleted the fix-test_maskedtensor-test_stack_memory_leak branch November 14, 2024 02:06