Conversation

@huydhn
Contributor

@huydhn huydhn commented Oct 11, 2024

This test is currently failing in trunk when the memory leak check is enabled, for example https://github.com/pytorch/pytorch/actions/runs/11296206361/job/31422348823#step:22:1970. When testing locally, calling backward on a masked tensor always causes a memory leak until I clean up the data and the mask manually. This is probably related to this warning from masked tensor: "UserWarning: It is not recommended to create a MaskedTensor with a tensor that requires_grad. To avoid this, you can use data.clone().detach()". However, I don't know enough about the internal details here to go further, so let's just fix the test first.

Testing

PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 python test/test_maskedtensor.py TestBasicsCUDA.test_stack_cuda

passes and no longer warns about a memory leak.

The test itself came from #125262 (comment)
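
For context, here is a minimal sketch of the construction pattern the warning refers to, assuming the prototype torch.masked.masked_tensor factory (the names below are illustrative, not the exact code of the test):

    import torch
    from torch.masked import masked_tensor

    data = torch.randn(4, requires_grad=True)        # constructing directly from this emits the UserWarning
    mask = torch.tensor([True, False, True, False])

    # Workaround suggested by the warning text: detach the data before wrapping it,
    # so the MaskedTensor does not hold a tensor attached to the autograd graph.
    mt = masked_tensor(data.clone().detach(), mask)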

@pytorch-bot

pytorch-bot bot commented Oct 11, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/137815

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 11e67f5 with merge base 7c1d939:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing label Oct 11, 2024
@huydhn huydhn added the module: masked operators label Oct 11, 2024
@kit1980
Contributor

kit1980 commented Oct 11, 2024

The warning about requires_grad may be unrelated and can be easily fixed by removing requires_grad=True.
But requires_grad=True seems to have a purpose there; I don't understand what exactly.

@huydhn huydhn requested a review from cpuhrsch October 11, 2024 22:32
@huydhn
Contributor Author

huydhn commented Oct 11, 2024

@cpuhrsch I see that you are the reviewer of #125262 (comment) and would appreciate any insights you could share here.

@cpuhrsch
Contributor

cpuhrsch commented Oct 12, 2024

@huydhn - I'm not sure why this is happening. If the fix in this PR resolves the leak in the test I suggest we land it.

@huydhn
Contributor Author

huydhn commented Oct 12, 2024

@pytorchbot merge -f 'All slow tests have passed'

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@cpuhrsch
Contributor

cpuhrsch commented Oct 14, 2024

@huydhn - Thanks for landing the hotfix. I'll dig into this more tomorrow just in case it's a symptom of something worse. Thank you. Also cc @nowtryz

@nowtryz
Contributor

nowtryz commented Oct 14, 2024

Hi, thank you for the fix!

If requires_grad caused the memory leak in the test, I can look into it and manually require the grad on the masked tensor.

@cpuhrsch I did not consider it, but I was using masked tensors just for parts of my code, so gradients were passing through the masked part. Do you know where the leak may come from? Otherwise, we could manually delete the tensors in MaskedTensor.__del__?

Edit: never mind; as per #137890, __del__ may raise other issues.
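
For reference, a rough sketch of the "require the grad on the masked tensor" alternative mentioned above, assuming the masked_tensor factory accepts a requires_grad flag (illustrative only, not the fix that landed in #137890):

    import torch
    from torch.masked import masked_tensor

    data = torch.randn(4)                            # plain tensor, no requires_grad
    mask = torch.tensor([True, False, True, False])

    # Request grad on the MaskedTensor itself instead of on the underlying data tensor.
    mt = masked_tensor(data, mask, requires_grad=True)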

@huydhn
Contributor Author

huydhn commented Oct 14, 2024

Yeah, #137890 has the proper fix. The issue has always been there, I guess.

pytorchmergebot pushed a commit that referenced this pull request Oct 15, 2024
Note that this reverts the change from #137815 as well, which is not needed anymore!

Without this, you create an unbreakable reference cycle. It is unbreakable because part of the cycle goes through the autograd graph, which we cannot traverse.
Pull Request resolved: #137890
Approved by: https://github.com/atalman, https://github.com/huydhn, https://github.com/Skylion007
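
For anyone debugging a similar cycle, a generic way to check that a tensor is actually freed is a weakref probe. This is a simplified stand-in for illustration, not the PYTORCH_TEST_CUDA_MEM_LEAK_CHECK machinery used in CI:

    import gc
    import weakref

    import torch

    data = torch.randn(4, requires_grad=True)
    ref = weakref.ref(data)

    # ... use `data` (e.g. wrap it in a MaskedTensor, run backward), then drop all references ...
    del data
    gc.collect()

    # If something, such as an unbreakable reference cycle, still holds the tensor, ref() stays alive.
    assert ref() is None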
@github-actions github-actions bot deleted the fix-test_maskedtensor-test_stack_memory_leak branch November 14, 2024 02:06