Conversation


@ezyang ezyang commented Jul 17, 2019

Stack from ghstack:

In the original iteration of the patch, I used lock() everywhere to minimize
the amount of code I have to modify. In this patch, I now eliminate as many
lock()s as I can, when the caller is known to have a strong reference to the
PyFunction, and pass that directly.

Along the way, I also bulk up our error messages for checking the result
of the weak pointer dereference. Some of these cases can be triggered
by zany use of legacy autograd function API; might as well let people know
what they've done wrong.

Signed-off-by: Edward Z. Yang [email protected]
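The pattern the patch describes can be sketched in plain Python with the `weakref` module standing in for C++ `std::weak_ptr` (the `PyFunction` class and function names here are hypothetical illustrations, not the real PyTorch code):

```python
import weakref

class PyFunction:
    """Hypothetical stand-in for the C++ autograd graph node."""
    def name(self):
        return "MyFunction"

# Before: every helper re-dereferences the weak reference
# (the analogue of calling weak_ptr::lock() everywhere).
def name_via_lock(weak_fn):
    fn = weak_fn()  # analogous to weak_ptr::lock()
    if fn is None:
        # A descriptive, user-facing error instead of a bare assert.
        raise RuntimeError(
            "Attempted to use an autograd function whose underlying "
            "graph node has already been freed")
    return fn.name()

# After: callers known to hold a strong reference pass it directly,
# so no dereference (and no failure path) is needed here.
def name_direct(fn):
    return fn.name()

fn = PyFunction()
weak_fn = weakref.ref(fn)
assert name_via_lock(weak_fn) == "MyFunction"
assert name_direct(fn) == "MyFunction"

del fn  # last strong reference gone; the weak reference expires
try:
    name_via_lock(weak_fn)
except RuntimeError as e:
    print("raised:", e)
```

Passing the strong reference down the call chain both avoids the repeated lock-and-check and documents, in the signature, that the callee relies on the caller keeping the object alive.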

@pytorchbot added labels Jul 17, 2019: module: autograd (Related to torch.autograd, and the autograd engine in general), module: pybind (Related to our Python bindings / interactions with other Python libraries)
ezyang added a commit that referenced this pull request Jul 17, 2019

ghstack-source-id: 66b1aa2
Pull Request resolved: #22998
@ezyang ezyang requested a review from colesbury July 18, 2019 13:36
ezyang added a commit that referenced this pull request Jul 18, 2019

ghstack-source-id: 296f2e6
Pull Request resolved: #22998
@ezyang ezyang requested a review from apaszke July 18, 2019 19:43

namespace pybind11 { namespace detail {

// handle Python <-> torch::autograd::Function conversions
@ezyang (Contributor Author) commented on this diff:
I'm not actually sure if this is used anywhere, but it doesn't seem to cause anything to fail when I delete it.

@colesbury (Member) left a comment:

The accessors from Python need to properly handle the case where cdata is expired. For example, THPFunction_metadata.

Note that this isn't limited to "legacy" autograd functions. You can grab a grad_fn attribute off a variable and have it live longer than the variable and the rest of the autograd graph.

I find this difficult to review separately from #22983, since the earlier PR introduces undesirable behavior which is mostly fixed up here. I would find it easier to review if the two were combined.


ezyang commented Jul 18, 2019

The accessors from Python need to properly handle the case where cdata is expired. For example, THPFunction_metadata.

OK. The best way I could think to do this is to move metadata to live on THPFunction rather than PyFunction. Does this sound reasonable to you? I'll try this change tomorrow.

I find this difficult to review separately from #22983, since the earlier PR introduces undesirable behavior which is mostly fixed up here. I would find it easier to review if the two were combined.

I'm happy to squash. I'll do that tomorrow.

@colesbury (Member) commented:

OK. The best way I could think to do this is to move metadata to live on THPFunction rather than PyFunction. Does this sound reasonable to you? I'll try this change tomorrow.

That seems fine. I think it would also OK to raise an exception (but not an internal assertion), return an empty value, or return an empty value and warn. Some tests for the behavior would be good too.
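The kind of test suggested here could be sketched in plain Python, with a hypothetical `GradFn`/`Node` pair standing in for THPFunction and its weakly-referenced C++ node (these names and the error message are illustrative assumptions, not the real API):

```python
import weakref

class Node:
    """Hypothetical stand-in for the C++ graph node."""
    def __init__(self):
        self.metadata = {}

class GradFn:
    """Hypothetical stand-in for THPFunction: weak reference to the node."""
    def __init__(self, node):
        self._cdata = weakref.ref(node)

    @property
    def metadata(self):
        node = self._cdata()
        if node is None:
            # User-facing exception rather than an internal assertion.
            raise RuntimeError(
                "autograd node has been freed; keep a reference to the "
                "output variable if you need its metadata")
        return node.metadata

def test_metadata_raises_after_node_freed():
    node = Node()
    fn = GradFn(node)
    assert fn.metadata == {}   # node alive: accessor returns the dict
    del node                   # graph freed; the weak reference expires
    try:
        fn.metadata
    except RuntimeError:
        return                 # the desired, catchable behavior
    raise AssertionError("expected RuntimeError")

test_metadata_raises_after_node_freed()
print("ok")
```

The key property being tested: an expired `cdata` surfaces as a catchable exception with an actionable message, never as a process-aborting internal assert.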


ezyang commented Jul 19, 2019

Oh, well, raising an error is a lot easier to do, imma do that first :)

@ezyang ezyang requested a review from albanD July 19, 2019 15:35

ezyang commented Jul 19, 2019

cc'ing @albanD as you may have a better idea what to do about anomaly metadata.


ezyang commented Jul 19, 2019

OK, having done some testing, I feel a lot better about not "fixing" this properly. Take a look at this test program:

import torch
from torch.autograd import Function

class MyFunction(Function):
    @staticmethod
    def forward(ctx, x):
        return x 

    @staticmethod
    def backward(ctx, g):
        return g 

x = torch.zeros(1, requires_grad=True)
y = MyFunction.apply(x)
y.backward()
print(y.grad_fn.metadata)
g = y.grad_fn
del y 
print(g.metadata)

On my branch, you get:

{}
terminate called after throwing an instance of 'c10::Error'
  what():  cdata INTERNAL ASSERT FAILED at ../torch/csrc/autograd/python_function.cpp:982, please report a bug to PyTorch.  (THPFunction_metadata at ../torch/csrc/autograd/python_function.cpp:982)                                                         
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fbe7788ac65 in /data/users/ezyang/pytorch-tmp/torch/lib/libc10.so)
frame #1: THPFunction_metadata(THPFunction*, void*) + 0x126 (0x7fbe8febacb6 in /data/users/ezyang/pytorch-tmp/torch/lib/libtorch_python.so)
<omitting python frames>
frame #13: __libc_start_main + 0xf5 (0x7fbea48063d5 in /lib64/libc.so.6)                                                      

Aborted (core dumped)

The reason the first call is OK but the second is not is that y keeps the PyFunction alive, whereas g only keeps the THPFunction alive. So if you're a normal person and don't delete your variables while holding onto their grad_fn, you won't run into this bug. And the bug only happens for user-defined functions; built-in functions are handled correctly (because we apparently bind the Function to Python directly in that case).

The correct way to fix this is to make grad_fn be an owning reference to PyFunction. But I am lazy and don't want to fix that now.
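The ownership fix described here can be sketched with plain-Python stand-ins (hypothetical `Node`, `WeakGradFn`, `OwningGradFn` classes, not the real THPFunction/PyFunction): if the Python-visible grad_fn holds a strong reference, the node cannot expire while the grad_fn is reachable.

```python
import weakref

class Node:
    """Hypothetical stand-in for the C++ PyFunction graph node."""
    def __init__(self):
        self.metadata = {}

class WeakGradFn:
    """Current behavior: only a weak reference, so the node can expire."""
    def __init__(self, node):
        self._cdata = weakref.ref(node)

    @property
    def metadata(self):
        node = self._cdata()
        if node is None:
            raise RuntimeError("underlying graph node has been freed")
        return node.metadata

class OwningGradFn:
    """Proposed fix: an owning (strong) reference keeps the node alive."""
    def __init__(self, node):
        self._cdata = node

    @property
    def metadata(self):
        return self._cdata.metadata

node_a = Node()
weak_fn = WeakGradFn(node_a)
del node_a               # as with `del y` above: the node is freed
try:
    weak_fn.metadata     # expired weak reference
except RuntimeError as e:
    print("raised:", e)

node_b = Node()
own_fn = OwningGradFn(node_b)
del node_b               # own_fn still keeps the node alive
print(own_fn.metadata)   # accessor keeps working
```

The trade-off is that an owning grad_fn extends the lifetime of the graph node (and whatever it captures) for as long as the Python object lives, which is presumably why the weak-reference design existed in the first place.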


ezyang commented Jul 19, 2019

As requested by Sam, this PR has been squashed into #22983.

@ezyang ezyang closed this Jul 19, 2019
@facebook-github-bot facebook-github-bot deleted the gh/ezyang/240/head branch October 28, 2019 22:10
