Check python REFCNT before deleting functions #16888
Conversation
facebook-github-bot
left a comment
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
apaszke
left a comment
That's some serious inversion of logic here. How is it that we never needed to check Python refcounts, but need to do it now? It should never be the case that you have owning Python references that don't increase the C++ refcount.
torch/csrc/autograd/function.cpp
Outdated
if (edge.function.use_count() == 1) {
  stack.emplace_back(std::move(edge.function));
auto & fn = edge.function;
// When using DataParallel, modules will be replicated multiple times, with
Module replication has nothing to do with the autograd graph. Every single replicated module will create an independent part of the graph, and use_counts should be updated accordingly.
I can verify that in the failing case mentioned in #16532, there are 3 shared_ptrs pointing to the same BroadcastBackward function object, and the use_count() on all 3 shared_ptrs is 1. I think it starts from here, where we create multiple shared_ptrs from the same raw pointer. It was not a problem prior to 89d56ae, because we relied on Python to determine when to delete a function. However, after 89d56ae, the C++ code removes function edges from the autograd graph based on the shared_ptr use_count.
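To make that failure mode concrete, here is a minimal, self-contained C++ sketch (not PyTorch code; BroadcastBackwardLike is a hypothetical stand-in): three shared_ptrs built from the same raw pointer each get their own control block, so each reports use_count() == 1 even though three owners exist.
#include <cassert>
#include <memory>

struct BroadcastBackwardLike {};  // hypothetical stand-in for the real autograd node

int main() {
  auto* raw = new BroadcastBackwardLike();

  // Each constructor call below creates an independent control block.
  std::shared_ptr<BroadcastBackwardLike> a(raw, [](BroadcastBackwardLike*) {});
  std::shared_ptr<BroadcastBackwardLike> b(raw, [](BroadcastBackwardLike*) {});
  std::shared_ptr<BroadcastBackwardLike> c(raw, [](BroadcastBackwardLike*) {});

  // Three owners exist, yet every use_count() reports 1, so a
  // "use_count() == 1 means last owner" check is wrong here.
  assert(a.use_count() == 1 && b.use_count() == 1 && c.use_count() == 1);

  delete raw;  // manual delete only because the demo deleters above are no-ops
  return 0;
}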
Functions inherit from enable_shared_from_this, so we should just replace this line with self->cdata->shared_from_this()
That was also what I thought, but after adding a print before this line, I got:
Edit: added BroadcastBackward function object address
I think it is because the
Ok, but that means we're not managing the memory correctly, because we have multiple
@apaszke any suggestion for fixes? Greg and I discussed using
Functions already inherit from
@apaszke I was worried about the comments below. Could you please elaborate some more? Correct me if I get it wrong: If we use
// Similar to shared_from_this. There's a problem that the Python object
// and its cdata depend on each other being alive, so we can't keep
// shared_ptrs as members, but we'd like to be able to manage the lifetime of
// the objects using shared_ptrs in the C++ graph. This returns a new
// shared_ptr, which will decrement the Python reference count when it's
// destructed. WARNING: it's generally not safe to create weak_ptrs from
// these shared_ptrs since multiple shared_ptrs may control the same underlying
// object.
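The quoted comment appears to describe a shared_ptr with a custom deleter that drops a Python reference; a minimal sketch of that pattern, with assumed names (getSharedPtrLike, self, cdata) and not the actual PyTorch implementation, might look like this:
#include <Python.h>
#include <memory>

struct Function {};  // stand-in for torch::autograd::Function

// Hypothetical helper: 'self' is the Python wrapper object, 'cdata' the C++ node it wraps.
std::shared_ptr<Function> getSharedPtrLike(PyObject* self, Function* cdata) {
  Py_INCREF(self);  // the returned shared_ptr co-owns the Python object
  return std::shared_ptr<Function>(cdata, [self](Function*) {
    // Runs when this shared_ptr group dies: release the Python reference
    // instead of deleting the C++ object directly. Real code must hold the
    // GIL before touching the refcount.
    Py_DECREF(self);
  });
}
Note that each call to such a helper creates a new control block for the same underlying object, which is exactly why the comment warns against making weak_ptrs from the returned shared_ptrs.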
Using Multi-GPU Functional
Traceback (most recent call last):
File "bug_test.py", line 32, in <module>
loss = gradient_penalty(multigpu_net, x)
File "bug_test.py", line 10, in gradient_penalty
output = netD(x)
File "/home/shenli/project/pytorch/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/home/shenli/project/pytorch/torch/nn/parallel/data_parallel.py", line 139, in forward
inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
File "/home/shenli/project/pytorch/torch/nn/parallel/data_parallel.py", line 150, in scatter
return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
File "/home/shenli/project/pytorch/torch/nn/parallel/scatter_gather.py", line 35, in scatter_kwargs
inputs = scatter(inputs, target_gpus, dim) if inputs else []
File "/home/shenli/project/pytorch/torch/nn/parallel/scatter_gather.py", line 28, in scatter
return scatter_map(inputs)
File "/home/shenli/project/pytorch/torch/nn/parallel/scatter_gather.py", line 15, in scatter_map
return list(zip(*map(scatter_map, obj)))
File "/home/shenli/project/pytorch/torch/nn/parallel/scatter_gather.py", line 13, in scatter_map
return Scatter.apply(target_gpus, None, dim, obj)
RuntimeError: bad_weak_ptr
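For context on the exception above: shared_from_this() only works if the object is already owned by some std::shared_ptr. Since C++17 calling it on an unowned object is guaranteed to throw std::bad_weak_ptr (earlier standards leave it undefined, though common implementations throw the same exception). A minimal, generic sketch, not PyTorch code:
#include <iostream>
#include <memory>

struct Node : std::enable_shared_from_this<Node> {};

int main() {
  Node on_stack;  // not owned by any shared_ptr
  try {
    auto sp = on_stack.shared_from_this();  // throws: no owning shared_ptr exists
  } catch (const std::bad_weak_ptr&) {
    std::cout << "bad_weak_ptr: no shared_ptr owns this object yet\n";
  }

  auto owned = std::make_shared<Node>();
  auto alias = owned->shared_from_this();  // fine: shares owned's control block
  std::cout << "use_count = " << owned.use_count() << "\n";  // prints 2
  return 0;
}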
#14960 is another bug report for this reference count issue.
@mrshenli where are those problematic
Let me try it. The stack trace is as follows:
replicate.py: replicate
one has been created previously
UPDATE: this comment was written before your change to use
Looking back at the history, it feels like since #1705, THPFunction objects are now completely managed by the refcount of their Python object; a shared pointer use count is meaningless.
@apaszke we actually want a different shared pointer here, in particular one that makes sure that each THPVariable -> THPFunction link contributes one refcount to the THPFunction object (the link between the two is via a shared pointer). I'm not sure we need to add Python-related logic to Functions, though.
Ok, while I agree that the current solution might regress #1705, having us check Python refcounts in pure C++ code (which should have no dependency on Python) is an example of the tail wagging the dog... For me it's clear that we absolutely need to use a single
@malvika2147 and I were looking at the refcount situation in PyFunction, and I think the whole ownership situation here is a bit screwy. Let us introduce our dramatis personae:
OK, do you see something wrong with this situation? The ownership here is totally backwards! It is PyFunction that should own the
Once we fix this, we can fix the stupidity, e.g., here:
// torch/csrc/autograd/python_variable.cpp
if (auto fn = dynamic_cast<PyFunction*>(v->cdata.grad_fn_unsafe())) {
// Create a new reference to the THPFunction. This ensures that ref count
// of the THPFunction is at least the number of referring THPVariables.
const auto output_nr = v->cdata.output_nr();
auto grad_fn = THPFunction_asFunction((THPFunction*)fn->obj);
v->cdata.set_gradient_edge({std::move(grad_fn), output_nr});
}
I also suspect that there is no need to have torch.autograd.Function be backed by an actual C++ class (unless we desperately want
It might be easier to fix this once we apply @yf225's patch at #22762, since I suspect the reason we got into this situation in the first place was the legacy autograd function API.
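Below is a sketch, with assumed names and not the actual fix, of the ownership inversion described above: the C++ node holds the owning reference to its Python wrapper, so the wrapper lives exactly as long as the autograd graph needs the node.
#include <Python.h>

struct Function { virtual ~Function() = default; };  // stand-in for the autograd base class

struct PyFunctionSketch : Function {
  PyObject* obj;  // the THPFunction wrapper (field name assumed)

  explicit PyFunctionSketch(PyObject* o) : obj(o) {
    Py_INCREF(obj);  // the C++ node owns a strong reference to its Python object
  }
  ~PyFunctionSketch() override {
    // Released only when the graph drops the node; real code would need to
    // hold the GIL here before decrementing the refcount.
    Py_DECREF(obj);
  }
};
With ownership in that direction, a single shared_ptr to the node in the graph keeps the Python object alive, and the per-THPVariable workaround in python_variable.cpp above would presumably no longer be needed.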
Thanks for the investigation. If you plan to work on this, feel free to abandon this PR or hijack it. It will take me a while to get back to this.
fixes #16532