Conversation

@mrshenli (Contributor) commented Feb 8, 2019

fixes #16532

mrshenli requested review from gchanan and resistor, February 8, 2019 04:56
@facebook-github-bot (Contributor) left a comment

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@apaszke (Contributor) left a review

That's some serious inversion of logic here. How is it that we never needed to check Python refcounts, but need to do it now? It should never be the case that you have owning Python references that don't increase the C++ refcount.

Review comment on the changed code:

if (edge.function.use_count() == 1) {
stack.emplace_back(std::move(edge.function));
auto & fn = edge.function;
// When using DataParallel, modules will be replicated multiple times, with

@apaszke: Module replication has nothing to do with the autograd graph. Every replicated module will create an independent part of the graph, and use_counts should be updated accordingly.

@mrshenli (Contributor, Author) replied Feb 8, 2019

I can verify that in the failing case mentioned in #16532, there are 3 shared_ptrs pointing to the same BroadcastBackward function object, and use_count() on all 3 shared_ptrs is 1. I think it starts from here, where we create multiple shared_ptrs from the same raw pointer. It was not a problem prior to 89d56ae, because we relied on Python to determine when to delete a function. However, after 89d56ae, the C++ code removes function edges from the autograd graph based on shared_ptr use_count.
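
To illustrate the failure mode described above, here is a minimal standalone sketch (not PyTorch code): constructing several shared_ptrs from the same raw pointer gives each one its own control block, so every use_count() stays at 1 even though three owners exist. The no-op deleter stands in for the Py_DECREF-style deleter used in practice and avoids a double free in this demo.

    #include <iostream>
    #include <memory>

    struct BroadcastBackwardLike {};  // stand-in for the real autograd function

    int main() {
      auto* raw = new BroadcastBackwardLike();
      auto noop = [](BroadcastBackwardLike*) {};  // do not delete through these owners
      // Each shared_ptr wraps the same raw pointer but owns a separate control block.
      std::shared_ptr<BroadcastBackwardLike> a(raw, noop);
      std::shared_ptr<BroadcastBackwardLike> b(raw, noop);
      std::shared_ptr<BroadcastBackwardLike> c(raw, noop);
      std::cout << a.use_count() << " " << b.use_count() << " " << c.use_count()
                << std::endl;  // prints "1 1 1", matching the observation above
      delete raw;
      return 0;
    }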

@apaszke (Contributor) replied Feb 8, 2019

Functions inherit from enable_shared_from_this, so we should just replace this line with self->cdata->shared_from_this()

@mrshenli (Contributor, Author) commented Feb 8, 2019

It should never be the case that you have owning Python references that don't increase the C++ refcount.

That was also what I thought, but after adding a print before this line, I got:

Single GPU Functional
Loss: -0.1287534236907959
Grad: [tensor([[0.2500, 0.2500, 0.2500, 0.2500]], device='cuda:0')]

Multi-GPU Functional
BroadcastBackward@0x56135dc84c50 shared_ptr use_count is 1, python refcnt is 3
BroadcastBackward@0x56135dc84c50 shared_ptr use_count is 2, python refcnt is 2
Loss: -0.1287534236907959
Grad: [tensor([[0.2500, 0.2500, 0.2500, 0.2500]], device='cuda:0'), tensor([0.], device='cuda:0')]

Multi-GPU Inline
BroadcastBackward@0x56135dc84c50 shared_ptr use_count is 1, python refcnt is 2
BroadcastBackward@0x56135dc84c50 shared_ptr use_count is 1, python refcnt is 1
Loss: -0.1287534236907959
Grad: [tensor([[0.2500, 0.2500, 0.2500, 0.2500]], device='cuda:0'), tensor([0.], device='cuda:0')]
BroadcastBackward@0x56139536c600 shared_ptr use_count is 1, python refcnt is 3
BroadcastBackward@0x56139536c600 shared_ptr use_count is 1, python refcnt is 2
BroadcastBackward@0x56139536c600 shared_ptr use_count is 2, python refcnt is 1
BroadcastBackward@0x56139536c600 shared_ptr use_count is 1, python refcnt is 1

Edit: added BroadcastBackward function object address

It still looks weird to me that the inline case is playing with two different BroadcastBackward objects. @apaszke, any thoughts?

I think it is because the loss object created in the "Multi-GPU Functional" case was deleted when the variable was reassigned in the "Multi-GPU Inline" case.

@apaszke (Contributor) commented Feb 8, 2019

Ok, but that means we're not managing the memory correctly, because we have multiple shared_ptrs that point to the same object, but use different control blocks! This is a much more serious bug, and while this patch might be a sufficient workaround for the issue you linked, there's a much bigger issue causing those problems.

@mrshenli (Contributor, Author) commented Feb 8, 2019

@apaszke any suggestions for a fix? Greg and I discussed using enable_shared_from_this, but it seems the implementation explicitly avoids shared_from_this to prevent holding a shared_ptr in PyFunction.

@apaszke (Contributor) commented Feb 8, 2019

Functions already inherit from enable_shared_from_this. Can't we just use that?

@mrshenli (Contributor, Author) commented Feb 8, 2019

@apaszke I was worried about the comment below. Could you please elaborate a bit more? Correct me if I'm wrong: if we use shared_from_this, PyFunction will hold a weak_ptr as a member (via the enable_shared_from_this base), which requires that a shared_ptr to the PyFunction is created and held somewhere. Do we have that shared_ptr created elsewhere? If not, will the behavior be undefined?

// Similar to shared_from_this. There's a problem that the Python object
// and its cdata depend on each other being alive, so we can't keep
// shared_ptrs as members, but we'd like to be able to manage the lifetime of
// the objects using shared_ptrs in the C++ graph. This returns a new
// shared_ptr, which will decrement the Python reference count when it's
// destructed. WARNING: it's generally not safe to create weak_ptrs from
// these shared_ptrs since multiple shared_ptrs may control the same underlying
// object.
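
For reference, a minimal standalone sketch (not PyTorch code) of the concern: shared_from_this() is only valid when some shared_ptr already owns the object. Since C++17 it throws std::bad_weak_ptr otherwise; before C++17 the behavior is undefined (though common implementations also threw).

    #include <iostream>
    #include <memory>

    struct Fn : std::enable_shared_from_this<Fn> {};

    int main() {
      Fn* raw = new Fn();                     // no shared_ptr owns this object yet
      try {
        auto sp = raw->shared_from_this();    // internal weak_ptr is empty -> throws
      } catch (const std::bad_weak_ptr&) {
        std::cout << "bad_weak_ptr: no owning shared_ptr exists yet" << std::endl;
      }

      std::shared_ptr<Fn> owner(raw);         // now an owning shared_ptr exists
      auto sp2 = owner->shared_from_this();   // fine: shares the same control block
      std::cout << owner.use_count() << std::endl;  // prints 2
      return 0;
    }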

@mrshenli (Contributor, Author) commented Feb 8, 2019

Using shared_from_this in THPFunction_asFunction does not work, because there is no shared_ptr pointing to it:

Multi-GPU Functional
Traceback (most recent call last):
  File "bug_test.py", line 32, in <module>
    loss = gradient_penalty(multigpu_net, x)
  File "bug_test.py", line 10, in gradient_penalty
    output = netD(x)
  File "/home/shenli/project/pytorch/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/shenli/project/pytorch/torch/nn/parallel/data_parallel.py", line 139, in forward
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
  File "/home/shenli/project/pytorch/torch/nn/parallel/data_parallel.py", line 150, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
  File "/home/shenli/project/pytorch/torch/nn/parallel/scatter_gather.py", line 35, in scatter_kwargs
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
  File "/home/shenli/project/pytorch/torch/nn/parallel/scatter_gather.py", line 28, in scatter
    return scatter_map(inputs)
  File "/home/shenli/project/pytorch/torch/nn/parallel/scatter_gather.py", line 15, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/home/shenli/project/pytorch/torch/nn/parallel/scatter_gather.py", line 13, in scatter_map
    return Scatter.apply(target_gpus, None, dim, obj)
RuntimeError: bad_weak_ptr

@mrshenli (Contributor, Author) commented:

As multiple people have reported hitting this bug, shall we reach a consensus on the fix? @apaszke @gchanan @resistor @ssnl

@mrshenli (Contributor, Author) commented:

#14960 is another bug report for this reference count issue.

@apaszke (Contributor) commented Feb 19, 2019

@mrshenli where are those problematic autograd::Function instances created? One thing we could try is to use shared_from_this() and create a new shared_ptr if that fails. That way we'd always have a single one pointing to that object.
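
A minimal sketch of that fallback idea, assuming C++17 semantics for shared_from_this() (this is not the actual PyTorch change, and the plain delete stands in for the Py_DECREF-based deleter used in practice):

    #include <memory>

    struct Fn : std::enable_shared_from_this<Fn> {};

    std::shared_ptr<Fn> as_shared(Fn* raw) {
      try {
        return raw->shared_from_this();             // reuse the existing control block
      } catch (const std::bad_weak_ptr&) {
        // No shared_ptr owns the object yet; create the first one. This also
        // initializes the enable_shared_from_this weak reference, so later
        // calls to shared_from_this() will reuse this control block.
        return std::shared_ptr<Fn>(raw, [](Fn* f) { delete f; });
      }
    }

    int main() {
      Fn* raw = new Fn();
      auto a = as_shared(raw);   // creates the first owner
      auto b = as_shared(raw);   // reuses a's control block via shared_from_this()
      return a.use_count() == 2 ? 0 : 1;
    }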

@mrshenli (Contributor, Author) replied to @apaszke:

One thing we could try is to use shared_from_this() and create a new shared_ptr if that fails. That way we'd always have a single one pointing to that object.

Let me try it.

The stack trace is as follows.

replicate.py: replicate
python_function.cpp: THPFunction_apply -> process_outputs -> _wrap_outputs
python_variable.cpp: THPVariable_NewWithVar -> THPFunction_asFunction

@albanD (Collaborator) commented Feb 19, 2019

UPDATE: this comment was written before your change to use shared_from_this(). As you will see below, I think it will break the fix from #1705.

Looking back at the history, it feels like since #1705, THPFunctions are completely managed by the refcount of their Python object, so a shared pointer use count is meaningless.
Also, the "Similar to shared_from_this." comment is outdated and should be removed. It was the implementation before #1705 that was similar to it (storing a weak_ptr).

@apaszke we actually want a different shared pointer here. In particular, we want to make sure that each THPVariable -> THPFunction link contributes one refcount to the THPFunction object (the link between the two is via a shared pointer).

I'm not sure we need to add Python-related logic to Functions, though.
We can have a use_count() method on Function that counts how many smart pointers point to it.

  • For a regular Function, it returns 1, since only a single shared pointer ever uses it and we don't do raw-pointer tricks with it.
  • For a PyFunction, it returns Py_REFCNT(obj). In this case there is always a THPFunction associated with the PyFunction, stored in obj, and its refcount represents how many uses there are (shared pointers used in C++ plus actual Python references).
    Based on this, you can delete a Function if and only if Function::use_count() == 1 and you are the only one using your own smart pointer. The condition would become fn->use_count() == 1 && fn.use_count() == 1: the Function is referenced by only a single smart pointer, and I hold the only reference through that smart pointer (see the sketch below).
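
A minimal standalone sketch of this proposal (not the actual PyTorch code); a plain counter stands in for Py_REFCNT(obj) so the example compiles without the CPython headers:

    #include <memory>

    struct Function {
      virtual ~Function() = default;
      // How many logical owners the function has, independent of shared_ptr bookkeeping.
      virtual long use_count() const { return 1; }  // regular Function: a single owner
    };

    struct PyFunctionLike : Function {
      long py_refcnt = 1;  // stand-in for Py_REFCNT(obj) on the associated THPFunction
      long use_count() const override { return py_refcnt; }
    };

    // Deletion check as proposed: the object reports a single logical owner AND this
    // shared_ptr's control block is the only one keeping the C++ object alive.
    bool can_delete(const std::shared_ptr<Function>& fn) {
      return fn->use_count() == 1 && fn.use_count() == 1;
    }

    int main() {
      auto fn = std::shared_ptr<Function>(new PyFunctionLike());
      return can_delete(fn) ? 0 : 1;  // deletable: one logical owner, one shared_ptr
    }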

@mrshenli (Contributor, Author) replied:

@albanD I agree that the current fix using shared_from_this() breaks what #1705 tries to achieve. Your proposed solution SGTM. I will wait 1-2 days for others to comment. If there are no objections, I will revise this PR. Thanks!

@apaszke (Contributor) commented Feb 23, 2019

Ok, while I agree that the current solution might regress #1705, having us check Python refcounts in pure C++ code (which should have no dependency on Python) is a case of the tail wagging the dog... For me it's clear that we absolutely need to use a single shared_ptr, or we will have similar bugs in the future too. What we should do instead is have a more sophisticated deleter. In particular, it should no longer be a simple invocation of Py_DECREF, but an object that stores how many decrefs it should perform when it is destroyed. We can bump this by doing something like py_function_ptr.get_deleter()->bump_decref();.
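
A minimal standalone sketch of such a deleter (not the actual PyTorch code; bump_decref() and CountedDecref are hypothetical names, and the plain delete plus printout stand in for performing that many Py_DECREF calls):

    #include <iostream>
    #include <memory>

    struct Fn {};

    struct CountedDecref {
      int decrefs = 1;                      // how many decrefs to perform on destruction
      void bump_decref() { ++decrefs; }
      void operator()(Fn* f) const {
        std::cout << "performing " << decrefs << " decref(s)" << std::endl;
        delete f;
      }
    };

    int main() {
      std::shared_ptr<Fn> fn(new Fn(), CountedDecref{});
      // shared_ptr has no get_deleter() member; the standard accessor is the free
      // function std::get_deleter, which returns a pointer to the stored deleter.
      if (auto* d = std::get_deleter<CountedDecref>(fn)) {
        d->bump_decref();                   // another Python reference now maps onto this owner
      }
      return 0;                             // prints "performing 2 decref(s)" when fn is destroyed
    }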

@ezyang (Contributor) commented Jul 15, 2019

@malvika2147 and I were looking at the refcount situation in PyFunction, and I think the whole ownership situation here is a bit screwy.

Let us introduce our dramatis personae:

  • THPFunction - a PyObject that represents a ctx object that is passed to our custom Python autograd functions. It contains a PyFunction
  • PyFunction - a subclass of Function which says how to compute the actual backward function. Note that this is a closure in the sense that it is associated with a ctx object which actually contains the saved tensors. It contains a non-owning pointer to THPFunction

OK, do you see something wrong with this situation? The ownership here is totally backwards! It is the PyFunction that should own the THPFunction, because a PyFunction is a closure that should own its context. To put it another way, it is OK if a PyFunction gets freed before its THPFunction, but not vice versa.
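
A minimal sketch of that ownership direction (illustrative names only, not the actual PyTorch classes, and assuming the CPython headers are available): the C++ backward closure holds an owning reference to its Python ctx, so destroying the closure releases the ctx, but the ctx cannot be collected out from under a live closure.

    #include <Python.h>

    struct PyBackwardClosure {
      PyObject* ctx;  // owning reference to the THPFunction-like ctx object

      explicit PyBackwardClosure(PyObject* c) : ctx(c) {
        Py_XINCREF(ctx);   // take ownership of the ctx
      }
      ~PyBackwardClosure() {
        Py_XDECREF(ctx);   // release ownership; ctx may now be collected
      }
      // Non-copyable to keep the ownership accounting simple in this sketch.
      PyBackwardClosure(const PyBackwardClosure&) = delete;
      PyBackwardClosure& operator=(const PyBackwardClosure&) = delete;
    };

    int main() {
      Py_Initialize();
      PyObject* ctx = PyDict_New();            // any Python object works as a stand-in ctx
      {
        PyBackwardClosure closure(ctx);        // closure now co-owns ctx
        // ... backward computation would use ctx here ...
      }                                        // closure released its reference
      Py_DECREF(ctx);                          // drop our own reference
      Py_Finalize();
      return 0;
    }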

Once we fix this, we can fix the stupidity, e.g., here:

// torch/csrc/autograd/python_variable.cpp
    if (auto fn = dynamic_cast<PyFunction*>(v->cdata.grad_fn_unsafe())) {
      // Create a new reference to the THPFunction. This ensures that ref count
      // of the THPFunction is at least the number of referring THPVariables.
      const auto output_nr = v->cdata.output_nr();
      auto grad_fn = THPFunction_asFunction((THPFunction*)fn->obj);
      v->cdata.set_gradient_edge({std::move(grad_fn), output_nr});
    }

I also suspect that there is no need to have torch.autograd.Function be backed by an actual C++ class (unless we desperately want ctx to be a subclass of the Function you defined, but I don't see any good reason why we should work so hard to actually make this true.)

It might be easier to fix this once we apply @yf225's patch at #22762, since I suspect the reason we got into this situation in the first place was the legacy autograd function API.

@mrshenli (Contributor, Author) commented:

@ezyang @malvika2147

Thanks for the investigation. If you plan to work on this, feel free to abandon this PR or hijack it. It will take me a while to get back to this.

@pietern (Contributor) commented Jul 22, 2019

This should be addressed by #22983 per @ezyang.

pietern closed this on Jul 22, 2019
Successfully merging this pull request may close these issues:

  • Missing gradient when autograd called inside a function on Multi-GPU (e.g. gradient penalty)