Conversation

@mrshenli (Contributor) commented Feb 8, 2019

fixes #16532

mrshenli requested review from gchanan and resistor, February 8, 2019 04:56
@facebook-github-bot (Contributor) left a comment

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@apaszke (Contributor) left a review

That's some serious inversion of logic here. How is it that we never needed to check Python refcounts, but need to do it now? It should never be the case that you have owning Python references that don't increase the C++ refcount.

Review comment on the changed code:

if (edge.function.use_count() == 1) {
stack.emplace_back(std::move(edge.function));
auto & fn = edge.function;
// When using DataParallel, modules will be replicated multiple times, with

@apaszke: Module replication has nothing to do with the autograd graph. Every replicated module will create an independent part of the graph, and use_counts should be updated accordingly.

@mrshenli (Contributor, Author) replied Feb 8, 2019

I can verify that in the failing case mentioned in #16532, there are 3 shared_ptrs pointing to the same BroadcastBackward function object, and use_count() on all 3 shared_ptrs is 1. I think it starts from here, where we create multiple shared_ptrs from the same raw pointer. It was not a problem prior to 89d56ae, because we relied on Python to determine when to delete a function. However, after 89d56ae, the C++ code removes function edges from the autograd graph based on shared_ptr use_count.
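
To illustrate the failure mode described above, here is a minimal standalone sketch (not PyTorch code): constructing several shared_ptrs from the same raw pointer gives each one its own control block, so every use_count() stays at 1 even though three owners exist. The no-op deleter stands in for the Py_DECREF-style deleter used in practice and avoids a double free in this demo.

    #include <iostream>
    #include <memory>

    struct BroadcastBackwardLike {};  // stand-in for the real autograd function

    int main() {
      auto* raw = new BroadcastBackwardLike();
      auto noop = [](BroadcastBackwardLike*) {};  // do not delete through these owners
      // Each shared_ptr wraps the same raw pointer but owns a separate control block.
      std::shared_ptr<BroadcastBackwardLike> a(raw, noop);
      std::shared_ptr<BroadcastBackwardLike> b(raw, noop);
      std::shared_ptr<BroadcastBackwardLike> c(raw, noop);
      std::cout << a.use_count() << " " << b.use_count() << " " << c.use_count()
                << std::endl;  // prints "1 1 1", matching the observation above
      delete raw;
      return 0;
    }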

@apaszke (Contributor) replied Feb 8, 2019

Functions inherit from enable_shared_from_this, so we should just replace this line with self->cdata->shared_from_this()

@mrshenli (Contributor, Author) commented Feb 8, 2019

It should never be the case that you have owning Python references that don't increase the C++ refcount.

That was also what I thought, but after adding a print before this line, I got:

Single GPU Functional
Loss: -0.1287534236907959
Grad: [tensor([[0.2500, 0.2500, 0.2500, 0.2500]], device='cuda:0')]

Multi-GPU Functional
BroadcastBackward@0x56135dc84c50 shared_ptr use_count is 1, python refcnt is 3
BroadcastBackward@0x56135dc84c50 shared_ptr use_count is 2, python refcnt is 2
Loss: -0.1287534236907959
Grad: [tensor([[0.2500, 0.2500, 0.2500, 0.2500]], device='cuda:0'), tensor([0.], device='cuda:0')]

Multi-GPU Inline
BroadcastBackward@0x56135dc84c50 shared_ptr use_count is 1, python refcnt is 2
BroadcastBackward@0x56135dc84c50 shared_ptr use_count is 1, python refcnt is 1
Loss: -0.1287534236907959
Grad: [tensor([[0.2500, 0.2500, 0.2500, 0.2500]], device='cuda:0'), tensor([0.], device='cuda:0')]
BroadcastBackward@0x56139536c600 shared_ptr use_count is 1, python refcnt is 3
BroadcastBackward@0x56139536c600 shared_ptr use_count is 1, python refcnt is 2
BroadcastBackward@0x56139536c600 shared_ptr use_count is 2, python refcnt is 1
BroadcastBackward@0x56139536c600 shared_ptr use_count is 1, python refcnt is 1

Edit: added BroadcastBackward function object address

It still looks weird to me that the inline case is playing with two different BroadcastBackward objects. @apaszke, any thoughts?

I think it is because the loss object created in the "Multi-GPU Functional" case was deleted when the variable was reassigned in the "Multi-GPU Inline" case.

@apaszke (Contributor) commented Feb 8, 2019

Ok, but that means we're not managing the memory correctly, because we have multiple shared_ptrs that point to the same object, but use different control blocks! This is a much more serious bug, and while this patch might be a sufficient workaround for the issue you linked, there's a much bigger issue causing those problems.

@mrshenli (Contributor, Author) commented Feb 8, 2019

@apaszke any suggestions for a fix? Greg and I discussed using enable_shared_from_this, but it seems the implementation explicitly avoids shared_from_this to prevent holding a shared_ptr in PyFunction.

@apaszke (Contributor) commented Feb 8, 2019

Functions already inherit from enable_shared_from_this. Can't we just use that?

@mrshenli (Contributor, Author) commented Feb 8, 2019

@apaszke I was worried about the comment below. Could you please elaborate a bit more? Correct me if I'm wrong: if we use shared_from_this, PyFunction will hold a weak_ptr as a member (via the enable_shared_from_this base), which requires that a shared_ptr to the PyFunction is created and held somewhere. Do we have that shared_ptr created elsewhere? If not, will the behavior be undefined?

// Similar to shared_from_this. There's a problem that the Python object
// and its cdata depend on each other being alive, so we can't keep
// shared_ptrs as members, but we'd like to be able to manage the lifetime of
// the objects using shared_ptrs in the C++ graph. This returns a new
// shared_ptr, which will decrement the Python reference count when it's
// destructed. WARNING: it's generally not safe to create weak_ptrs from
// these shared_ptrs since multiple shared_ptrs may control the same underlying
// object.
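
For reference, a minimal standalone sketch (not PyTorch code) of the concern: shared_from_this() is only valid when some shared_ptr already owns the object. Since C++17 it throws std::bad_weak_ptr otherwise; before C++17 the behavior is undefined (though common implementations also threw).

    #include <iostream>
    #include <memory>

    struct Fn : std::enable_shared_from_this<Fn> {};

    int main() {
      Fn* raw = new Fn();                     // no shared_ptr owns this object yet
      try {
        auto sp = raw->shared_from_this();    // internal weak_ptr is empty -> throws
      } catch (const std::bad_weak_ptr&) {
        std::cout << "bad_weak_ptr: no owning shared_ptr exists yet" << std::endl;
      }

      std::shared_ptr<Fn> owner(raw);         // now an owning shared_ptr exists
      auto sp2 = owner->shared_from_this();   // fine: shares the same control block
      std::cout << owner.use_count() << std::endl;  // prints 2
      return 0;
    }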

@mrshenli (Contributor, Author) commented Feb 8, 2019

Using shared_from_this in THPFunction_asFunction does not work, because there is no shared_ptr pointing to it:

Multi-GPU Functional
Traceback (most recent call last):
  File "bug_test.py", line 32, in <module>
    loss = gradient_penalty(multigpu_net, x)
  File "bug_test.py", line 10, in gradient_penalty
    output = netD(x)
  File "/home/shenli/project/pytorch/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/shenli/project/pytorch/torch/nn/parallel/data_parallel.py", line 139, in forward
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
  File "/home/shenli/project/pytorch/torch/nn/parallel/data_parallel.py", line 150, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
  File "/home/shenli/project/pytorch/torch/nn/parallel/scatter_gather.py", line 35, in scatter_kwargs
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
  File "/home/shenli/project/pytorch/torch/nn/parallel/scatter_gather.py", line 28, in scatter
    return scatter_map(inputs)
  File "/home/shenli/project/pytorch/torch/nn/parallel/scatter_gather.py", line 15, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/home/shenli/project/pytorch/torch/nn/parallel/scatter_gather.py", line 13, in scatter_map
    return Scatter.apply(target_gpus, None, dim, obj)
RuntimeError: bad_weak_ptr

@mrshenli (Contributor, Author) commented:

As multiple people have reported hitting this bug, shall we reach a consensus on the fix? @apaszke @gchanan @resistor @ssnl

@mrshenli (Contributor, Author) commented:

#14960 is another bug report for this reference count issue.

@apaszke (Contributor) commented Feb 19, 2019

@mrshenli where are those problematic autograd::Function instances created? One thing we could try is to use shared_from_this() and create a new shared_ptr if that fails. That way we'd always have a single one pointing to that object.
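
A minimal sketch of that fallback idea, assuming C++17 semantics for shared_from_this() (this is not the actual PyTorch change, and the plain delete stands in for the Py_DECREF-based deleter used in practice):

    #include <memory>

    struct Fn : std::enable_shared_from_this<Fn> {};

    std::shared_ptr<Fn> as_shared(Fn* raw) {
      try {
        return raw->shared_from_this();             // reuse the existing control block
      } catch (const std::bad_weak_ptr&) {
        // No shared_ptr owns the object yet; create the first one. This also
        // initializes the enable_shared_from_this weak reference, so later
        // calls to shared_from_this() will reuse this control block.
        return std::shared_ptr<Fn>(raw, [](Fn* f) { delete f; });
      }
    }

    int main() {
      Fn* raw = new Fn();
      auto a = as_shared(raw);   // creates the first owner
      auto b = as_shared(raw);   // reuses a's control block via shared_from_this()
      return a.use_count() == 2 ? 0 : 1;
    }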

@mrshenli (Contributor, Author) replied to @apaszke:

One thing we could try is to use shared_from_this() and create a new shared_ptr if that fails. That way we'd always have a single one pointing to that object.

Let me try it.

The stack trace is as follows.

replicate.py: replicate
python_function.cpp: THPFunction_apply -> process_outputs -> _wrap_outputs
python_variable.cpp: THPVariable_NewWithVar -> THPFunction_asFunction

@albanD (Collaborator) commented Feb 19, 2019

UPDATE: this comment was written before your change to use shared_from_this(). As you will see below, I think it will break the fix from #1705.

Looking back at the history, it feels like since #1705, THPFunctions are completely managed by the refcount of their Python object, so a shared pointer use count is meaningless.
Also, the "Similar to shared_from_this." comment is outdated and should be removed. It was the implementation before #1705 that was similar to it (storing a weak_ptr).

@apaszke we actually want a different shared pointer here. In particular, we want to make sure that each THPVariable -> THPFunction link contributes one refcount to the THPFunction object (the link between the two is via a shared pointer).

I'm not sure we need to add Python-related logic to Functions, though.
We can have a use_count() method on Function that counts how many smart pointers point to it.

  • For a regular Function, it returns 1, since only a single shared pointer ever uses it and we don't do raw-pointer tricks with it.
  • For a PyFunction, it returns Py_REFCNT(obj). In this case there is always a THPFunction associated with the PyFunction, stored in obj, and its refcount represents how many uses there are (shared pointers used in C++ plus actual Python references).
    Based on this, you can delete a Function if and only if Function::use_count() == 1 and you are the only one using your own smart pointer. The condition would become fn->use_count() == 1 && fn.use_count() == 1: the Function is referenced by only a single smart pointer, and I hold the only reference through that smart pointer (see the sketch below).
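
A minimal standalone sketch of this proposal (not the actual PyTorch code); a plain counter stands in for Py_REFCNT(obj) so the example compiles without the CPython headers:

    #include <memory>

    struct Function {
      virtual ~Function() = default;
      // How many logical owners the function has, independent of shared_ptr bookkeeping.
      virtual long use_count() const { return 1; }  // regular Function: a single owner
    };

    struct PyFunctionLike : Function {
      long py_refcnt = 1;  // stand-in for Py_REFCNT(obj) on the associated THPFunction
      long use_count() const override { return py_refcnt; }
    };

    // Deletion check as proposed: the object reports a single logical owner AND this
    // shared_ptr's control block is the only one keeping the C++ object alive.
    bool can_delete(const std::shared_ptr<Function>& fn) {
      return fn->use_count() == 1 && fn.use_count() == 1;
    }

    int main() {
      auto fn = std::shared_ptr<Function>(new PyFunctionLike());
      return can_delete(fn) ? 0 : 1;  // deletable: one logical owner, one shared_ptr
    }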

@mrshenli (Contributor, Author) replied:

@albanD I agree that the current fix using shared_from_this() breaks what #1705 tries to achieve. Your proposed solution SGTM. I will wait 1-2 days for others to comment. If there are no objections, I will revise this PR. Thanks!

@apaszke (Contributor) commented Feb 23, 2019

Ok, while I agree that the current solution might regress #1705, having us check Python refcounts in pure C++ code (which should have no dependency on Python) is a case of the tail wagging the dog... For me it's clear that we absolutely need to use a single shared_ptr, or we will have similar bugs in the future too. What we should do instead is have a more sophisticated deleter. In particular, it should no longer be a simple invocation of Py_DECREF, but an object that stores how many decrefs it should perform when it is destroyed. We can bump this by doing something like py_function_ptr.get_deleter()->bump_decref();.
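
A minimal standalone sketch of such a deleter (not the actual PyTorch code; bump_decref() and CountedDecref are hypothetical names, and the plain delete plus printout stand in for performing that many Py_DECREF calls):

    #include <iostream>
    #include <memory>

    struct Fn {};

    struct CountedDecref {
      int decrefs = 1;                      // how many decrefs to perform on destruction
      void bump_decref() { ++decrefs; }
      void operator()(Fn* f) const {
        std::cout << "performing " << decrefs << " decref(s)" << std::endl;
        delete f;
      }
    };

    int main() {
      std::shared_ptr<Fn> fn(new Fn(), CountedDecref{});
      // shared_ptr has no get_deleter() member; the standard accessor is the free
      // function std::get_deleter, which returns a pointer to the stored deleter.
      if (auto* d = std::get_deleter<CountedDecref>(fn)) {
        d->bump_decref();                   // another Python reference now maps onto this owner
      }
      return 0;                             // prints "performing 2 decref(s)" when fn is destroyed
    }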

@ezyang (Contributor) commented Jul 15, 2019

@malvika2147 and I were looking at the refcount situation in PyFunction, and I think the whole ownership situation here is a bit screwy.

Let us introduce our dramatis personae:

  • THPFunction - a PyObject that represents a ctx object that is passed to our custom Python autograd functions. It contains a PyFunction
  • PyFunction - a subclass of Function which says how to compute the actual backward function. Note that this is a closure in the sense that it is associated with a ctx object which actually contains the saved tensors. It contains a non-owning pointer to THPFunction

OK, do you see something wrong with this situation? The ownership here is totally backwards! It is the PyFunction that should own the THPFunction, because a PyFunction is a closure that should own its context. To put it another way, it is OK if a PyFunction gets freed before its THPFunction, but not vice versa.
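
A minimal sketch of that ownership direction (illustrative names only, not the actual PyTorch classes, and assuming the CPython headers are available): the C++ backward closure holds an owning reference to its Python ctx, so destroying the closure releases the ctx, but the ctx cannot be collected out from under a live closure.

    #include <Python.h>

    struct PyBackwardClosure {
      PyObject* ctx;  // owning reference to the THPFunction-like ctx object

      explicit PyBackwardClosure(PyObject* c) : ctx(c) {
        Py_XINCREF(ctx);   // take ownership of the ctx
      }
      ~PyBackwardClosure() {
        Py_XDECREF(ctx);   // release ownership; ctx may now be collected
      }
      // Non-copyable to keep the ownership accounting simple in this sketch.
      PyBackwardClosure(const PyBackwardClosure&) = delete;
      PyBackwardClosure& operator=(const PyBackwardClosure&) = delete;
    };

    int main() {
      Py_Initialize();
      PyObject* ctx = PyDict_New();            // any Python object works as a stand-in ctx
      {
        PyBackwardClosure closure(ctx);        // closure now co-owns ctx
        // ... backward computation would use ctx here ...
      }                                        // closure released its reference
      Py_DECREF(ctx);                          // drop our own reference
      Py_Finalize();
      return 0;
    }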

Once we fix this, we can fix the stupidity, e.g., here:

// torch/csrc/autograd/python_variable.cpp
    if (auto fn = dynamic_cast<PyFunction*>(v->cdata.grad_fn_unsafe())) {
      // Create a new reference to the THPFunction. This ensures that ref count
      // of the THPFunction is at least the number of referring THPVariables.
      const auto output_nr = v->cdata.output_nr();
      auto grad_fn = THPFunction_asFunction((THPFunction*)fn->obj);
      v->cdata.set_gradient_edge({std::move(grad_fn), output_nr});
    }

I also suspect that there is no need to have torch.autograd.Function be backed by an actual C++ class (unless we desperately want ctx to be a subclass of the Function you defined, but I don't see any good reason why we should work so hard to actually make this true.)

It might be easier to fix this once we apply @yf225's patch at #22762, since I suspect the reason we got into this situation in the first place was the legacy autograd function API.

@mrshenli (Contributor, Author) commented:

@ezyang @malvika2147

Thanks for the investigation. If you plan to work on this, feel free to abandon this PR or hijack it. It will take me a while to get back to this.

@pietern (Contributor) commented Jul 22, 2019

This should be addressed by #22983 per @ezyang.

pietern closed this on Jul 22, 2019
Successfully merging this pull request may close these issues:

  • Missing gradient when autograd called inside a function on Multi-GPU (e.g. gradient penalty)