
Conversation

@apaszke (Contributor) commented Jan 30, 2019

Fixes #16577.

This greatly improves the memory efficiency of certain ops like Dropout2d. Previously, they were implemented as input * mask where mask never requires_grad, but we didn't use that knowledge in the forward pass, and (in the case of an in-place dropout) kept input.clone() for the backward pass, even though it would simply be ignored.

This patch tries to address this situation by emitting guards for stores like this, but only if they are as simple as checking whether a single value requires_grad.
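
To make the effect concrete, here is a rough Python-level sketch of the Dropout2d-style pattern described above (not taken from the PR's test suite; it assumes a CUDA device, and the exact numbers depend on the caching allocator):

import torch

x = torch.randn(1024, 1024, device="cuda", requires_grad=True)
mask = (torch.rand(1024, 1024, device="cuda") > 0.5).float()  # never requires grad

base_mem = torch.cuda.memory_allocated()
y = (x + 0).mul_(mask)   # dropout-style `input * mask`, applied in place
end_mem = torch.cuda.memory_allocated()

# Only `y` itself should account for the growth; before this patch an extra
# clone of the input was also kept alive for backward, doubling the cost.
print("extra bytes kept alive:", end_mem - base_mem)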

Interestingly, the same optimizations apply to methods like bmm, baddbmm, etc., but not to mm or addmm, because of how their derivatives are defined. Apparently they unnecessarily use mat1 to compute the derivative of mat1, just to improve the error message in case mat1 was sparse. I'd like to apply this optimization to that case, but I don't want to lose the nicer error message, so if anyone has any ideas for solutions, please let me know...

Full list of operators affected by this patch:

  • _nnpack_spatial_convolution
  • addbmm
  • addcdiv
  • addcmul
  • addmv
  • addr
  • baddbmm
  • bmm
  • cross
  • div
  • dot
  • fmod
  • ger
  • index_add_
  • mul
  • mv
  • scatter_add_

@ssnl (Collaborator) commented Jan 31, 2019

I'd like to apply this optimization to that case, but I don't want to lose the nicer error message, so if anyone has any ideas for solutions, please let me know...

Idea: save TensorGeometry(mat1) instead, and augment TensorGeometry to include sparsity if needed.
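
For illustration, a hypothetical Python-level analogue of that idea (the real change would live in the C++ derivative definitions; the class name and error text here are made up) could save only the metadata needed for the error message instead of mat1 itself:

import torch

class MmSavingOnlyGeometry(torch.autograd.Function):
    @staticmethod
    def forward(ctx, mat1, mat2):
        # Keep just what the backward error check needs: sparsity and shape.
        ctx.mat1_is_sparse = mat1.is_sparse
        ctx.mat1_shape = tuple(mat1.shape)
        ctx.save_for_backward(mat2)      # mat1 itself is never saved
        return torch.mm(mat1, mat2)

    @staticmethod
    def backward(ctx, grad_out):
        (mat2,) = ctx.saved_tensors
        if ctx.mat1_is_sparse:
            # The nicer error message survives without holding on to mat1.
            raise RuntimeError(
                "gradient for a sparse mat1 of shape "
                f"{ctx.mat1_shape} is not supported")
        grad_mat1 = torch.mm(grad_out, mat2.t())
        # The gradient for mat2 would need mat1, so this sketch only
        # differentiates with respect to mat1.
        return grad_mat1, None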

@ssnl (Collaborator) commented Jan 31, 2019

This looks awesome. A before & after comparison of the generated code would be great! :)

@apaszke (Contributor, Author) commented Jan 31, 2019

Oh yes, I forgot to show an example change. Here's the case of mul_ (the one we use for in-place dropout).

Old code:

Tensor & VariableType::mul_(Tensor & self, const Tensor & other) const {
  ...
  if (compute_requires_grad( self, other )) {
    grad_fn = std::shared_ptr<MulBackward0>(new MulBackward0(), deleteFunction);
    grad_fn->set_next_edges(collect_next_edges( self, other ));
    grad_fn->self_ = SavedVariable(self.clone(), false);
    grad_fn->other_ = SavedVariable(other, false);
  }
  ...
}

New code:

Tensor & VariableType::mul_(Tensor & self, const Tensor & other) const {
  ...
  if (compute_requires_grad( self, other )) {
    grad_fn = std::shared_ptr<MulBackward0>(new MulBackward0(), deleteFunction);
    grad_fn->set_next_edges(collect_next_edges( self, other ));
    if (grad_fn->should_compute_output(1)) {
      grad_fn->self_ = SavedVariable(self.clone(), false);
    }
    if (grad_fn->should_compute_output(0)) {
      grad_fn->other_ = SavedVariable(other, false);
    }
  }
  ...
}
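
For mul, the gradient w.r.t. self is grad * other and the gradient w.r.t. other is grad * self, so (as the guards above suggest) the cloned self only has to be stored when the gradient for other will actually be computed. A small illustrative check of the end result, not part of the PR's test suite:

import torch

x = torch.randn(4, 4, requires_grad=True)
mask = (torch.rand(4, 4) > 0.5).float()   # requires_grad is False

y = (x + 0).mul_(mask)    # in-place multiply on a non-leaf, as in-place dropout does
y.sum().backward()

# d(y)/d(x) is just `mask`, so backward never needs the (cloned) input when
# `mask` doesn't require grad, which is exactly what the new guard skips saving.
assert torch.equal(x.grad, mask)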

@apaszke (Contributor, Author) commented Jan 31, 2019

TensorGeometry seems like a nice idea, I'll try that in a follow-up patch!

@apaszke (Contributor, Author) commented Jan 31, 2019

The failures are either timeouts or data loader multiprocessing tests, which I doubt are affected by this change.

@gchanan (Contributor) commented Jan 31, 2019

I'm guessing this doesn't also magically fix #15115, but I'm going to check anyway.

@gchanan (Contributor) commented Jan 31, 2019

Oh, it appears it does fix #15115.

# In the end the memory usage should remain equal, because neither of
# (x + 2) and ((x + 2) * m) should be kept alive for backward, while the
# previous allocation of z had the same size as the current one.
self.assertEqual(base_mem, end_mem)
Review comment (Contributor):

Can you add a test for #15115? It's probably more robust than this one because it doesn't rely on memory allocations, and it actually tests something different (i.e. you could solve #15115 by saving but not unpacking, whereas you can't solve the memory issue that way).

Reply (Contributor, Author):

Sure, I'll add that test, but I still think this one is robust and tests an important thing. The reason I took up this patch is that we were using on the order of gigabytes more memory for CNNs that used in-place Dropout2d (which is effectively input * mask), and that should never happen!

Reply (Contributor, Author):

Also, note that this patch doesn't change the unpacking behavior. So we will still run all the unpacks, except that variables that weren't saved will unpack as undefined tensors (which is fine, because they won't be used anyway).

Reply (Contributor):

Right, I'm not saying one test or the other is better; I'm saying they test different things and both should be tested.

@facebook-github-bot left a comment:
@gchanan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@gchanan (Contributor) commented Feb 1, 2019

I don't know why there are failing tests.

@apaszke (Contributor, Author) commented Feb 1, 2019

I don't see any tests failing. Where are they? Can you rerun them? It used to be green.

@gchanan (Contributor) commented Feb 1, 2019

I see 4 failing tests:

  • ci/circleci: binary_linux_conda_2.7_cpu_build and ci/circleci: binary_linux_conda_3.6_cu90_build, which I believe are bogus but should be fixed by "Allow USE_NINJA to be toggled by an env variable" (#16665).
  • pr/py2-clang7-rocmdeb-ubuntu16.04 and pr/py2-devtoolset7-rocmrpm-centos7.5, which seem to time out and which I haven't seen on other PRs.

@facebook-github-bot left a comment:
@gchanan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@gchanan (Contributor) commented Feb 4, 2019

Trying rebase again.

@gchanan (Contributor) commented Feb 4, 2019

@apaszke this has consistently timed out on the ROCm builds, across a few rebases. Can you take a look?

@apaszke (Contributor, Author) commented Feb 4, 2019

How can I debug this? The patch adds a bit of code to VariableType.cpp, so if the ROCm compiler is extremely slow for some reason, we might have started exceeding the time limit if it was already close...

@gchanan (Contributor) commented Feb 5, 2019

@apaszke sorry, I should have said ROCm tests (the error message says "build", but it fails during a test).

Here is the latest example:

 18:04:13 test_zeros_like (test_sparse.TestCudaUncoalescedSparse) ... Build timed out (after 120 minutes). Marking the build as failed.

@apaszke (Contributor, Author) commented Feb 5, 2019

Still, do we have instructions that would let me reproduce a ROCm build?

@bddppq (Contributor) commented Feb 9, 2019

@pytorchbot retest this please

@facebook-github-bot left a comment:
@gchanan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

pearu pushed a commit to Quansight/pytorch that referenced this pull request Feb 12, 2019
Summary: (same as the PR description above)

Pull Request resolved: pytorch#16583

Differential Revision: D13900881

Pulled By: gchanan

fbshipit-source-id: dd0aeb2ab58c4b6aa95b37b46d3255b3e014291c

Successfully merging this pull request may close these issues: Dropout2d uses a lot of memory.