
Conversation

@vishwakftw
Contributor

@vishwakftw vishwakftw commented Jul 27, 2018

Complete billing of changes:

Related to Batch Inverse:

  • Add batched inverse (CPU)
  • Add batched inverse (CUDA)
  • Modify autograd entry
  • Add tests
    • test_autograd
    • test_cuda
    • test_torch
  • Modify docs
  • Remove _batch_inverse in MultivariateNormal.
  • Allow batch matrices as inputs for negative powers in matrix_power

Miscellaneous modifications:

  • Move all batch operations to BatchLinearAlgebra.cpp/.cu and provide general framework for adding more batch ops.
  • Add a RAII structure for MAGMA queue management.
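
As a rough illustration of the user-facing behaviour listed above, here is a minimal usage sketch; the shapes, variable names, and tolerances are arbitrary examples, not code taken from the PR's tests:

```python
import torch

# Batch of four invertible 3x3 matrices (made symmetric positive definite on purpose).
A = torch.randn(4, 3, 3)
A = A @ A.transpose(-1, -2) + torch.eye(3)

A_inv = torch.inverse(A)  # batched inverse: each matrix in the batch is inverted

# Each product A @ A_inv should be close to the 3x3 identity.
print(torch.allclose(torch.matmul(A, A_inv), torch.eye(3).expand(4, 3, 3), atol=1e-4))

# Negative powers in matrix_power accept batched inputs as well;
# matrix_power(A, -1) should agree with torch.inverse(A).
print(torch.allclose(torch.matrix_power(A, -1), A_inv, atol=1e-4))
```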

@vishwakftw vishwakftw closed this Jul 27, 2018
I didn't know how getri worked after using getrf, which led to some issues
@vishwakftw vishwakftw reopened this Jul 27, 2018
@vadimkantorov
Contributor

As a test case, this style transfer code https://github.com/NVIDIA/FastPhotoStyle/blob/master/photo_smooth.py#L77 batch-inverts matrices of shape HxWx3x3 in NumPy; one could compare the performance of batched GPU inversion against NumPy there.
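
A rough micro-benchmark along those lines might look like the sketch below; the batch size, the conditioning trick, and the timing loop are illustrative assumptions, not code from FastPhotoStyle:

```python
import time
import numpy as np
import torch

H, W = 256, 256
# H*W well-conditioned 3x3 matrices (diagonal boosted to avoid near-singular samples).
mats = np.random.randn(H * W, 3, 3) + 3.0 * np.eye(3)

t0 = time.time()
inv_np = np.linalg.inv(mats)            # NumPy's batched inverse on the CPU
cpu_time = time.time() - t0

mats_cuda = torch.from_numpy(mats).cuda()
torch.cuda.synchronize()
t0 = time.time()
inv_cuda = torch.inverse(mats_cuda)     # batched inverse on the GPU
torch.cuda.synchronize()                # wait for the kernel before stopping the clock
gpu_time = time.time() - t0

print("numpy cpu: %.4fs, torch cuda: %.4fs" % (cpu_time, gpu_time))
```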

@vishwakftw vishwakftw changed the title [WIP] Batched Inverse [ready] Batched Inverse Jul 28, 2018
@vishwakftw
Contributor Author

Oops, I didn't notice #9102. Should I close this?

Contributor Author

@vishwakftw vishwakftw left a comment


After building, the times for the CPU version and the NumPy version looked similar, but there was a big performance regression in the CUDA case. Help would be appreciated; I have left my own inline comments with suggestions for improvement.

@vishwakftw vishwakftw changed the title [ready] Batched Inverse [help needed] Batched Inverse Jul 29, 2018
@li-roy li-roy added the ready for review label Jul 31, 2018
@vishwakftw
Contributor Author

@zou3519 test failures are unrelated.

Contributor

@zou3519 zou3519 left a comment


One minor comment, otherwise, lgtm!

@vishwakftw
Contributor Author

There is an optimization available for CUDA batched getrf for square matrices of dim < 32. Should I include this optimization here, or leave it for a later PR?

@zou3519
Contributor

zou3519 commented Oct 22, 2018

@vishwakftw if it's a quick change, feel free to do it here. Otherwise, it'll be easier to review as a separate PR.

- use magma_*getrf_smallsq_shfl for batches of matrices with dim <= 32
- remove destroyMagmaQueue and make magma queue more RAII-like

@vishwakftw
Contributor Author

vishwakftw commented Oct 23, 2018

It seems the small-matrix optimization for getrf is already applied inside MAGMA's getrf function itself; I found this while inspecting the source code. I'll revert that change.

Contributor

@zou3519 zou3519 left a comment


Looks great now! One last comment :) then we can merge this.

The test failures are unrelated.

Contributor

@facebook-github-bot facebook-github-bot left a comment


zou3519 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Oct 28, 2018
Summary:
Complete billing of changes:

Related to Batch Inverse:
- [x] Add batched inverse (CPU)
- [x] Add batched inverse (CUDA)
- [x] Modify autograd entry
- [x] Add tests
  - [x] test_autograd
  - [x] test_cuda
  - [x] test_torch
- [x] Modify docs
- [x] Remove `_batch_inverse` in `MultivariateNormal`.
- [x] Allow batch matrices as inputs for negative powers in `matrix_power`

Miscellaneous modifications:
- [x] Move all batch operations to BatchLinearAlgebra.cpp/.cu and provide general framework for adding more batch ops.
- [x] Add a RAII structure for MAGMA queue management.
Pull Request resolved: pytorch/pytorch#9949

Differential Revision: D10559089

Pulled By: zou3519

fbshipit-source-id: 7da24977f8a79d97dd42883302e13e708c1726e4
@vishwakftw vishwakftw deleted the batch-inverse branch October 28, 2018 08:58
@Jasonhunger

Jasonhunger commented Jan 12, 2021

DTD_inv = torch.inverse(DTD + self.lambda3 * torch.eye(self.input_dim).cuda())
RuntimeError: inverse_cuda: For batch 0: U(704,704) is zero, singular U.

How can I solve this error?
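
One common workaround, sketched below with a hypothetical safe_inverse helper (a general suggestion, not a fix confirmed in this thread): the error means the matrix handed to torch.inverse is numerically singular, so one can increase the regularization term, retry with a small ridge added to the diagonal, or fall back to a pseudo-inverse.

```python
import torch

def safe_inverse(mat, eps=1e-6):
    """Invert mat, retrying with a small ridge on the diagonal if it is singular."""
    n = mat.size(-1)
    eye = torch.eye(n, device=mat.device, dtype=mat.dtype)
    try:
        return torch.inverse(mat)
    except RuntimeError:
        # The matrix is numerically singular; bump the diagonal and retry.
        # torch.pinverse(mat) is another option if a pseudo-inverse is acceptable.
        return torch.inverse(mat + eps * eye)
```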
