
Conversation

@vishwakftw
Contributor

Earlier, the workspace size query and allocation were placed inside the loop.
However, since the batch consists of matrices with the same number of rows and columns, performing the workspace size query and allocation for every matrix in the batch is redundant.

This PR moves the workspace size query and allocation outside the loop, saving (batch_size - 1) queries and allocations (and, consequently, deallocations).

This change yields a substantial speedup in batched inverse computation (roughly 4x on batches of 3 x 3 matrices; see the benchmark below).

Changelog:

  • Move workspace query and allocation outside the batch loop

Test Plan:

  • All existing tests for inverse should pass to verify that the change is correct
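The hoisting described above can be sketched in Python. ATen's actual implementation is C++, and `workspace_query` and `invert_in_place` below are hypothetical stand-ins for the LAPACK workspace-size query and the per-matrix inverse kernel, used only to show the loop-invariant pattern:

```python
# Minimal sketch of the loop-invariant hoisting, assuming a LAPACK-style
# workspace protocol. These functions are hypothetical stand-ins, not
# ATen's real API.

query_count = 0

def workspace_query(n):
    """Pretend LAPACK workspace-size query (counts its own invocations)."""
    global query_count
    query_count += 1
    return n * 64  # a typical optimal lwork is n * block_size

def invert_in_place(matrix, workspace):
    """Pretend per-matrix inverse kernel; body elided for the sketch."""
    pass

def batched_inverse(batch, n):
    # After this PR: the query and allocation run once, because every
    # matrix in the batch has the same number of rows and columns.
    lwork = workspace_query(n)
    workspace = [0.0] * lwork
    for matrix in batch:
        invert_in_place(matrix, workspace)

batch_size, n = 1000, 3
batched_inverse([None] * batch_size, n)
print(f"workspace queries issued: {query_count}")  # 1 instead of batch_size
```

Before the change, the `workspace_query` and workspace allocation lived inside the `for` loop, so they would have run `batch_size` times instead of once.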

@vishwakftw
Contributor Author

Speedup summary on batches of 3 x 3 matrices:

| Dimensions | Before this PR | After this PR | Improvement factor (higher is better) |
| --- | --- | --- | --- |
| 100 x 3 x 3 | 95 µs ± 539 ns per loop | 24.7 µs ± 326 ns per loop | 3.85 |
| 200 x 3 x 3 | 185 µs ± 1.21 µs per loop | 43.6 µs ± 227 ns per loop | 4.23 |
| 500 x 3 x 3 | 450 µs ± 4.6 µs per loop | 99.1 µs ± 189 ns per loop | 4.54 |
| 1000 x 3 x 3 | 889 µs ± 4.5 µs per loop | 193 µs ± 586 ns per loop | 4.61 |
| 2000 x 3 x 3 | 1.74 ms ± 2.83 µs per loop | 379 µs ± 1.43 µs per loop | 4.59 |
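Numbers along these lines can be reproduced with a short timing script. The sketch below uses NumPy's batched `np.linalg.inv` as a stand-in for measuring batched inverse throughput (the figures above were measured against PyTorch itself, which this sketch does not assume is installed):

```python
import numpy as np
import timeit

def bench(batch_size, n=3, repeats=100):
    # Add n*I so every matrix in the batch stays well conditioned.
    rng = np.random.default_rng(0)
    a = rng.standard_normal((batch_size, n, n)) + n * np.eye(n)
    per_call = timeit.timeit(lambda: np.linalg.inv(a), number=repeats) / repeats
    return a, per_call

for batch_size in (100, 200, 500, 1000, 2000):
    a, per_call = bench(batch_size)
    print(f"{batch_size} x 3 x 3: {per_call * 1e6:.1f} us per call")
```

Absolute times will differ by machine and library, but the scaling with batch size should be roughly linear, as in the table.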

Contributor

@ezyang ezyang left a comment


Nice!

Contributor

@facebook-github-bot facebook-github-bot left a comment


@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ezyang merged this pull request in 8c4b2a8.

zdevito pushed a commit to zdevito/ATen that referenced this pull request May 24, 2019
Summary:
Earlier, the workspace size query and allocation was placed inside the loop.
However, since we have batches of matrices with the same number of rows and columns, the workspace size query and allocation for every matrix in the batch is redundant.

This PR moves the workspace size query and allocation outside the loop, effectively saving (batch_size - 1) number of queries and allocation (and consequently the deallocation).

There is a tremendous speedup in inverse computation as a result of this change.

Changelog:
- Move workspace query and allocation outside the batch loop
Pull Request resolved: pytorch/pytorch#20904

Differential Revision: D15495505

Pulled By: ezyang

fbshipit-source-id: 226729734465fcaf896f86e1b1a548a81440e082
@vishwakftw vishwakftw deleted the remove-extra-workspace-queries branch May 29, 2019 17:21
@vishwakftw vishwakftw added the module: performance Issues related to performance, either of kernel code or framework glue label Aug 3, 2019