Conversation

skrah commented May 13, 2019

This addresses #18862.

skrah added the module: performance, module: operators, module: memory usage, and triaged labels on May 13, 2019

skrah commented May 13, 2019

This PR implements more or less @colesbury's suggestion C from #18862. Suggestion C ends up calling the existing optimization for A.ndim >= 3 && B.ndim <= 2 after transposing the inner dimensions of the arguments and swapping them.
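
Roughly, the old path expands the 2-D argument into a full batch for bmm; for the test case below that is a (192, 4096, 4096) float tensor, i.e. 192 * 4096 * 4096 * 4 B ≈ 12.9 GB, which lines up with the ~13 GB peak in the "Before" trace. A minimal Python-level sketch of the transpose-and-swap idea (a hypothetical helper for illustration, not the ATen code):

import torch

# (A @ B)^T == B^T @ A^T per batch, so the A.ndim <= 2, B.ndim >= 3 case can be
# routed through the existing optimization for A.ndim >= 3 && B.ndim <= 2.
def matmul_via_swap(a, b):
    a2 = a.unsqueeze(0) if a.dim() == 1 else a           # treat a 1-D lhs as a row vector
    res = torch.matmul(b.transpose(-1, -2), a2.transpose(-1, -2))
    res = res.transpose(-1, -2)
    return res.squeeze(-2) if a.dim() == 1 else res      # undo the inserted dimension

x = torch.randn(64, 64)
y = torch.randn(8, 64, 3)
assert torch.allclose(matmul_via_swap(x, y), torch.matmul(x, y))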

Memory usage

Memory usage as reported by valgrind --tool=massif.

Test case

x = torch.randn(4096, 4096)
y = torch.randn(192, 4096, 1)
z = torch.matmul(x, y)

Before

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 51 18,481,148,839   12,980,531,144   12,978,674,895     1,856,249            0
 52 18,557,183,694   12,980,531,144   12,978,674,895     1,856,249            0
 53 18,633,218,549   12,980,531,144   12,978,674,895     1,856,249            0
 54 18,688,264,583       17,333,504       15,568,202     1,765,302            0

After

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 76  5,124,473,874       96,953,336       95,088,436     1,864,900            0
 77  5,139,417,005       95,591,328       93,726,580     1,864,748            0
 78  5,155,408,421       22,011,848       20,148,194     1,863,654            0
 79  5,171,971,669       21,145,472       19,304,096     1,841,376            0
 80  5,180,817,596       17,866,168       16,086,322     1,779,846            0
 81  5,189,662,538        7,099,760        6,442,537       657,223            0

Timings

Test case

x = torch.randn(4096, 4096)
y = torch.randn(192, 4096, 1)

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    z = torch.matmul(x, y)

print(prof)

Before

Self CPU time total: 7.430s
CUDA time total: 7.430s

After

Self CPU time total: 445.886ms
CUDA time total: 445.883ms

skrah commented May 13, 2019

For small matrices the speed is about the same:

x = torch.randn(64)
y = torch.randn(2, 64, 1)

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    for i in range(10000):
        z = torch.matmul(x, y)

Before

Self CPU time total: 1.076s
CUDA time total: 1.081s

After

Self CPU time total: 939.033ms
CUDA time total: 940.369ms

skrah commented May 13, 2019

@pytorchbot retest this please.

skrah commented May 14, 2019

The PR is conservative in not reusing the out_opt argument in the recursive call, but I don't see a big difference in the timings (both the old and the new code have the same timings with or without an explicit out arg).
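
For context, a small illustration of the two call styles compared here (the out_opt reuse itself happens inside the C++ matmul implementation; this only shows calling matmul with and without an explicit out argument):

import torch

x = torch.randn(4096, 4096)
y = torch.randn(192, 4096, 1)

# Without an explicit output tensor: matmul allocates the result itself.
z = torch.matmul(x, y)

# With an explicit output tensor: the result is written into a preallocated buffer.
out = torch.empty(192, 4096, 1)
torch.matmul(x, y, out=out)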

skrah commented May 14, 2019

@pytorchbot retest this please.

skrah changed the title from "[WIP] Add matmul optimization for the case A.ndim <= 2 && B.ndim >= 3" to "Add matmul optimization for the case A.ndim <= 2 && B.ndim >= 3" on May 14, 2019
ezyang requested a review from colesbury on May 14, 2019 20:07

ezyang commented May 14, 2019

Hey @colesbury, do you think you can review this? If not, I'll look.

colesbury (Member) left a comment

Looks good. Please make sure there are correctness tests in test_torch.py and test_autograd.py that cover this case. Specifically:

dim_tensor1=1, dim_tensor2 >= 3
dim_tensor1=2, dim_tensor2 >= 3
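
A minimal gradcheck-style sketch of those two combinations (hypothetical test code for illustration, not the tests that were actually added):

import torch

def check_case(a_shape, b_shape):
    # Double precision so gradcheck's finite differences are reliable.
    a = torch.randn(*a_shape, dtype=torch.double, requires_grad=True)
    b = torch.randn(*b_shape, dtype=torch.double, requires_grad=True)
    assert torch.autograd.gradcheck(torch.matmul, (a, b))

check_case((5,), (3, 5, 4))       # dim_tensor1 = 1, dim_tensor2 = 3
check_case((2, 5), (3, 5, 4))     # dim_tensor1 = 2, dim_tensor2 = 3
check_case((2, 5), (6, 3, 5, 4))  # dim_tensor1 = 2, dim_tensor2 = 4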

pytorchbot added the module: tests label on May 15, 2019

skrah commented May 15, 2019

test_torch.py already has tests here (they fail if an error is introduced in the new code):

result = maybe_squeeze_result(l, r, l_matmul_fn(r))

I've added some basic test_autograd.py tests.

Also I've tested the code with a couple of throwaway scripts like this one, but they may be too long for the unit tests:

import torch
import numpy as np

# 2d, 3d
for N in range(1, 20):
  for M in range(1, 20):
    for P in range(1, 20):
      for O in range(1, 20):
        x = torch.arange(N*M).reshape(N, M)
        y = torch.arange(O*M*P).reshape(O, M, P)
        expected = torch.bmm(x.unsqueeze(0).expand(O, N, M), y)
        z = torch.matmul(x, y)
        if not torch.equal(z, expected):
          raise RuntimeError("different results: %s %s %s %s" % (N, M, P, O))
        # Check contiguity flags via numpy.
        ex = np.array(expected, copy=False)
        zz = np.array(z, copy=False)
        if ex.flags != zz.flags or ex[0].flags != zz[0].flags:
          raise RuntimeError("different flags: %s %s %s %s" % (N, M, P, O))

# 1d, 3d
N = 1
for M in range(1, 20):
  for P in range(1, 20):
    for O in range(1, 20):
      x = torch.arange(M)
      y = torch.arange(O*M*P).reshape(O, M, P)
      expected = torch.bmm(x.expand(O, N, M), y).reshape(O, P)
      z = torch.matmul(x, y)
      if not torch.equal(z, expected):
        raise RuntimeError("different results: %s %s %s %s" % (N, M, P, O))
      # Check contiguity flags via numpy.
      ex = np.array(expected, copy=False)
      zz = np.array(z, copy=False)
      if ex.flags != zz.flags or ex[0].flags != zz[0].flags:
        raise RuntimeError("different flags: %s %s %s %s" % (N, M, P, O))

skrah commented May 15, 2019

@pytorchbot retest this please.

skrah commented May 15, 2019

@colesbury Thanks for the comments; I think they have all been addressed.

skrah requested a review from colesbury on May 15, 2019 13:08

ezyang commented May 15, 2019

Also I've tested the code with a couple of throwaway scripts like this one, but they may be too long for the unit tests

That's what the @slowTest decorator is for :)

skrah commented May 15, 2019

That's what the @slowTest decorator is for :)

OK, thanks. Actually I can add that test with range(1, 10); then it just takes 1s (4s in a debug build).

Apparently test_matmul_4d_4d was banned at some point; it is in a list called THESE_TAKE_WAY_TOO_LONG, and I can no longer find the test itself. :)
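
For reference, a sketch of how such a brute-force test might be marked slow (assuming the slowTest decorator and TestCase helpers from the test suite's common_utils; the test name and body are illustrative only):

import torch
from common_utils import TestCase, run_tests, slowTest

class TestMatmulBruteForce(TestCase):
    @slowTest
    def test_matmul_2d_3d_small(self):
        # Compare matmul against an explicit expand + bmm reference for small shapes.
        for n in range(1, 10):
            for m in range(1, 10):
                for p in range(1, 10):
                    for o in range(1, 10):
                        x = torch.randn(n, m)
                        y = torch.randn(o, m, p)
                        expected = torch.bmm(x.unsqueeze(0).expand(o, n, m), y)
                        self.assertEqual(torch.matmul(x, y), expected)

if __name__ == '__main__':
    run_tests()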

facebook-github-bot left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot left a comment

@ezyang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request May 17, 2019
Summary:
This addresses #18862.
Pull Request resolved: pytorch/pytorch#20448

Differential Revision: D15393465

Pulled By: ezyang

fbshipit-source-id: 87e5b0ed8253ea00365f420d98ac96dd4e934028

@ezyang merged this pull request in 8c9f4c5.
