
Conversation

@lezcano (Collaborator) commented Sep 1, 2021

Stack from ghstack:

A number of these optimisations include:

  • Always prefer `DimVector` over `std::vector` when handling dimensions.
  • Make the code `const` correct.
  • Create `DimVector`s more efficiently (e.g. prefer `append` over `insert`).
  • Access sizes of the tensors via `sizes().front()` / `sizes().back()` / `sizes().end()[-2]`.
  • Do not create intermediary tensors / vectors when it can be avoided.
  • Dispatch to `mv` rather than `mm` in the case `(n,) x (n, m)` (see the check further down the thread).
  • Call `reshape` rather than `expect_contiguous` + `view`.
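
To make the `DimVector` items above concrete, here is a minimal sketch of the pattern (the helper name and the rank >= 2 assumption are mine, not the PR's actual code):

```cpp
// Hedged sketch of the DimVector idioms from the list above. The helper name
// is made up and assumes both inputs have rank >= 2; it is not the PR's code.
#include <ATen/ATen.h>

at::DimVector matmul_output_size(const at::Tensor& t1, const at::Tensor& t2) {
  const auto s1 = t1.sizes();            // IntArrayRef: a view over the sizes, no copy
  const auto s2 = t2.sizes();
  at::DimVector out;                     // small-vector: no heap allocation for common ranks
  out.append(s1.begin(), s1.end() - 2);  // batch dims in one shot: append, not insert
  out.push_back(s1.end()[-2]);           // rows of t1
  out.push_back(s2.back());              // cols of t2
  return out;
}
```

`DimVector` is a `SmallVector<int64_t, ...>`, so for the ranks `matmul` typically sees the whole shape lives on the stack, which is where the win over `std::vector` comes from.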

On top of this, while fixing the CI and after further discussion, the following changes were made:

  • [Fix CI] Add forward AD to `mv`, `matmul`, `__rmatmul__`.
  • [Fix CI] Add support for `Half` to a number of operations.
  • [Fix CI] Change some calls that went directly into the matmul implementation; they now call into the `_out` variant.
  • Remove the uses of `set_` (requested by @ezyang).
  • Fix the resize bug in `matmul_out`.

Looking ahead, further optimisations could include dispatching to lower-level functions
when one of the inputs has shape `(n, 1)` or `(1, n)`, as sketched below.
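
A sketch of the algebraic identities such a dispatch could exploit (my example, using public ATen calls; treating `at::outer` as available is an assumption):

```cpp
// Hedged sketch: a (n, 1) @ (1, m) product is an outer product, and a
// (1, n) @ (n, m) product is a vector-matrix product expressible via mv.
#include <ATen/ATen.h>
#include <iostream>

int main() {
  const auto a = at::randn({4, 1});
  const auto b = at::randn({1, 6});
  const auto c = at::randn({6, 3});
  const auto as_outer = at::outer(a.squeeze(1), b.squeeze(0));   // == at::mm(a, b)
  const auto as_mv    = at::mv(c.t(), b.squeeze(0));             // == at::mm(b, c).squeeze(0)
  std::cout << as_outer.allclose(at::mm(a, b)) << ' '
            << as_mv.allclose(at::mm(b, c).squeeze(0)) << '\n';  // prints "1 1"
  return 0;
}
```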

Fixes #67767

cc @VitalyFedyunin @ngimel @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @lezcano @heitorschueroff

@facebook-github-bot (Contributor) commented Sep 1, 2021

💊 CI failures summary and remediations

As of commit 7b18adf (more details on the Dr. CI page):


  • 7/7 failures introduced in this PR

🕵️ 7 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (1/7)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Nov 03 18:08:21 FAIL [0.015s]: test_noncontiguo...amples_matmul_cpu_float32 (__main__.TestCommonCPU)
Nov 03 18:08:21     return fn(self, *args, **kwargs)
Nov 03 18:08:21   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
Nov 03 18:08:21     fn(*args, **kwargs)
Nov 03 18:08:21   File "test_ops.py", line 263, in test_noncontiguous_samples
Nov 03 18:08:21     self.assertEqual(actual_grad, expected_grad)
Nov 03 18:08:21   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1903, in assertEqual
Nov 03 18:08:21     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
Nov 03 18:08:21 AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 1 element(s) (out of 10) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.3113021850585938e-05 (-2.24330997467041 vs. -2.2432968616485596), which occurred at index 3.
Nov 03 18:08:21 
Nov 03 18:08:21 ======================================================================
Nov 03 18:08:21 FAIL [0.015s]: test_noncontiguous_samples_matmul_cpu_float32 (__main__.TestCommonCPU)
Nov 03 18:08:21 ----------------------------------------------------------------------
Nov 03 18:08:21 Traceback (most recent call last):
Nov 03 18:08:21   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 371, in instantiated_test
Nov 03 18:08:21     result = test(self, **param_kwargs)
Nov 03 18:08:21   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 737, in test_wrapper
Nov 03 18:08:21     return test(*args, **kwargs)
Nov 03 18:08:21   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 929, in only_fn
Nov 03 18:08:21     return fn(self, *args, **kwargs)
Nov 03 18:08:21   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
Nov 03 18:08:21     fn(*args, **kwargs)

See GitHub Actions build linux-xenial-py3.6-gcc7 / test (default, 1, 2, linux.2xlarge) (2/7)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2021-11-03T17:38:56.3879186Z FAIL [0.013s]: tes...amples_matmul_cpu_float32 (__main__.TestCommonCPU)
2021-11-03T17:38:56.3870749Z     return fn(self, *args, **kwargs)
2021-11-03T17:38:56.3871599Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
2021-11-03T17:38:56.3872273Z     fn(*args, **kwargs)
2021-11-03T17:38:56.3872769Z   File "test_ops.py", line 263, in test_noncontiguous_samples
2021-11-03T17:38:56.3873412Z     self.assertEqual(actual_grad, expected_grad)
2021-11-03T17:38:56.3874385Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1903, in assertEqual
2021-11-03T17:38:56.3875267Z     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
2021-11-03T17:38:56.3877192Z AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 1 element(s) (out of 10) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.3113021850585938e-05 (-2.24330997467041 vs. -2.2432968616485596), which occurred at index 3.
2021-11-03T17:38:56.3878288Z 
2021-11-03T17:38:56.3878587Z ======================================================================
2021-11-03T17:38:56.3879186Z FAIL [0.013s]: test_noncontiguous_samples_matmul_cpu_float32 (__main__.TestCommonCPU)
2021-11-03T17:38:56.3880017Z ----------------------------------------------------------------------
2021-11-03T17:38:56.3880526Z Traceback (most recent call last):
2021-11-03T17:38:56.3881480Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 371, in instantiated_test
2021-11-03T17:38:56.3882278Z     result = test(self, **param_kwargs)
2021-11-03T17:38:56.3883391Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 737, in test_wrapper
2021-11-03T17:38:56.3884138Z     return test(*args, **kwargs)
2021-11-03T17:38:56.3884987Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 929, in only_fn
2021-11-03T17:38:56.3885804Z     return fn(self, *args, **kwargs)
2021-11-03T17:38:56.3886674Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
2021-11-03T17:38:56.3887336Z     fn(*args, **kwargs)

See GitHub Actions build parallelnative-linux-xenial-py3.6-gcc5.4 / test (default, 1, 1, linux.2xlarge) (3/7)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2021-11-03T17:53:56.2652839Z FAIL [0.016s]: tes...amples_matmul_cpu_float32 (__main__.TestCommonCPU)
2021-11-03T17:53:56.2644084Z     return fn(self, *args, **kwargs)
2021-11-03T17:53:56.2645251Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
2021-11-03T17:53:56.2646356Z     fn(*args, **kwargs)
2021-11-03T17:53:56.2647184Z   File "test_ops.py", line 263, in test_noncontiguous_samples
2021-11-03T17:53:56.2647855Z     self.assertEqual(actual_grad, expected_grad)
2021-11-03T17:53:56.2648783Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1903, in assertEqual
2021-11-03T17:53:56.2649559Z     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
2021-11-03T17:53:56.2651058Z AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 1 element(s) (out of 10) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.3113021850585938e-05 (-2.24330997467041 vs. -2.2432968616485596), which occurred at index 3.
2021-11-03T17:53:56.2652022Z 
2021-11-03T17:53:56.2652297Z ======================================================================
2021-11-03T17:53:56.2652839Z FAIL [0.016s]: test_noncontiguous_samples_matmul_cpu_float32 (__main__.TestCommonCPU)
2021-11-03T17:53:56.2653595Z ----------------------------------------------------------------------
2021-11-03T17:53:56.2654068Z Traceback (most recent call last):
2021-11-03T17:53:56.2654907Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 371, in instantiated_test
2021-11-03T17:53:56.2655598Z     result = test(self, **param_kwargs)
2021-11-03T17:53:56.2656390Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 737, in test_wrapper
2021-11-03T17:53:56.2657037Z     return test(*args, **kwargs)
2021-11-03T17:53:56.2657791Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 929, in only_fn
2021-11-03T17:53:56.2658422Z     return fn(self, *args, **kwargs)
2021-11-03T17:53:56.2659178Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
2021-11-03T17:53:56.2659742Z     fn(*args, **kwargs)

See GitHub Actions build linux-xenial-py3-clang5-mobile-custom-build-dynamic / build (4/7)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

2021-11-05T10:20:12.1240598Z   echo "ERR...t available for the merge-base of your branch"
2021-11-05T10:20:12.1234879Z fi
2021-11-05T10:20:12.1235309Z # Covers the case where a previous tag doesn't exist for the tree
2021-11-05T10:20:12.1235995Z # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly
2021-11-05T10:20:12.1236654Z if ! git rev-parse "$MERGE_BASE:.circleci/docker"; then
2021-11-05T10:20:12.1237422Z   echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit"
2021-11-05T10:20:12.1237990Z   exit 1
2021-11-05T10:20:12.1238267Z fi
2021-11-05T10:20:12.1238697Z PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker")
2021-11-05T10:20:12.1239357Z # If no image exists but the hash is the same as the previous hash then we should error out here
2021-11-05T10:20:12.1239950Z if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then
2021-11-05T10:20:12.1240598Z   echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch"
2021-11-05T10:20:12.1241304Z   echo "       contact the PyTorch team to restore the original images"
2021-11-05T10:20:12.1241743Z   exit 1
2021-11-05T10:20:12.1242006Z fi
2021-11-05T10:20:12.1242363Z echo ::set-output name=rebuild::yes
2021-11-05T10:20:12.1252088Z shell: /usr/bin/bash -e {0}
2021-11-05T10:20:12.1252390Z env:
2021-11-05T10:20:12.1253176Z   BUILD_ENVIRONMENT: linux-xenial-py3-clang5-mobile-custom-build-dynamic
2021-11-05T10:20:12.1254644Z   DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c
2021-11-05T10:20:12.1256097Z   SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
2021-11-05T10:20:12.1257069Z   XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla

See GitHub Actions build linux-bionic-py3.6-clang9 / test (default, 1, 2, linux.2xlarge) (5/7)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2021-11-03T17:40:44.5696596Z FAIL [0.012s]: tes...amples_matmul_cpu_float32 (__main__.TestCommonCPU)
2021-11-03T17:40:44.5689303Z     return fn(self, *args, **kwargs)
2021-11-03T17:40:44.5690085Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
2021-11-03T17:40:44.5690666Z     fn(*args, **kwargs)
2021-11-03T17:40:44.5691094Z   File "test_ops.py", line 263, in test_noncontiguous_samples
2021-11-03T17:40:44.5691646Z     self.assertEqual(actual_grad, expected_grad)
2021-11-03T17:40:44.5692473Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1903, in assertEqual
2021-11-03T17:40:44.5693222Z     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
2021-11-03T17:40:44.5694730Z AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 1 element(s) (out of 10) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.3113021850585938e-05 (-2.24330997467041 vs. -2.2432968616485596), which occurred at index 3.
2021-11-03T17:40:44.5695644Z 
2021-11-03T17:40:44.5695914Z ======================================================================
2021-11-03T17:40:44.5696596Z FAIL [0.012s]: test_noncontiguous_samples_matmul_cpu_float32 (__main__.TestCommonCPU)
2021-11-03T17:40:44.5697335Z ----------------------------------------------------------------------
2021-11-03T17:40:44.5697789Z Traceback (most recent call last):
2021-11-03T17:40:44.5698809Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 371, in instantiated_test
2021-11-03T17:40:44.5699496Z     result = test(self, **param_kwargs)
2021-11-03T17:40:44.5700285Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 737, in test_wrapper
2021-11-03T17:40:44.5700908Z     return test(*args, **kwargs)
2021-11-03T17:40:44.5701641Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 929, in only_fn
2021-11-03T17:40:44.5702264Z     return fn(self, *args, **kwargs)
2021-11-03T17:40:44.5703094Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
2021-11-03T17:40:44.5703655Z     fn(*args, **kwargs)

See GitHub Actions build linux-bionic-py3.6-clang9 / test (noarch, 1, 1, linux.2xlarge) (6/7)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2021-11-03T17:53:10.6138204Z FAIL [0.013s]: tes...amples_matmul_cpu_float32 (__main__.TestCommonCPU)
2021-11-03T17:53:10.6130943Z     return fn(self, *args, **kwargs)
2021-11-03T17:53:10.6131694Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
2021-11-03T17:53:10.6132257Z     fn(*args, **kwargs)
2021-11-03T17:53:10.6132692Z   File "test_ops.py", line 263, in test_noncontiguous_samples
2021-11-03T17:53:10.6133277Z     self.assertEqual(actual_grad, expected_grad)
2021-11-03T17:53:10.6134116Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1903, in assertEqual
2021-11-03T17:53:10.6134879Z     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
2021-11-03T17:53:10.6136483Z AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 1 element(s) (out of 10) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.3113021850585938e-05 (-2.24330997467041 vs. -2.2432968616485596), which occurred at index 3.
2021-11-03T17:53:10.6137397Z 
2021-11-03T17:53:10.6137678Z ======================================================================
2021-11-03T17:53:10.6138204Z FAIL [0.013s]: test_noncontiguous_samples_matmul_cpu_float32 (__main__.TestCommonCPU)
2021-11-03T17:53:10.6138927Z ----------------------------------------------------------------------
2021-11-03T17:53:10.6139395Z Traceback (most recent call last):
2021-11-03T17:53:10.6140227Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 371, in instantiated_test
2021-11-03T17:53:10.6140980Z     result = test(self, **param_kwargs)
2021-11-03T17:53:10.6141784Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 737, in test_wrapper
2021-11-03T17:53:10.6142404Z     return test(*args, **kwargs)
2021-11-03T17:53:10.6143156Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 929, in only_fn
2021-11-03T17:53:10.6143782Z     return fn(self, *args, **kwargs)
2021-11-03T17:53:10.6144513Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
2021-11-03T17:53:10.6145084Z     fn(*args, **kwargs)

See GitHub Actions build linux-xenial-py3.6-gcc5.4 / test (default, 1, 2, linux.2xlarge) (7/7)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2021-11-03T17:37:53.2721884Z FAIL [0.014s]: tes...amples_matmul_cpu_float32 (__main__.TestCommonCPU)
2021-11-03T17:37:53.2713232Z     return fn(self, *args, **kwargs)
2021-11-03T17:37:53.2714773Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
2021-11-03T17:37:53.2715781Z     fn(*args, **kwargs)
2021-11-03T17:37:53.2716218Z   File "test_ops.py", line 263, in test_noncontiguous_samples
2021-11-03T17:37:53.2716777Z     self.assertEqual(actual_grad, expected_grad)
2021-11-03T17:37:53.2717669Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1903, in assertEqual
2021-11-03T17:37:53.2718435Z     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
2021-11-03T17:37:53.2719940Z AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 1 element(s) (out of 10) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.3113021850585938e-05 (-2.24330997467041 vs. -2.2432968616485596), which occurred at index 3.
2021-11-03T17:37:53.2721077Z 
2021-11-03T17:37:53.2721346Z ======================================================================
2021-11-03T17:37:53.2721884Z FAIL [0.014s]: test_noncontiguous_samples_matmul_cpu_float32 (__main__.TestCommonCPU)
2021-11-03T17:37:53.2722629Z ----------------------------------------------------------------------
2021-11-03T17:37:53.2723085Z Traceback (most recent call last):
2021-11-03T17:37:53.2723922Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 371, in instantiated_test
2021-11-03T17:37:53.2724671Z     result = test(self, **param_kwargs)
2021-11-03T17:37:53.2725471Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 737, in test_wrapper
2021-11-03T17:37:53.2726108Z     return test(*args, **kwargs)
2021-11-03T17:37:53.2726861Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 929, in only_fn
2021-11-03T17:37:53.2727474Z     return fn(self, *args, **kwargs)
2021-11-03T17:37:53.2728227Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
2021-11-03T17:37:53.2728793Z     fn(*args, **kwargs)

This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the (internal) Dr. CI Users group.

lezcano added a commit that referenced this pull request Sep 1, 2021
ghstack-source-id: f654d1d
Pull Request resolved: #64387
@lezcano requested review from ngimel and removed the request for IvanYashchuk and nikitaved, September 1, 2021
@lezcano added the module: linear algebra label, Sep 1, 2021
lezcano added a commit that referenced this pull request Sep 2, 2021
ghstack-source-id: ef79746
Pull Request resolved: #64387
@mruberry added the module: performance label, Sep 2, 2021
lezcano added a commit that referenced this pull request Sep 7, 2021
ghstack-source-id: 1e996bb
Pull Request resolved: #64387
@lezcano (Collaborator, Author) commented Sep 7, 2021

The following script:

```python
from torch.utils.benchmark import Timer
import timeit

timer = Timer(
    "torch::matmul(m1, m2);",
    setup="at::Tensor m1=torch::zeros({1,1,1});at::Tensor m2=torch::zeros({1,1,1});",
    language="c++",
    timer=timeit.default_timer,
)
stats = timer.collect_callgrind(number=30, repeats=3)
print(stats[1].as_standardized().stats(inclusive=False))
```

Reports an instruction count of 951003 on master and of 872341 on this PR.

Using shapes (1, 1, 1) and (1, 1) it goes from 472891 (master) to 459918 (this PR).

@lezcano (Collaborator, Author) commented Sep 8, 2021

A more thorough instruction-count benchmark:

| Shapes | master | This PR |
| --- | --- | --- |
| (1,1,1), (1,1,1) | 951003 | 870151 |
| (1,1,1), (1,1) | 472891 | 457008 |
| (1,1,1), (1,) | 585042 | 426104 |
| (1,1), (1,1,1) | 871951 | 791223 |
| (1,), (1,1,1) | 1020331 | 566204 |
| (1,1), (1,1) | 251204 | 249914 |
| (1,1), (1,) | 222644 | 221354 |
| (1,), (1,1) | 436218 | 354948 |
| (1,), (1,) | 195584 | 194324 |

Note in particular the ~45% drop in instruction count when the rhs is a batch of matrices and the lhs is a vector ((1,), (1,1,1): 1020331 → 566204).

The reduction for the out= variant is likely higher, but I did not benchmark that branch, as it is not used nearly as much as the main one.

This PR implements a number of general principles to avoid unnecessary computation:

  • Make the code `const`-correct
  • Prefer `DimVector` over `std::vector`
  • Perform fewer operations on `DimVector`s
  • Call `mv` rather than `mm` whenever possible
  • Call `native::mv_out` to avoid a dispatch
  • Work with the optional value from `out` directly to avoid a ref bump
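
To make the `mv`-for-`mm` substitution concrete, a minimal standalone check (my example, not the PR's code) that both paths agree in the `(n,) x (n, m)` case:

```cpp
// Hedged sketch: for a 1-D lhs v of shape (n,) and a 2-D rhs M of shape (n, m),
// v @ M == mv(M^T, v), so matmul can call mv instead of unsqueezing v to (1, n)
// and going through mm.
#include <ATen/ATen.h>
#include <iostream>

int main() {
  const auto v = at::randn({5});
  const auto M = at::randn({5, 3});
  const auto via_mm = at::mm(v.unsqueeze(0), M).squeeze(0);  // old path
  const auto via_mv = at::mv(M.t(), v);                      // optimised path
  std::cout << via_mm.allclose(via_mv) << '\n';              // prints "1"
  return 0;
}
```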

This is ready for review @ngimel

lezcano added a commit that referenced this pull request Sep 8, 2021
ghstack-source-id: 4aa9e4a
Pull Request resolved: #64387
lezcano added a commit that referenced this pull request Sep 9, 2021
ghstack-source-id: ec7134d
Pull Request resolved: #64387
@ngimel (Collaborator) left a comment


Sorry for the delay with the review. This looks great; I have some small questions about testing.

```cpp
  const Tensor output = at::_unsafe_view(at::mm_out(*out_opt, t1, tensor2), output_size);
  return out_opt->set_(output);
} else {
  const Tensor output = at::_unsafe_view(at::native::mv_out(t1, tensor2, *out_opt), output_size);
```
@ngimel (Collaborator):

ugh, `native::mv_out` here and `at::mm_out` a couple of lines above is pretty confusing (not to mention their different order of the `out` argument)

@lezcano (Collaborator, Author):
I know... There is no `native::mm_out`, so I could not call it directly...

```python
assertEqual(ans, expected)

out = torch.zeros(*shape, dtype=torch.int64).to(x.device)
out = torch.empty((0,), dtype=x.dtype, device=x.device)
```
@ngimel (Collaborator):

To make the tests more robust, you probably want to allocate `out` with the correct shape, filled with NaNs, but you got rid of the shape here, so I don't know how feasible that is.

@lezcano (Collaborator, Author):
Luckily we compute the non-out version just before, so we have the shapes there :)
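
A sketch of the pattern being discussed (hypothetical operands, not the actual `test_ops.py` code): allocate `out` with the expected shape but poisoned with NaNs, so a kernel that fails to overwrite every element is caught:

```cpp
// Hedged sketch of the suggested test pattern; the operands are made up.
#include <ATen/ATen.h>
#include <limits>

int main() {
  const auto a = at::randn({2, 3});
  const auto b = at::randn({3, 4});
  const auto expected = at::matmul(a, b);  // non-out result, computed first
  auto out = at::full_like(expected,       // correct shape, poisoned values
                           std::numeric_limits<double>::quiet_NaN());
  at::matmul_out(out, a, b);               // the kernel must overwrite every element
  TORCH_CHECK(out.allclose(expected));
  return 0;
}
```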

lezcano added a commit that referenced this pull request May 11, 2022
…ic boogaloo"


This PR implements the bulk of #64387

Some of the optimisations were already merged in #72230.

A number of these optimisations include:
- Make the code `const` correct.
- Create `DimVector`'s more efficiently (e.g. prefer `append` over
`insert`).
- Access sizes of the tensors via `sizes().front()` / `sizes().back()`
  / `sizes().end()[-2]`
- Do not create intermediary tensors / vectors when it can be avoided.
- Call `reshape` rather than `expect_contiguous`  + `view`

On top of these, it fixes a correctness issue in `matmul_out`, where the
out parameter was not resized correctly when passed to the backends.
This involves removing the use of `set_` from the calling code, as
requested by ezyang, and it accounts for most of the complexity of the
code that this PR adds.

[ghstack-poisoned]
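
To make the `matmul_out` resize fix concrete, a minimal sketch of the contract being enforced (a hypothetical helper, not the PR's implementation):

```cpp
// Hedged sketch: an out= kernel should resize the caller's tensor in place and
// write into it, rather than swapping its storage with set_().
#include <ATen/ATen.h>

at::Tensor& sketch_matmul_out(const at::Tensor& a, const at::Tensor& b, at::Tensor& out) {
  const auto result = at::matmul(a, b);  // reference result, for the sketch only
  out.resize_(result.sizes());           // resize out to the result's shape
  out.copy_(result);                     // no set_(): out keeps its storage identity
  return out;
}
```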
lezcano added a commit that referenced this pull request May 12, 2022
lezcano added a commit that referenced this pull request May 13, 2022
lezcano added a commit that referenced this pull request May 17, 2022
lezcano added a commit that referenced this pull request May 18, 2022
pytorchmergebot pushed a commit that referenced this pull request May 18, 2022
Pull Request resolved: #75197

Approved by: https://github.com/mruberry
facebook-github-bot pushed a commit that referenced this pull request May 20, 2022
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/ddb2eb7aee3b6812dde7bfc32fc0f26f5e916e6a

Reviewed By: seemethere

Differential Revision: D36494188

Pulled By: seemethere

fbshipit-source-id: 0f5270ceb14286bcc71e977980edfcea637625ba
Labels: cla signed, module: linear algebra, module: performance, open source, with-ssh