
Conversation

@lezcano (Collaborator) commented Sep 1, 2021

Stack from ghstack:

A number of these optimisations include:

  • Always prefer `DimVector` over `std::vector` when handling dimensions.
  • Make the code `const` correct.
  • Create `DimVector`s more efficiently (e.g. prefer `append` over `insert`).
  • Access sizes of the tensors via `sizes().front()` / `sizes().back()` / `sizes().end()[-2]`.
  • Do not create intermediary tensors / vectors when it can be avoided.
  • Dispatch to `mv` rather than `mm` in the case `(n,) x (n, m)` (see the check further down the thread).
  • Call `reshape` rather than `expect_contiguous` + `view`.
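
To make the `DimVector` items above concrete, here is a minimal sketch of the pattern (the helper name and the rank >= 2 assumption are mine, not the PR's actual code):

```cpp
// Hedged sketch of the DimVector idioms from the list above. The helper name
// is made up and assumes both inputs have rank >= 2; it is not the PR's code.
#include <ATen/ATen.h>

at::DimVector matmul_output_size(const at::Tensor& t1, const at::Tensor& t2) {
  const auto s1 = t1.sizes();            // IntArrayRef: a view over the sizes, no copy
  const auto s2 = t2.sizes();
  at::DimVector out;                     // small-vector: no heap allocation for common ranks
  out.append(s1.begin(), s1.end() - 2);  // batch dims in one shot: append, not insert
  out.push_back(s1.end()[-2]);           // rows of t1
  out.push_back(s2.back());              // cols of t2
  return out;
}
```

`DimVector` is a `SmallVector<int64_t, ...>`, so for the ranks `matmul` typically sees the whole shape lives on the stack, which is where the win over `std::vector` comes from.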

On top of this, while fixing the CI and after further discussion, the following changes were made:

  • [Fix CI] Add forward AD to `mv`, `matmul`, `__rmatmul__`.
  • [Fix CI] Add support for `Half` to a number of operations.
  • [Fix CI] Change some calls that went directly into the matmul implementation; they now call into the `_out` variant.
  • Remove the uses of `set_` (requested by @ezyang).
  • Fix the resize bug in `matmul_out`.

Looking ahead, further optimisations could include dispatching to lower-level functions
when one of the inputs has shape `(n, 1)` or `(1, n)`, as sketched below.
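
A sketch of the algebraic identities such a dispatch could exploit (my example, using public ATen calls; treating `at::outer` as available is an assumption):

```cpp
// Hedged sketch: a (n, 1) @ (1, m) product is an outer product, and a
// (1, n) @ (n, m) product is a vector-matrix product expressible via mv.
#include <ATen/ATen.h>
#include <iostream>

int main() {
  const auto a = at::randn({4, 1});
  const auto b = at::randn({1, 6});
  const auto c = at::randn({6, 3});
  const auto as_outer = at::outer(a.squeeze(1), b.squeeze(0));   // == at::mm(a, b)
  const auto as_mv    = at::mv(c.t(), b.squeeze(0));             // == at::mm(b, c).squeeze(0)
  std::cout << as_outer.allclose(at::mm(a, b)) << ' '
            << as_mv.allclose(at::mm(b, c).squeeze(0)) << '\n';  // prints "1 1"
  return 0;
}
```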

Fixes #67767

cc @VitalyFedyunin @ngimel @jianyuh @nikitaved @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @lezcano @heitorschueroff

@facebook-github-bot (Contributor) commented Sep 1, 2021

💊 CI failures summary and remediations

As of commit 7b18adf (more details on the Dr. CI page):


  • 7/7 failures introduced in this PR

🕵️ 7 new failures recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_linux_xenial_py3_6_gcc5_4_test (1/7)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Nov 03 18:08:21 FAIL [0.015s]: test_noncontiguo...amples_matmul_cpu_float32 (__main__.TestCommonCPU)
Nov 03 18:08:21     return fn(self, *args, **kwargs)
Nov 03 18:08:21   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
Nov 03 18:08:21     fn(*args, **kwargs)
Nov 03 18:08:21   File "test_ops.py", line 263, in test_noncontiguous_samples
Nov 03 18:08:21     self.assertEqual(actual_grad, expected_grad)
Nov 03 18:08:21   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1903, in assertEqual
Nov 03 18:08:21     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
Nov 03 18:08:21 AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 1 element(s) (out of 10) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.3113021850585938e-05 (-2.24330997467041 vs. -2.2432968616485596), which occurred at index 3.
Nov 03 18:08:21 
Nov 03 18:08:21 ======================================================================
Nov 03 18:08:21 FAIL [0.015s]: test_noncontiguous_samples_matmul_cpu_float32 (__main__.TestCommonCPU)
Nov 03 18:08:21 ----------------------------------------------------------------------
Nov 03 18:08:21 Traceback (most recent call last):
Nov 03 18:08:21   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 371, in instantiated_test
Nov 03 18:08:21     result = test(self, **param_kwargs)
Nov 03 18:08:21   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 737, in test_wrapper
Nov 03 18:08:21     return test(*args, **kwargs)
Nov 03 18:08:21   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 929, in only_fn
Nov 03 18:08:21     return fn(self, *args, **kwargs)
Nov 03 18:08:21   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
Nov 03 18:08:21     fn(*args, **kwargs)

See GitHub Actions build linux-xenial-py3.6-gcc7 / test (default, 1, 2, linux.2xlarge) (2/7)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2021-11-03T17:38:56.3879186Z FAIL [0.013s]: tes...amples_matmul_cpu_float32 (__main__.TestCommonCPU)
2021-11-03T17:38:56.3870749Z     return fn(self, *args, **kwargs)
2021-11-03T17:38:56.3871599Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
2021-11-03T17:38:56.3872273Z     fn(*args, **kwargs)
2021-11-03T17:38:56.3872769Z   File "test_ops.py", line 263, in test_noncontiguous_samples
2021-11-03T17:38:56.3873412Z     self.assertEqual(actual_grad, expected_grad)
2021-11-03T17:38:56.3874385Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1903, in assertEqual
2021-11-03T17:38:56.3875267Z     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
2021-11-03T17:38:56.3877192Z AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 1 element(s) (out of 10) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.3113021850585938e-05 (-2.24330997467041 vs. -2.2432968616485596), which occurred at index 3.
2021-11-03T17:38:56.3878288Z 
2021-11-03T17:38:56.3878587Z ======================================================================
2021-11-03T17:38:56.3879186Z FAIL [0.013s]: test_noncontiguous_samples_matmul_cpu_float32 (__main__.TestCommonCPU)
2021-11-03T17:38:56.3880017Z ----------------------------------------------------------------------
2021-11-03T17:38:56.3880526Z Traceback (most recent call last):
2021-11-03T17:38:56.3881480Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 371, in instantiated_test
2021-11-03T17:38:56.3882278Z     result = test(self, **param_kwargs)
2021-11-03T17:38:56.3883391Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 737, in test_wrapper
2021-11-03T17:38:56.3884138Z     return test(*args, **kwargs)
2021-11-03T17:38:56.3884987Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 929, in only_fn
2021-11-03T17:38:56.3885804Z     return fn(self, *args, **kwargs)
2021-11-03T17:38:56.3886674Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
2021-11-03T17:38:56.3887336Z     fn(*args, **kwargs)

See GitHub Actions build parallelnative-linux-xenial-py3.6-gcc5.4 / test (default, 1, 1, linux.2xlarge) (3/7)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2021-11-03T17:53:56.2652839Z FAIL [0.016s]: tes...amples_matmul_cpu_float32 (__main__.TestCommonCPU)
2021-11-03T17:53:56.2644084Z     return fn(self, *args, **kwargs)
2021-11-03T17:53:56.2645251Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
2021-11-03T17:53:56.2646356Z     fn(*args, **kwargs)
2021-11-03T17:53:56.2647184Z   File "test_ops.py", line 263, in test_noncontiguous_samples
2021-11-03T17:53:56.2647855Z     self.assertEqual(actual_grad, expected_grad)
2021-11-03T17:53:56.2648783Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1903, in assertEqual
2021-11-03T17:53:56.2649559Z     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
2021-11-03T17:53:56.2651058Z AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 1 element(s) (out of 10) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.3113021850585938e-05 (-2.24330997467041 vs. -2.2432968616485596), which occurred at index 3.
2021-11-03T17:53:56.2652022Z 
2021-11-03T17:53:56.2652297Z ======================================================================
2021-11-03T17:53:56.2652839Z FAIL [0.016s]: test_noncontiguous_samples_matmul_cpu_float32 (__main__.TestCommonCPU)
2021-11-03T17:53:56.2653595Z ----------------------------------------------------------------------
2021-11-03T17:53:56.2654068Z Traceback (most recent call last):
2021-11-03T17:53:56.2654907Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 371, in instantiated_test
2021-11-03T17:53:56.2655598Z     result = test(self, **param_kwargs)
2021-11-03T17:53:56.2656390Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 737, in test_wrapper
2021-11-03T17:53:56.2657037Z     return test(*args, **kwargs)
2021-11-03T17:53:56.2657791Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 929, in only_fn
2021-11-03T17:53:56.2658422Z     return fn(self, *args, **kwargs)
2021-11-03T17:53:56.2659178Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
2021-11-03T17:53:56.2659742Z     fn(*args, **kwargs)

See GitHub Actions build linux-xenial-py3-clang5-mobile-custom-build-dynamic / build (4/7)

Step: "Build" (full log | diagnosis details | 🔁 rerun)

2021-11-05T10:20:12.1240598Z   echo "ERR...t available for the merge-base of your branch"
2021-11-05T10:20:12.1234879Z fi
2021-11-05T10:20:12.1235309Z # Covers the case where a previous tag doesn't exist for the tree
2021-11-05T10:20:12.1235995Z # this is only really applicable on trees that don't have `.circleci/docker` at its merge base, i.e. nightly
2021-11-05T10:20:12.1236654Z if ! git rev-parse "$MERGE_BASE:.circleci/docker"; then
2021-11-05T10:20:12.1237422Z   echo "Directory '.circleci/docker' not found in commit $MERGE_BASE, you should probably rebase onto a more recent commit"
2021-11-05T10:20:12.1237990Z   exit 1
2021-11-05T10:20:12.1238267Z fi
2021-11-05T10:20:12.1238697Z PREVIOUS_DOCKER_TAG=$(git rev-parse "$MERGE_BASE:.circleci/docker")
2021-11-05T10:20:12.1239357Z # If no image exists but the hash is the same as the previous hash then we should error out here
2021-11-05T10:20:12.1239950Z if [[ "${PREVIOUS_DOCKER_TAG}" = "${DOCKER_TAG}" ]]; then
2021-11-05T10:20:12.1240598Z   echo "ERROR: Something has gone wrong and the previous image isn't available for the merge-base of your branch"
2021-11-05T10:20:12.1241304Z   echo "       contact the PyTorch team to restore the original images"
2021-11-05T10:20:12.1241743Z   exit 1
2021-11-05T10:20:12.1242006Z fi
2021-11-05T10:20:12.1242363Z echo ::set-output name=rebuild::yes
2021-11-05T10:20:12.1252088Z shell: /usr/bin/bash -e {0}
2021-11-05T10:20:12.1252390Z env:
2021-11-05T10:20:12.1253176Z   BUILD_ENVIRONMENT: linux-xenial-py3-clang5-mobile-custom-build-dynamic
2021-11-05T10:20:12.1254644Z   DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c
2021-11-05T10:20:12.1256097Z   SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
2021-11-05T10:20:12.1257069Z   XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla

See GitHub Actions build linux-bionic-py3.6-clang9 / test (default, 1, 2, linux.2xlarge) (5/7)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2021-11-03T17:40:44.5696596Z FAIL [0.012s]: tes...amples_matmul_cpu_float32 (__main__.TestCommonCPU)
2021-11-03T17:40:44.5689303Z     return fn(self, *args, **kwargs)
2021-11-03T17:40:44.5690085Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
2021-11-03T17:40:44.5690666Z     fn(*args, **kwargs)
2021-11-03T17:40:44.5691094Z   File "test_ops.py", line 263, in test_noncontiguous_samples
2021-11-03T17:40:44.5691646Z     self.assertEqual(actual_grad, expected_grad)
2021-11-03T17:40:44.5692473Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1903, in assertEqual
2021-11-03T17:40:44.5693222Z     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
2021-11-03T17:40:44.5694730Z AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 1 element(s) (out of 10) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.3113021850585938e-05 (-2.24330997467041 vs. -2.2432968616485596), which occurred at index 3.
2021-11-03T17:40:44.5695644Z 
2021-11-03T17:40:44.5695914Z ======================================================================
2021-11-03T17:40:44.5696596Z FAIL [0.012s]: test_noncontiguous_samples_matmul_cpu_float32 (__main__.TestCommonCPU)
2021-11-03T17:40:44.5697335Z ----------------------------------------------------------------------
2021-11-03T17:40:44.5697789Z Traceback (most recent call last):
2021-11-03T17:40:44.5698809Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 371, in instantiated_test
2021-11-03T17:40:44.5699496Z     result = test(self, **param_kwargs)
2021-11-03T17:40:44.5700285Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 737, in test_wrapper
2021-11-03T17:40:44.5700908Z     return test(*args, **kwargs)
2021-11-03T17:40:44.5701641Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 929, in only_fn
2021-11-03T17:40:44.5702264Z     return fn(self, *args, **kwargs)
2021-11-03T17:40:44.5703094Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
2021-11-03T17:40:44.5703655Z     fn(*args, **kwargs)

See GitHub Actions build linux-bionic-py3.6-clang9 / test (noarch, 1, 1, linux.2xlarge) (6/7)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2021-11-03T17:53:10.6138204Z FAIL [0.013s]: tes...amples_matmul_cpu_float32 (__main__.TestCommonCPU)
2021-11-03T17:53:10.6130943Z     return fn(self, *args, **kwargs)
2021-11-03T17:53:10.6131694Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
2021-11-03T17:53:10.6132257Z     fn(*args, **kwargs)
2021-11-03T17:53:10.6132692Z   File "test_ops.py", line 263, in test_noncontiguous_samples
2021-11-03T17:53:10.6133277Z     self.assertEqual(actual_grad, expected_grad)
2021-11-03T17:53:10.6134116Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1903, in assertEqual
2021-11-03T17:53:10.6134879Z     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
2021-11-03T17:53:10.6136483Z AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 1 element(s) (out of 10) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.3113021850585938e-05 (-2.24330997467041 vs. -2.2432968616485596), which occurred at index 3.
2021-11-03T17:53:10.6137397Z 
2021-11-03T17:53:10.6137678Z ======================================================================
2021-11-03T17:53:10.6138204Z FAIL [0.013s]: test_noncontiguous_samples_matmul_cpu_float32 (__main__.TestCommonCPU)
2021-11-03T17:53:10.6138927Z ----------------------------------------------------------------------
2021-11-03T17:53:10.6139395Z Traceback (most recent call last):
2021-11-03T17:53:10.6140227Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 371, in instantiated_test
2021-11-03T17:53:10.6140980Z     result = test(self, **param_kwargs)
2021-11-03T17:53:10.6141784Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 737, in test_wrapper
2021-11-03T17:53:10.6142404Z     return test(*args, **kwargs)
2021-11-03T17:53:10.6143156Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 929, in only_fn
2021-11-03T17:53:10.6143782Z     return fn(self, *args, **kwargs)
2021-11-03T17:53:10.6144513Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
2021-11-03T17:53:10.6145084Z     fn(*args, **kwargs)

See GitHub Actions build linux-xenial-py3.6-gcc5.4 / test (default, 1, 2, linux.2xlarge) (7/7)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2021-11-03T17:37:53.2721884Z FAIL [0.014s]: tes...amples_matmul_cpu_float32 (__main__.TestCommonCPU)
2021-11-03T17:37:53.2713232Z     return fn(self, *args, **kwargs)
2021-11-03T17:37:53.2714773Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
2021-11-03T17:37:53.2715781Z     fn(*args, **kwargs)
2021-11-03T17:37:53.2716218Z   File "test_ops.py", line 263, in test_noncontiguous_samples
2021-11-03T17:37:53.2716777Z     self.assertEqual(actual_grad, expected_grad)
2021-11-03T17:37:53.2717669Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1903, in assertEqual
2021-11-03T17:37:53.2718435Z     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
2021-11-03T17:37:53.2719940Z AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 1 element(s) (out of 10) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 1.3113021850585938e-05 (-2.24330997467041 vs. -2.2432968616485596), which occurred at index 3.
2021-11-03T17:37:53.2721077Z 
2021-11-03T17:37:53.2721346Z ======================================================================
2021-11-03T17:37:53.2721884Z FAIL [0.014s]: test_noncontiguous_samples_matmul_cpu_float32 (__main__.TestCommonCPU)
2021-11-03T17:37:53.2722629Z ----------------------------------------------------------------------
2021-11-03T17:37:53.2723085Z Traceback (most recent call last):
2021-11-03T17:37:53.2723922Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 371, in instantiated_test
2021-11-03T17:37:53.2724671Z     result = test(self, **param_kwargs)
2021-11-03T17:37:53.2725471Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 737, in test_wrapper
2021-11-03T17:37:53.2726108Z     return test(*args, **kwargs)
2021-11-03T17:37:53.2726861Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_device_type.py", line 929, in only_fn
2021-11-03T17:37:53.2727474Z     return fn(self, *args, **kwargs)
2021-11-03T17:37:53.2728227Z   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1041, in wrapper
2021-11-03T17:37:53.2728793Z     fn(*args, **kwargs)

This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the (internal) Dr. CI Users group.

lezcano added a commit that referenced this pull request Sep 1, 2021
ghstack-source-id: f654d1d
Pull Request resolved: #64387
@lezcano requested review from ngimel and removed the request for IvanYashchuk and nikitaved, September 1, 2021
@lezcano added the module: linear algebra label, Sep 1, 2021
lezcano added a commit that referenced this pull request Sep 2, 2021
ghstack-source-id: ef79746
Pull Request resolved: #64387
@mruberry added the module: performance label, Sep 2, 2021
lezcano added a commit that referenced this pull request Sep 7, 2021
ghstack-source-id: 1e996bb
Pull Request resolved: #64387
@lezcano (Collaborator, Author) commented Sep 7, 2021

The following script:

```python
from torch.utils.benchmark import Timer
import timeit

timer = Timer(
    "torch::matmul(m1, m2);",
    setup="at::Tensor m1=torch::zeros({1,1,1});at::Tensor m2=torch::zeros({1,1,1});",
    language="c++",
    timer=timeit.default_timer,
)
stats = timer.collect_callgrind(number=30, repeats=3)
print(stats[1].as_standardized().stats(inclusive=False))
```

Reports an instruction count of 951003 on master and of 872341 on this PR.

Using shapes (1, 1, 1) and (1, 1) it goes from 472891 (master) to 459918 (this PR).

@lezcano (Collaborator, Author) commented Sep 8, 2021

A more thorough instruction-count benchmark:

| Shapes | master | This PR |
| --- | --- | --- |
| (1,1,1), (1,1,1) | 951003 | 870151 |
| (1,1,1), (1,1) | 472891 | 457008 |
| (1,1,1), (1,) | 585042 | 426104 |
| (1,1), (1,1,1) | 871951 | 791223 |
| (1,), (1,1,1) | 1020331 | 566204 |
| (1,1), (1,1) | 251204 | 249914 |
| (1,1), (1,) | 222644 | 221354 |
| (1,), (1,1) | 436218 | 354948 |
| (1,), (1,) | 195584 | 194324 |

Note in particular the ~45% drop in instruction count when the rhs is a batch of matrices and the lhs is a vector ((1,), (1,1,1): 1020331 → 566204).

The reduction for the out= variant is likely higher, but I did not benchmark that branch, as it is not used nearly as much as the main one.

This PR implements a number of general principles to avoid unnecessary computation:

  • Make the code `const`-correct
  • Prefer `DimVector` over `std::vector`
  • Perform fewer operations on `DimVector`s
  • Call `mv` rather than `mm` whenever possible
  • Call `native::mv_out` to avoid a dispatch
  • Work with the optional value from `out` directly to avoid a ref bump
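
To make the `mv`-for-`mm` substitution concrete, a minimal standalone check (my example, not the PR's code) that both paths agree in the `(n,) x (n, m)` case:

```cpp
// Hedged sketch: for a 1-D lhs v of shape (n,) and a 2-D rhs M of shape (n, m),
// v @ M == mv(M^T, v), so matmul can call mv instead of unsqueezing v to (1, n)
// and going through mm.
#include <ATen/ATen.h>
#include <iostream>

int main() {
  const auto v = at::randn({5});
  const auto M = at::randn({5, 3});
  const auto via_mm = at::mm(v.unsqueeze(0), M).squeeze(0);  // old path
  const auto via_mv = at::mv(M.t(), v);                      // optimised path
  std::cout << via_mm.allclose(via_mv) << '\n';              // prints "1"
  return 0;
}
```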

This is ready for review @ngimel

lezcano added a commit that referenced this pull request Sep 8, 2021
ghstack-source-id: 4aa9e4a
Pull Request resolved: #64387
lezcano added a commit that referenced this pull request Sep 9, 2021
ghstack-source-id: ec7134d
Pull Request resolved: #64387
@ngimel (Collaborator) left a comment


Sorry for the delay with the review. This looks great; I have some small questions about testing.

```cpp
  const Tensor output = at::_unsafe_view(at::mm_out(*out_opt, t1, tensor2), output_size);
  return out_opt->set_(output);
} else {
  const Tensor output = at::_unsafe_view(at::native::mv_out(t1, tensor2, *out_opt), output_size);
```
@ngimel (Collaborator):

ugh, `native::mv_out` here and `at::mm_out` a couple of lines above is pretty confusing (not to mention their different order of the `out` argument)

@lezcano (Collaborator, Author):
I know... There is no `native::mm_out`, so I could not call it directly...

```python
assertEqual(ans, expected)

out = torch.zeros(*shape, dtype=torch.int64).to(x.device)
out = torch.empty((0,), dtype=x.dtype, device=x.device)
```
@ngimel (Collaborator):

To make the tests more robust, you probably want to allocate `out` with the correct shape, filled with NaNs, but you got rid of the shape here, so I don't know how feasible that is.

@lezcano (Collaborator, Author):
Luckily we compute the non-out version just before, so we have the shapes there :)
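
A sketch of the pattern being discussed (hypothetical operands, not the actual `test_ops.py` code): allocate `out` with the expected shape but poisoned with NaNs, so a kernel that fails to overwrite every element is caught:

```cpp
// Hedged sketch of the suggested test pattern; the operands are made up.
#include <ATen/ATen.h>
#include <limits>

int main() {
  const auto a = at::randn({2, 3});
  const auto b = at::randn({3, 4});
  const auto expected = at::matmul(a, b);  // non-out result, computed first
  auto out = at::full_like(expected,       // correct shape, poisoned values
                           std::numeric_limits<double>::quiet_NaN());
  at::matmul_out(out, a, b);               // the kernel must overwrite every element
  TORCH_CHECK(out.allclose(expected));
  return 0;
}
```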

lezcano added a commit that referenced this pull request May 11, 2022
…ic boogaloo"


This PR implements the bulk of #64387

Some of the optimisations were already merged in #72230.

A number of these optimisations include:
- Make the code `const` correct.
- Create `DimVector`'s more efficiently (e.g. prefer `append` over
`insert`).
- Access sizes of the tensors via `sizes().front()` / `sizes().back()`
  / `sizes().end()[-2]`
- Do not create intermediary tensors / vectors when it can be avoided.
- Call `reshape` rather than `expect_contiguous`  + `view`

On top of these, it fixes a correctness issue in `matmul_out`, where the
out parameter was not resized correctly when passed to the backends.
This involves removing the use of `set_` from the calling code, as
requested by ezyang, and it accounts for most of the complexity of the
code that this PR adds.

[ghstack-poisoned]
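
To make the `matmul_out` resize fix concrete, a minimal sketch of the contract being enforced (a hypothetical helper, not the PR's implementation):

```cpp
// Hedged sketch: an out= kernel should resize the caller's tensor in place and
// write into it, rather than swapping its storage with set_().
#include <ATen/ATen.h>

at::Tensor& sketch_matmul_out(const at::Tensor& a, const at::Tensor& b, at::Tensor& out) {
  const auto result = at::matmul(a, b);  // reference result, for the sketch only
  out.resize_(result.sizes());           // resize out to the result's shape
  out.copy_(result);                     // no set_(): out keeps its storage identity
  return out;
}
```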
lezcano added a commit that referenced this pull request May 12, 2022
lezcano added a commit that referenced this pull request May 13, 2022
lezcano added a commit that referenced this pull request May 17, 2022
lezcano added a commit that referenced this pull request May 18, 2022
pytorchmergebot pushed a commit that referenced this pull request May 18, 2022
Pull Request resolved: #75197

Approved by: https://github.com/mruberry
facebook-github-bot pushed a commit that referenced this pull request May 20, 2022
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/ddb2eb7aee3b6812dde7bfc32fc0f26f5e916e6a

Reviewed By: seemethere

Differential Revision: D36494188

Pulled By: seemethere

fbshipit-source-id: 0f5270ceb14286bcc71e977980edfcea637625ba
Labels: cla signed, module: linear algebra, module: performance, open source, with-ssh