Conversation

gchanan commented Jun 16, 2017

  1. Fixes pointwise fallback autograd shape tracking
    This refers to cases where the number of elements is the same but the shapes don't match, so broadcasting didn't happen (this behavior predates broadcasting). Autograd handled this correctly for add, sub, mul, div, but not for the other ops with the same behavior, e.g. addcmul, addcdiv, lerp, le, lt, ge, gt, ne, eq, min, max, masked_scatter. Also adds some shape-check code to test_autograd.

  2. Fixes broadcast autograd shape tracking
    This is the same issue as above, but for cases where broadcasting did occur. It covers the functions in 1), plus functions that broadcast but have no pointwise fallback, e.g. addmm, addmv, addbmm, baddbmm, addr (see the sketch after this list).

  3. Fixes other autograd shape tracking bugs
    expand and masked_scatter had shape-tracking bugs; these should now be fixed.

  4. Disallows div_ with tensor or variable parameter
    This now behaves the same as mul_; before, it would call DivConstant, which was incorrect.
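
A small illustration of fix (2), hedged to today's API (Tensor and Variable have since been merged), using addmm as an example of a broadcasting op without a pointwise fallback: after backward, each gradient should come back at its input's shape rather than at the broadcast output shape.

```python
import torch

# Sketch only: demonstrates the expected post-fix behaviour, not the PR's implementation.
bias = torch.randn(4, requires_grad=True)      # broadcast across the 3 output rows
mat1 = torch.randn(3, 5, requires_grad=True)
mat2 = torch.randn(5, 4, requires_grad=True)

torch.addmm(bias, mat1, mat2).sum().backward()

# The bias gradient is reduced back to the bias shape, not left at the (3, 4) output shape.
assert bias.grad.shape == torch.Size([4])
assert mat1.grad.shape == torch.Size([3, 5])
assert mat2.grad.shape == torch.Size([5, 4])
```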

gchanan commented Jun 16, 2017

This should fix #1787

Running the example from there with this PR:

>>> a = Variable(torch.randn(1), requires_grad=True)
>>> b = Variable(torch.randn(5,4), requires_grad=True)
>>> (a * b).sum().backward()
>>> print(a.size(), a.grad.size())
torch.Size([1]) torch.Size([1])
>>> print(b.size(), b.grad.size())
torch.Size([5, 4]) torch.Size([5, 4])
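
In other words, the value that lands in the [1]-shaped a.grad is the elementwise gradient summed back over the broadcast dimensions. A quick sanity check of that identity with today's merged Tensor/Variable API (a hedged sketch, not code from this PR):

```python
import torch

a = torch.randn(1, requires_grad=True)
b = torch.randn(5, 4, requires_grad=True)
(a * b).sum().backward()

# grad w.r.t. a: the elementwise grads (= b) summed down to a's shape
assert torch.allclose(a.grad, b.detach().sum().reshape(1))
# grad w.r.t. b: a broadcast up to b's shape
assert torch.allclose(b.grad, a.detach().expand_as(b))
```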

gchanan force-pushed the fix_expand_autograd branch from a0fbdf8 to 2c72b09 on June 16, 2017 at 20:39
gchanan commented Jun 16, 2017

The latest commit should fix #1813.

(Remainder, (), ((S, S, S), Variable(torch.randn(S, S, S) + 2.5, requires_grad=False)), 'tensor'),
(Remainder, (), ((S, S, S), Variable(torch.randn(S) + 2.5, requires_grad=False)), 'tensor_broadcast_rhs'),
(Remainder, (), ((S,), Variable(torch.randn(S, S, S) + 2.5, requires_grad=False)), 'tensor_broadcast_lhs'),
(Remainder, (), ((S, 1, S), Variable(torch.randn(S, S) + 2.5, requires_grad=False)), 'tensor_broadcast_all'),
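
For context, the three new tensor_broadcast_* variants exercise broadcasting of the right-hand side, the left-hand side, and both operands. A minimal stand-alone check of the same behaviour with today's tensor API (a sketch, not the test-suite machinery; the "+ 2.5" mirrors the entries above and keeps the divisor away from zero):

```python
import torch

S = 5
lhs = torch.randn(S, 1, S, dtype=torch.double, requires_grad=True)
rhs = torch.randn(S, S, dtype=torch.double) + 2.5   # divisor bounded away from zero

out = torch.remainder(lhs, rhs)   # broadcasts to (S, S, S)
out.sum().backward()

# The gradient comes back at the input's shape, not the broadcast shape.
assert lhs.grad.shape == torch.Size([S, 1, S])
```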

return grad_input, (grad_input if ctx.b_tensor else None)

def maybe_unexpand_or_view_if_tensor(tensor, size):
    return tensor if (tensor is None or size is None) else maybe_unexpand_or_view(tensor, size)
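
maybe_unexpand_or_view itself isn't shown in this thread, but for the broadcast case the reduction it has to perform can be sketched roughly as follows (a hypothetical modern-API approximation, not the PR's actual code; the name unexpand_to is made up here):

```python
import torch

def unexpand_to(grad, size):
    # Reduce a gradient computed at the broadcast (output) shape back to the
    # original input size recorded during the forward pass.
    # 1) sum away leading dimensions that broadcasting prepended
    while grad.dim() > len(size):
        grad = grad.sum(0)
    # 2) sum (keepdim) over dimensions the input held as size 1 but were expanded
    for dim, s in enumerate(size):
        if s == 1 and grad.size(dim) != 1:
            grad = grad.sum(dim, keepdim=True)
    return grad
```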


2) Use better approach for avoiding divide-by-0 in autograd tests.
soumith closed this Jun 17, 2017
soumith reopened this Jun 17, 2017
soumith merged commit db70d4d into pytorch:master Jun 17, 2017
houseroad added a commit to houseroad/pytorch that referenced this pull request Feb 25, 2019
…db4c72

Summary:
Previous import was 4c091e048ca42682d63ccd3c1811560bc12b732d

Included changes:
- **[e18bb41](onnx/onnx@e18bb41)**: Infer shape of the second output of Dropout op (pytorch#1822) <Shinichiro Hamaji>
- **[cb544d0](onnx/onnx@cb544d0)**: Clarify dtype of Dropout's mask output (pytorch#1826) <Shinichiro Hamaji>
- **[b60f693](onnx/onnx@b60f693)**: Fix shape inference when auto_pad  is notset (pytorch#1824) <Li-Wen Chang>
- **[80346bd](onnx/onnx@80346bd)**: update test datat (pytorch#1825) <Rui Zhu>
- **[b37fc6d](onnx/onnx@b37fc6d)**: Add stringnormalizer operator to ONNX (pytorch#1745) <Dmitri Smirnov>

Differential Revision: D14206264

fbshipit-source-id: 1c747a9ea4d33455e6826ba733a46155323bc96f
jjsjann123 pushed a commit to jjsjann123/pytorch that referenced this pull request Aug 6, 2022
jjsjann123 added a commit that referenced this pull request Aug 9, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes include:

- codegen improvements:
  1. removes unnecessary sync from redundant thread compute analysis
  2. symmetric API for BestEffortReplay
  3. support merge on trivial reductions
  4. Ampere async copy improvements
- bug fixes:
  1. vectorization bug fixes
  2. type inference patch: fixes upstream #81725
  3. segmenter bug fix with deterministic iteration ordering
- parser update
  1. added leaky_relu
- scheduler
  1. normalization scheduler clean up.
  2. simplifies matmul scheduling with new transform propagator
  3. merge all dimensions in PW scheduler
  4. various gemm related improvements
- debuggability
  1. nsight compute support
  2. debug dump for InlinePropagator
  3. Add `UnaryOpType::Print`

Squashed commits to work around (WAR) the GitHub API.
Commits that are actually in this PR from the devel branch:

```
dfe02f3 Merge remote-tracking branch 'csarofeen/devel' into HEAD
1617373 Add `TensorViewBuilder::shape(std::vector<Val*> shape)` (#1884)
7cfb779 Merge pull request #1887 from csarofeen/upstream_merge_0803
3399f6d Merge remote-tracking branch 'origin/viable/strict' into HEAD
01208f5 Add `UnaryOpType::Print` which can be helpful for debugging (#1878)
0646522 Remove redundant TORCH_INTERNAL_ASSERT in lower_magic_zero.cpp (#1881)
7bc76aa Fix most inlined propagator for mismatched dims (#1875)
501f4aa Nonaffine swizzle formulation ep.2: Loop swizzle variant. (#1826)
d863d69 Ampere async copy ep.2: circular buffering extension to support pipelined matmul operand load (#1827)
e0ae11a Larger sized mma instructions to support full vectorization (#1824)
9bb4cf7 fragment iteration to support fully unrolled mma ops (#1823)
a48270a Merge all dims in pointwise scheduler (#1872)
172fb36 Make MostInlined and BestEffort inline propagation no longer assert replayed (#1868)
a64462a Allow trivial reduction to be merged (#1871)
440102b Symmetric API for BestEffortReplay (#1870)
d1caf33 Some misc cleanups/refactor split out from #1854 (#1867)
1013eda Remove some welford specific logic. (#1864)
51589d3 Some cleanups on tests and heuristics params (#1866)
a6b3e70 Segmenter bug fix, and deterministic iteration ordering.  (#1865)
1b665b9 Add nullptr checks to IrBuilder (#1861)
1cd9451 Simplify matmul scheduling with the new transform propagator.  (#1817)
bbc1fb9 Add leaky_relu operation (#1852)
e842a9b Minor cleanup in pointwise scheduler (#1858)
9ee850c Fix stringstream usage (#1857)
20a36c1 Improve nsight compute support (#1855)
4059103 Remove debugging `true ||` from getPointwiseHeuristics (#1822)
01117bf Misc cleanup (#1853)
5cc6494 Apply the magic-zero protection to each indexed domain individually for predicate indexing (#1846)
92e6f02 Cleanup normalization scheduler (#1845)
db89c65 Type inference patch (#1848)
102fe93 Add debug dump for InlinePropagator (#1847)
b7a4d93 Redundant thread compute analysis to avoid un-necessary sync insertion (#1687)
942be5b Upstream ci build fixes (#1842)
0b83645 Fix vectorization bug introduced in #1831 (#1840)
63630f1 Move MaxProducerPosUpdater into InlinePropagator::tearDown (#1825)
9135a96 Fix transpose benchmark dtype (#1839)
2c9a6c0 Add extra configurability to `parallelizeAllLike` (#1831)
```

RUN_TORCHBENCH: nvfuser

ghstack-source-id: 3745722
Pull Request resolved: #83067
jjsjann123 added a commit that referenced this pull request Aug 9, 2022
pytorchmergebot pushed a commit that referenced this pull request Aug 10, 2022
Differential Revision: [D38543000](https://our.internmc.facebook.com/intern/diff/D38543000)
Pull Request resolved: #83067
Approved by: https://github.com/davidberard98
facebook-github-bot pushed a commit that referenced this pull request Aug 10, 2022
Summary:
Pull Request resolved: #83067

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D38543000

Pulled By: davidberard98

fbshipit-source-id: 752edbfbced14fe01b84e417f23cc941b2148842