[MPS] Gather sliced inputs to batch norm #134121

pytorchbot · 2024-08-21T16:55:12Z

This PR removes the executeGatherOp flag from batch norm in favor of relying on the logic in

pytorch/aten/src/ATen/native/mps/OperationUtils.mm

Line 372 in 4aa66f6

if (needsGather(src) && gatherTensorData) {

to decide if gathering is necessary.

It's not the most efficient way to solve this issue, but it assures correctness for sliced inputs.

Performance impact

With fix

python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
100 loops, best of 5: 282 usec per loop

python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
100 loops, best of 5: 448 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 705 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 1.11 msec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 7.16 msec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 11.7 msec per loop

Without fix

python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
100 loops, best of 5: 284 usec per loop

python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
100 loops, best of 5: 265 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 715 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 675 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 7.19 msec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 7.13 msec per loop

Please feel free to push back or request changes.

Fixes #133520

This PR removes the `executeGatherOp` flag from batch norm in favor of relying on the logic in https://github.com/pytorch/pytorch/blob/4aa66f68a803927ddd127ceaaa1521b8d6e90e5f/aten/src/ATen/native/mps/OperationUtils.mm#L372 to decide if gathering is necessary. It's not the most efficient way to solve this issue, but it assures correctness for sliced inputs. ### Performance impact #### With fix ``` python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)" 100 loops, best of 5: 282 usec per loop python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])" 100 loops, best of 5: 448 usec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)" 1000 loops, best of 5: 705 usec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])" 1000 loops, best of 5: 1.11 msec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x)" 1000 loops, best of 5: 7.16 msec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x[5:])" 1000 loops, best of 5: 11.7 msec per loop ``` #### Without fix ``` python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)" 100 loops, best of 5: 284 usec per loop python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])" 100 loops, best of 5: 265 usec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)" 1000 loops, best of 5: 715 usec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])" 1000 loops, best of 5: 675 usec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x)" 1000 loops, best of 5: 7.19 msec per loop python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x[5:])" 1000 loops, best of 5: 7.13 msec per loop ``` Please feel free to push back or request changes. Fixes #133520 Pull Request resolved: #133610 Approved by: https://github.com/malfet (cherry picked from commit 43f78bf)

pytorch-bot · 2024-08-21T16:55:15Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134121

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Cancelled Job, 20 Unrelated Failures

As of commit 01780d2 with merge base b66e3f0 ():

NEW FAILURES - The following jobs have failed:

pull / linux-focal-py3_8-clang9-xla / test (xla, 1, 1, linux.12xlarge) (gh)
test_repeat_truncated
pull / linux-focal-py3.12-clang10 / test (default, 2, 3, linux.2xlarge) (gh)
dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_outside_linear_module_free_dynamic_shapes

CANCELLED JOB - The following job was cancelled. Please retry:

Mac MPS / macos-py3-arm64-mps / test (mps, 1, 1, macos-m2-14) (gh)

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 1, 3, linux.8xlarge.nvidia.gpu) (gh) (detected as infra flaky with no log or failing log classifier)
pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 2, 3, linux.8xlarge.nvidia.gpu) (gh) (detected as infra flaky with no log or failing log classifier)
pull / linux-focal-cuda11.8-py3.10-gcc9 / test (distributed, 3, 3, linux.8xlarge.nvidia.gpu) (gh) (detected as infra flaky with no log or failing log classifier)
pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 1, 5, linux.4xlarge.nvidia.gpu) (gh) (detected as infra flaky with no log or failing log classifier)
pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 2, 5, linux.4xlarge.nvidia.gpu) (gh) (detected as infra flaky with no log or failing log classifier)
pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 3, 5, linux.4xlarge.nvidia.gpu) (gh) (detected as infra flaky with no log or failing log classifier)
pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 4, 5, linux.4xlarge.nvidia.gpu) (gh) (detected as infra flaky with no log or failing log classifier)
pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 5, 5, linux.4xlarge.nvidia.gpu) (gh) (detected as infra flaky with no log or failing log classifier)
pull / linux-focal-cuda12.1-py3.10-gcc9 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu) (gh) (detected as infra flaky with no log or failing log classifier)
pull / linux-focal-cuda12.1-py3.10-gcc9-bazel-test / build-and-test (default, 1, 1, linux.4xlarge.nvidia.gpu) (gh) (similar failure)
pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 1, 5, linux.g5.4xlarge.nvidia.gpu) (gh) (detected as infra flaky with no log or failing log classifier)
pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 2, 5, linux.g5.4xlarge.nvidia.gpu) (gh) (detected as infra flaky with no log or failing log classifier)
pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 3, 5, linux.g5.4xlarge.nvidia.gpu) (gh) (detected as infra flaky with no log or failing log classifier)
pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 4, 5, linux.g5.4xlarge.nvidia.gpu) (gh) (detected as infra flaky with no log or failing log classifier)
pull / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (default, 5, 5, linux.g5.4xlarge.nvidia.gpu) (gh) (detected as infra flaky with no log or failing log classifier)
pull / linux-focal-cuda12.4-py3.10-gcc9-bazel-test / build-and-test (default, 1, 1, linux.4xlarge.nvidia.gpu) (gh) (similar failure)
pull / linux-focal-py3.11-clang10 / test (dynamo, 3, 3, linux.2xlarge) (gh) (disabled by #128551)
test_dataloader.py::TestDataLoader::test_segfault
pull / linux-focal-py3.12-clang10 / test (dynamo, 2, 3, linux.2xlarge) (gh) (disabled by #128551)
test_dataloader.py::TestDataLoader::test_segfault
pull / linux-focal-py3.8-clang10 / test (dynamo, 2, 3, linux.2xlarge) (gh) (disabled by #128551)
test_dataloader.py::TestDataLoader::test_segfault
pull / linux-jammy-py3.8-gcc11 / test (distributed, 2, 2, linux.2xlarge) (gh) (matched linux rule in flaky-rules.json)
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorchbot requested review from kulinseth and malfet as code owners August 21, 2024 16:55

This was referenced Aug 21, 2024

[v2.4.1] Release Tracker #132400

Closed

[MPS] Gather sliced inputs to batch norm #133610

Closed

pytorch-bot bot added ciflow/mps Run MPS tests (subset of trunk) release notes: mps Release notes category labels Aug 21, 2024

pytorchbot added the open source label Aug 21, 2024

atalman approved these changes Aug 22, 2024

View reviewed changes

atalman merged commit eba4d08 into release/2.4 Aug 22, 2024

atalman deleted the cherry-pick-133610-by-pytorch_bot_bot_ branch August 22, 2024 17:39

hvaara mentioned this pull request Aug 27, 2024

[MPS] BatchNorm2D produces incorrect results for column first tensors #134580

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[MPS] Gather sliced inputs to batch norm #134121

[MPS] Gather sliced inputs to batch norm #134121

Uh oh!

pytorchbot commented Aug 21, 2024

Uh oh!

pytorch-bot bot commented Aug 21, 2024 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

[MPS] Gather sliced inputs to batch norm #134121

[MPS] Gather sliced inputs to batch norm #134121

Uh oh!

Conversation

pytorchbot commented Aug 21, 2024

Performance impact

With fix

Without fix

Uh oh!

pytorch-bot bot commented Aug 21, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134121

❌ 2 New Failures, 1 Cancelled Job, 20 Unrelated Failures

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

pytorch-bot bot commented Aug 21, 2024 •

edited

Loading