[BE][Attention] Code de-dup #139784

malfet · 2024-11-05T18:57:32Z

Stack from ghstack (oldest at bottom):

The only difference between `convert_boolean_attn_mask_cudnn` and `convert_boolean_attn_mask` is the value that masked-out positions are filled with when a boolean mask is converted.
Reduce duplication by introducing `convert_boolean_attn_mask_`, which takes a `neg_inf` value, so that the aforementioned implementations become trivial one-line calls.
Also, as suggested by @Skylion007, replace `at::where(foo->logical_not(), -inf, 0)` with `at::where(*foo, 0, -inf)`.
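
To make the refactoring concrete, here is a minimal sketch in plain C++ (no ATen), assuming a `std::vector<bool>` stands in for the boolean mask and a hand-written `where` stands in for `at::where`. The names `where`, `convert_mask_`, `convert_mask`, `convert_mask_finite` and `convert_mask_old` are hypothetical, and the finite fill value in `convert_mask_finite` is an assumption chosen only so that the two wrappers differ; it is not taken from the cuDNN path.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Element-wise select: a plain-C++ stand-in for at::where(cond, a, b).
std::vector<double> where(const std::vector<bool>& cond, double if_true, double if_false) {
  std::vector<double> out(cond.size());
  for (std::size_t i = 0; i < cond.size(); ++i) {
    out[i] = cond[i] ? if_true : if_false;
  }
  return out;
}

// Shared helper (hypothetical model of convert_boolean_attn_mask_):
// positions to keep become 0, masked-out positions become neg_inf.
std::vector<double> convert_mask_(const std::vector<bool>& keep, double neg_inf) {
  return where(keep, 0.0, neg_inf);
}

// Thin wrapper using a true negative infinity as the fill value.
std::vector<double> convert_mask(const std::vector<bool>& keep) {
  return convert_mask_(keep, -std::numeric_limits<double>::infinity());
}

// Thin wrapper using a finite fill value (assumed here purely for illustration).
std::vector<double> convert_mask_finite(const std::vector<bool>& keep) {
  return convert_mask_(keep, std::numeric_limits<double>::lowest());
}

// The older formulation: negate the mask first, then swap the two branches.
std::vector<double> convert_mask_old(const std::vector<bool>& keep) {
  std::vector<bool> negated(keep.size());
  for (std::size_t i = 0; i < keep.size(); ++i) {
    negated[i] = !keep[i];
  }
  return where(negated, -std::numeric_limits<double>::infinity(), 0.0);
}
```

For the mask `{true, false, true}`, both `convert_mask` and `convert_mask_old` yield 0, negative infinity, 0, which is why the `logical_not` can be dropped once the two branches are swapped.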

pytorch-bot · 2024-11-05T18:57:36Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139784

✅ No Failures

As of commit 2ed9300 with merge base 4d5cc1b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

The only difference between `convert_boolean_attn_mask_cudnn` and `convert_boolean_attn_mask` is the value that masked-out positions are filled with when a boolean mask is converted. Reduce duplication by introducing `convert_boolean_attn_mask_`, which takes a `neg_inf` value, so that the aforementioned implementations become trivial one-line calls.

ghstack-source-id: 2d316b4
Pull Request resolved: #139784

Skylion007

optimization nit

aten/src/ATen/native/transformers/attention.cpp

drisspg · 2024-11-05T20:22:43Z

aten/src/ATen/native/transformers/attention.cpp

  // to mask *out*.
  if (attn_mask->dtype() == at::kBool) {
-    return at::where(attn_mask->logical_not(), -std::numeric_limits<double>::infinity(), at::scalar_tensor(0.0, at::TensorOptions().dtype(dtype).device(attn_mask->device())));
+    return at::where(*attn_mask, 0.0, at::scalar_tensor(neg_inf, at::TensorOptions().dtype(dtype).device(attn_mask->device())));


We should probably also have 0.0 be a scalar tensor, right? I am not sure why this didn't also need to be updated.

I suspect only one `scalar_tensor` is needed to preserve the output dtype; the other can be inferred from the first.

malfet · 2024-11-05T22:25:27Z

@pytorchbot merge

pytorchmergebot · 2024-11-05T22:27:06Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2024-11-05T22:32:21Z

Merge failed

Reason: New commits were pushed while merging. Please rerun the merge command.

malfet · 2024-11-06T01:15:35Z

@pytorchbot merge

pytorchmergebot · 2024-11-06T01:17:14Z

Merge started

Maybe I'm missing some vital piece of information, but it feels like

```c++
const auto neg_inf = at::scalar_tensor(-std::numeric_limits<float>::infinity(), at::TensorOptions().dtype(out.dtype()).device(out.device()));
const auto masked = self.eq(neg_inf);
```

should be equivalent to a [`torch.isneginf`](https://pytorch.org/docs/stable/generated/torch.isneginf.html) call.

Pull Request resolved: #139763
Approved by: https://github.com/Skylion007
ghstack dependencies: #139788, #139784
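
A rough way to see why the two spellings agree, sketched in plain C++ rather than ATen: comparing a value for equality with negative infinity, and asking whether it is an infinity with the sign bit set, select the same elements. NaN matches neither test. `eq_neg_inf` and `is_neg_inf` are hypothetical helper names for this sketch, not PyTorch functions.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Models self.eq(neg_inf): element-wise equality against negative infinity.
std::vector<bool> eq_neg_inf(const std::vector<float>& xs) {
  const float neg_inf = -std::numeric_limits<float>::infinity();
  std::vector<bool> out(xs.size());
  for (std::size_t i = 0; i < xs.size(); ++i) {
    out[i] = (xs[i] == neg_inf);
  }
  return out;
}

// Models an isneginf-style predicate: infinite and negative.
std::vector<bool> is_neg_inf(const std::vector<float>& xs) {
  std::vector<bool> out(xs.size());
  for (std::size_t i = 0; i < xs.size(); ++i) {
    out[i] = std::isinf(xs[i]) && std::signbit(xs[i]);
  }
  return out;
}
```

Both helpers flag only the negative-infinity entry of an input such as negative infinity, positive infinity, NaN, zero, and the lowest finite float.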

As macOS 15 or newer supports those out of the box. This significantly reduces memory requirements and improves performance for some Stable Diffusion networks.

Test plan: Run

```python
from diffusers import StableDiffusionXLPipeline, AutoencoderKL, EulerAncestralDiscreteScheduler
import torch
import time

# Load the VAE and the SDXL pipeline in bfloat16 on the MPS device.
vae = AutoencoderKL.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", subfolder='vae', torch_dtype=torch.bfloat16, force_upcast=False).to('mps')
pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", vae=vae, torch_dtype=torch.bfloat16, variant="fp16").to('mps')
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

# Record wall-clock time and driver-allocated memory around one generation.
start_time = time.time()
start_mps_mem = torch.mps.driver_allocated_memory()
image = pipe(prompt="Spherical cow in vacuum",
             num_inference_steps=10,
             guidance_scale=8,
             generator=torch.Generator("mps").manual_seed(42),
             ).images[0]
end_mps_mem = torch.mps.driver_allocated_memory()
run_time = time.time() - start_time
# The memory increase is measured against the starting memory, not the start time.
print(f"run time in {run_time:.2f} sec, end_mps_mem {end_mps_mem/1024.0**2:.2f} Mb mem increase {(end_mps_mem-start_mps_mem)/1024.0**2:.2f} Mb")
image.save('bfloat16.png')
```

Before the change, total memory use was 16 GB and the run needed 65 sec to complete; after it, memory use drops to 14 GB and the run takes 50 sec on an M2 Pro, while the generated image remains the same:

![image](https://github.com/user-attachments/assets/1a35efef-9f80-4cd0-ac9c-30203eab6bb1)

Fixes #139389
Pull Request resolved: #139791
Approved by: https://github.com/drisspg, https://github.com/Skylion007
ghstack dependencies: #139788, #139784, #139763

The only difference between `convert_boolean_attn_mask_cudnn` and `convert_boolean_attn_mask` is the value that masked-out positions are filled with when a boolean mask is converted. Reduce duplication by introducing `convert_boolean_attn_mask_`, which takes a `neg_inf` value, so that the aforementioned implementations become trivial one-line calls.

Also, as suggested by @Skylion007, replace `at::where(foo->logical_not(), -inf, 0)` with `at::where(*foo, 0, -inf)`.

Pull Request resolved: pytorch#139784
Approved by: https://github.com/Skylion007, https://github.com/drisspg
ghstack dependencies: pytorch#139788


malfet mentioned this pull request Nov 5, 2024

[BE][Attention] Use isneginf #139763

Closed

Skylion007 reviewed Nov 5, 2024

aten/src/ATen/native/transformers/attention.cpp (outdated)

Skylion007 approved these changes Nov 5, 2024

malfet mentioned this pull request Nov 5, 2024

[BE][Attention] Factor out common code #139788

Closed

malfet mentioned this pull request Nov 5, 2024

[MPS][Perf] Dispatch to SDP-math-mps for non-contig Tensors #139791

Closed

drisspg reviewed Nov 5, 2024

drisspg approved these changes Nov 5, 2024

malfet added the topic: not user facing topic category label Nov 5, 2024

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Nov 5, 2024

pytorchmergebot added the merging label Nov 5, 2024

pytorchmergebot removed the merging label Nov 5, 2024

pytorchmergebot added the merging label Nov 6, 2024

pytorchmergebot added the Merged label Nov 6, 2024

pytorchmergebot closed this in bd45c00 Nov 6, 2024

pytorchmergebot removed the merging label Nov 6, 2024

github-actions bot deleted the gh/malfet/48/head branch December 6, 2024 02:14
