
Conversation

@malfet (Contributor) commented Nov 5, 2024

Stack from ghstack (oldest at bottom):

Maybe I'm missing some vital piece of information, but it feels like
```c++
  const auto neg_inf = at::scalar_tensor(-std::numeric_limits<float>::infinity(), at::TensorOptions().dtype(out.dtype()).device(out.device()));
  const auto masked = self.eq(neg_inf);
```
should be equivalent to a [`torch.isneginf`](https://pytorch.org/docs/stable/generated/torch.isneginf.html) call
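
One quick way to sanity-check the claimed equivalence (a minimal sketch assuming a libtorch build; the `main` harness and sample values are illustrative, not part of this PR):
```c++
#include <ATen/ATen.h>
#include <cmath>
#include <limits>

int main() {
  const float inf = std::numeric_limits<float>::infinity();
  // Sample covering -inf, a finite value, +inf, and NaN.
  const auto self = at::tensor({-inf, 0.0f, inf, std::nanf("")});
  // The manual pattern: compare against a -inf scalar tensor.
  const auto neg_inf = at::scalar_tensor(-inf, self.options());
  const auto masked = self.eq(neg_inf);
  // Both should produce the same boolean mask: {true, false, false, false}.
  TORCH_CHECK(masked.equal(at::isneginf(self)));
  return 0;
}
```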

[ghstack-poisoned]
pytorch-bot bot commented Nov 5, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139763

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6f60f47 with merge base 4d5cc1b:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@malfet malfet requested a review from drisspg November 5, 2024 16:59
@malfet malfet added the `topic: not user facing` label Nov 5, 2024
@Skylion007 Skylion007 added the `better-engineering` label Nov 5, 2024
@Skylion007 (Collaborator) commented Nov 5, 2024

@malfet Seems like we need a NestedTensor shim for the isinf functions:

```c++
Tensor NestedTensor_logical_not(const Tensor& self) {
```
?
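
For reference, a minimal sketch of what such a shim could look like, following the `map_nt` pattern the existing NestedTensor unary ops use (helper availability and placement are assumptions; the review diff further down shows the merged version):
```c++
// Sketch only: wraps the dense kernel with the map_nt helper that the
// existing NestedTensor unary shims (e.g. logical_not above) already use.
Tensor NestedTensor_isneginf(const Tensor& self) {
  // Apply at::isneginf to the contiguous buffer and rewrap the result
  // with the original nested sizes.
  return map_nt(self, at::isneginf);
}
```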

@malfet malfet changed the title [BE][Transformers] Use isneginf [BE][Attention] Use isneginf Nov 5, 2024
@drisspg (Contributor) commented Nov 5, 2024

+1 @Skylion007 on needing NST (NestedTensor) support. Pretty straightforward to add for the pointwise ops.

```c++
  return map_nt(self, at::logical_not);
}

Tensor NestedTensor_isneginf(const Tensor& self) {
```
@Skylion007 (Collaborator) commented on the diff above, Nov 5, 2024
nit: can we add isinf and the other missing functions while we are at it?
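
A hedged sketch of what those additions might look like, with the same one-liner pattern applied to the neighboring dense kernels (illustrative only, not necessarily the exact diff the follow-up PR landed):
```c++
// Illustrative only: the other missing pointwise queries, mirroring
// NestedTensor_isneginf above; at::isinf and at::isposinf are the
// existing dense kernels.
Tensor NestedTensor_isinf(const Tensor& self) {
  return map_nt(self, at::isinf);
}

Tensor NestedTensor_isposinf(const Tensor& self) {
  return map_nt(self, at::isposinf);
}
```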

@malfet (Contributor, Author) commented Nov 6, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the `ciflow/trunk` label Nov 6, 2024
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Nov 6, 2024
As macOS 15 or newer supports those out of the box. This significantly reduces memory requirements and improves performance for some stable diffusion networks.

Test plan: Run
```python
from diffusers import StableDiffusionXLPipeline, AutoencoderKL, EulerAncestralDiscreteScheduler
import torch
import time

vae = AutoencoderKL.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0",
                                    subfolder='vae',
                                    torch_dtype=torch.bfloat16,
                                    force_upcast=False).to('mps')

pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", vae=vae,
                                                 torch_dtype=torch.bfloat16, variant="fp16").to('mps')
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

start_time = time.time()
start_mps_mem = torch.mps.driver_allocated_memory()
image = pipe(prompt="Spherical cow in vacuum",
             num_inference_steps=10,
             guidance_scale=8,
             generator=torch.Generator("mps").manual_seed(42),
             ).images[0]
end_mps_mem = torch.mps.driver_allocated_memory()
run_time = time.time() - start_time
print(f"run time in {run_time:.2f} sec, end_mps_mem {end_mps_mem/1024.0**2:.2f} Mb mem increase {(end_mps_mem-start_time)/1024.0**2:.2f} Mb")
image.save(f'bfloat16.png')
```

Before the change, total memory use was 16GB and the run needed 65 sec to complete; after it, memory drops to 14GB and the run takes 50 sec on an M2 Pro, while the generated image remains the same:
![image](https://github.com/user-attachments/assets/1a35efef-9f80-4cd0-ac9c-30203eab6bb1)

Fixes #139389
Pull Request resolved: #139791
Approved by: https://github.com/drisspg, https://github.com/Skylion007
ghstack dependencies: #139788, #139784, #139763
pytorchmergebot pushed a commit that referenced this pull request Nov 6, 2024
Follow-up to some missing ops that @malfet's recent PR (#139763) pointed out. Tried to mirror it for other important nearby ops. Seems like we could automate/autogen this more for generic pointwise ops like this.

Pull Request resolved: #139890
Approved by: https://github.com/malfet
atalman pushed commits to atalman/pytorch that referenced this pull request Nov 11, 2024
zero000064 pushed commits to zero000064/pytorch that referenced this pull request Nov 14, 2024
Ryo-not-rio pushed commits to Ryo-not-rio/pytorch that referenced this pull request Dec 2, 2024
pobin6 pushed commits to pobin6/pytorch that referenced this pull request Dec 5, 2024
@github-actions github-actions bot deleted the gh/malfet/47/head branch December 8, 2024 02:17
jeffhataws added a commit to jeffhataws/pytorch that referenced this pull request Mar 27, 2025

Labels

better-engineering · ciflow/trunk · Merged · topic: not user facing
