[BE][Attention] Use isneginf
#139763
Conversation
Maybe I'm missing some vital piece of information, but it feels like

```c++
const auto neg_inf = at::scalar_tensor(-std::numeric_limits<float>::infinity(),
                                       at::TensorOptions().dtype(out.dtype()).device(out.device()));
const auto masked = self.eq(neg_inf);
```

should be equivalent to a [`torch.isneginf`](https://pytorch.org/docs/stable/generated/torch.isneginf.html) call.

[ghstack-poisoned]
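For a quick sanity check of that claimed equivalence, here is a minimal Python sketch (an illustration on an ordinary floating-point tensor, not the PR's C++ path): comparing against a `-inf` scalar should produce the same boolean mask as `torch.isneginf`.

```python
import torch

# Sketch only: verify that eq(-inf) matches torch.isneginf for a plain float tensor.
x = torch.tensor([0.0, float("-inf"), float("inf"), float("nan"), -1.5])
neg_inf = torch.tensor(float("-inf"), dtype=x.dtype, device=x.device)
assert torch.equal(x.eq(neg_inf), torch.isneginf(x))
```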
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/139763
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 6f60f47 with merge base 4d5cc1b
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@malfet Seems like we need a NestedTensor shim for isinf functions:
+1 @Skylion007 on needing NST support. Pretty straightforward to add for the pointwise ops.
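For illustration, here is a minimal Python sketch of the "map over components" idea behind such a shim; the `map_pointwise` helper is hypothetical (the actual C++ code uses `map_nt`), and it assumes the result dtype is constructible as a nested tensor.

```python
import torch

# Hypothetical helper mirroring what a pointwise NestedTensor shim does:
# apply the op to each component and rebuild a nested tensor from the results.
def map_pointwise(nt, fn):
    return torch.nested.nested_tensor([fn(t) for t in nt.unbind()])

nt = torch.nested.nested_tensor([
    torch.tensor([0.0, float("-inf")]),
    torch.tensor([float("inf"), -1.0, float("-inf")]),
])
masked = map_pointwise(nt, torch.isneginf)
print([t.tolist() for t in masked.unbind()])
# [[False, True], [False, False, True]]
```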
```c++
  return map_nt(self, at::logical_not);
}

Tensor NestedTensor_isneginf(const Tensor& self) {
```
nit: can we add isinf and the other missing functions while we are at it?
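As a side note on how these ops relate (a sketch, not part of this PR's diff): for floating-point inputs, `torch.isinf` is the elementwise OR of `torch.isposinf` and `torch.isneginf`, so adding the pair alongside `isinf` keeps the shims consistent.

```python
import torch

# isinf marks both signs of infinity; isposinf / isneginf split it by sign.
x = torch.tensor([0.0, float("inf"), float("-inf"), float("nan")])
assert torch.equal(torch.isinf(x), torch.isposinf(x) | torch.isneginf(x))
```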
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
As macOS 15 or newer supports those out of the box. This significantly reduces memory requirements and improves performance for some stable diffusion networks.
Test plan: Run
```python
from diffusers import StableDiffusionXLPipeline, AutoencoderKL, EulerAncestralDiscreteScheduler
import torch
import time
vae = AutoencoderKL.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0",
subfolder='vae',
torch_dtype=torch.bfloat16,
force_upcast=False).to('mps')
pipe = StableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", vae=vae,
torch_dtype=torch.bfloat16, variant="fp16").to('mps')
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
start_time = time.time()
start_mps_mem = torch.mps.driver_allocated_memory()
image = pipe(prompt="Spherical cow in vacuum",
num_inference_steps=10,
guidance_scale=8,
generator=torch.Generator("mps").manual_seed(42),
).images[0]
end_mps_mem = torch.mps.driver_allocated_memory()
run_time = time.time() - start_time
print(f"run time in {run_time:.2f} sec, end_mps_mem {end_mps_mem/1024.0**2:.2f} Mb mem increase {(end_mps_mem-start_mps_mem)/1024.0**2:.2f} Mb")
image.save('bfloat16.png')
```
Before the change, total memory use was 16GB and the run needed 65 sec to complete; after it, memory drops to 14GB and the run takes 50 sec on an M2 Pro, while the generated image remains the same:

Fixes #139389
Pull Request resolved: #139791
Approved by: https://github.com/drisspg, https://github.com/Skylion007
ghstack dependencies: #139788, #139784, #139763
Follow-up to some issues @malfet's recent PR (#139763) pointed out about missing ops. Tried to mirror it to other important nearby ops. Seems like we could automate / autogen this more for generic pointwise ops like this.
Pull Request resolved: #139890
Approved by: https://github.com/malfet
Maybe I'm missing some vital piece of information, but it feels like

```c++
const auto neg_inf = at::scalar_tensor(-std::numeric_limits<float>::infinity(),
                                       at::TensorOptions().dtype(out.dtype()).device(out.device()));
const auto masked = self.eq(neg_inf);
```

should be equivalent to a [`torch.isneginf`](https://pytorch.org/docs/stable/generated/torch.isneginf.html) call.

Pull Request resolved: pytorch#139763
Approved by: https://github.com/Skylion007
ghstack dependencies: pytorch#139788, pytorch#139784
This reverts commit 157c18a.