
Illegal Memory Access w/ efficient Attention + compile #138772

@drisspg

Description

Summary

The base repro:
PYTORCH_NO_CUDA_MEMORY_CACHING=1 CUDA_LAUNCH_BLOCKING=1 python benchmark.py --model maxvit_nano_rw_256 --precision bfloat16 --torchcompile --bench train --no-retry -b 64

This produces:

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /home/drisspg/meta/pytorch/c10/cuda/CUDAException.cpp:43 (most
frame #6: at::native::_efficient_attention_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, long, long, at::Tensor const&, double, at::Tensor const&, at::Tensor const&, long, bool, std::optional<double>, std::optional<long>, std::optional<long>, bool) + 0x2338 (0x7f0a1e35b418 in /home/drisspg/meta/pytorch/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x344cbc7 (0x7f0a1e44cbc7 in /home/drisspg/meta/pytorch/torch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0x347bb1d (0x7f0a1e47bb1d in /home/drisspg/meta/pytorch/torch/lib/libtorch_cuda.so)
frame #9: at::_ops::_efficient_attention_backward::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, at::Tensor const&, double, at::Tensor const&, at::Tensor const&, long, bool, std::optional<double>, std::optional<long>, std::optional<long>, bool) + 0x350 (0x7f0a2b292490 in /home/drisspg/meta/pytorch/torch/lib/libtorch_cpu.so)
frame #10: at::native::_scaled_dot_product_efficient_attention_backward_cuda(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, std::array<bool, 4ul>, bool, std::optional<double>) + 0x2b3 (0x7f0a1e1bef23 in 

Current Debugging

Verified that this is specific to inductor:

  1. No error in Eager
  2. No error when compiling with: aot_eager_decomp_partitioner

The actual IMA is an out-of-bounds read.
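
For reference, the backend bisection above can be scripted along these lines. This is a minimal sketch with a toy attention module and made-up shapes, not the maxvit_nano_rw_256 run, and it assumes the registered backend string is aot_eager_decomp_partition:

import torch
import torch.nn.functional as F


class ToyAttention(torch.nn.Module):
    def forward(self, q, k, v):
        return F.scaled_dot_product_attention(q, k, v)


def run(backend):
    model = ToyAttention().cuda()
    # Plain eager vs. aot_eager_decomp_partition (partitioner, no codegen) vs. inductor.
    compiled = model if backend == "eager" else torch.compile(model, backend=backend)
    q, k, v = (torch.randn(64, 8, 256, 64, device="cuda",
                           dtype=torch.bfloat16, requires_grad=True)
               for _ in range(3))
    out = compiled(q, k, v)
    out.sum().backward()  # only the inductor path trips the IMA reported here


for backend in ("eager", "aot_eager_decomp_partition", "inductor"):
    run(backend)
    print(f"{backend}: ok")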

Smaller-ish repro

https://gist.github.com/drisspg/49d5bec4fdadeace206455267e1ef135
If you run this with PYTORCH_NO_CUDA_MEMORY_CACHING=1 CUDA_LAUNCH_BLOCKING=1, you should hit the same IMA.
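
The gist isn't reproduced here, but a standalone script of that shape would look roughly like the sketch below; the module and shapes are illustrative assumptions, not the gist's exact contents, with SDPA pinned to the memory-efficient backend so the backward routes through _efficient_attention_backward as in the trace above:

import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel


@torch.compile  # default inductor backend
def attn(q, k, v):
    return F.scaled_dot_product_attention(q, k, v)


q, k, v = (torch.randn(64, 8, 256, 64, device="cuda",
                       dtype=torch.bfloat16, requires_grad=True)
           for _ in range(3))

# Force the memory-efficient kernel so the backward dispatches to
# _efficient_attention_backward, matching the stack trace in the description.
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    out = attn(q, k, v)
    out.sum().backward()

Running it as PYTORCH_NO_CUDA_MEMORY_CACHING=1 CUDA_LAUNCH_BLOCKING=1 python repro.py disables the caching allocator and makes kernel launches synchronous, so the IMA surfaces at the offending launch rather than at a later sync point.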

Labels

module: inductor, oncall: pt2, triaged
