
Illegal Memory Access w/ efficient Attention + compile #138772

@drisspg

Description

Summary

The base repro:
PYTORCH_NO_CUDA_MEMORY_CACHING=1 CUDA_LAUNCH_BLOCKING=1 python benchmark.py --model maxvit_nano_rw_256 --precision bfloat16 --torchcompile --bench train --no-retry -b 64

This produces:

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /home/drisspg/meta/pytorch/c10/cuda/CUDAException.cpp:43 (most
frame #6: at::native::_efficient_attention_backward(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, long, long, at::Tensor const&, double, at::Tensor const&, at::Tensor const&, long, bool, std::optional<double>, std::optional<long>, std::optional<long>, bool) + 0x2338 (0x7f0a1e35b418 in /home/drisspg/meta/pytorch/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x344cbc7 (0x7f0a1e44cbc7 in /home/drisspg/meta/pytorch/torch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0x347bb1d (0x7f0a1e47bb1d in /home/drisspg/meta/pytorch/torch/lib/libtorch_cuda.so)
frame #9: at::_ops::_efficient_attention_backward::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, c10::SymInt, c10::SymInt, at::Tensor const&, double, at::Tensor const&, at::Tensor const&, long, bool, std::optional<double>, std::optional<long>, std::optional<long>, bool) + 0x350 (0x7f0a2b292490 in /home/drisspg/meta/pytorch/torch/lib/libtorch_cpu.so)
frame #10: at::native::_scaled_dot_product_efficient_attention_backward_cuda(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, double, std::array<bool, 4ul>, bool, std::optional<double>) + 0x2b3 (0x7f0a1e1bef23 in 

Current Debugging

Verified that this is specific to inductor:

  1. No error in Eager
  2. No error when compiling with: aot_eager_decomp_partitioner

The actual IMA is an out-of-bounds read.
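
For reference, the backend bisection above can be scripted along these lines. This is a minimal sketch with a toy attention module and made-up shapes, not the maxvit_nano_rw_256 run, and it assumes the registered backend string is aot_eager_decomp_partition:

import torch
import torch.nn.functional as F


class ToyAttention(torch.nn.Module):
    def forward(self, q, k, v):
        return F.scaled_dot_product_attention(q, k, v)


def run(backend):
    model = ToyAttention().cuda()
    # Plain eager vs. aot_eager_decomp_partition (partitioner, no codegen) vs. inductor.
    compiled = model if backend == "eager" else torch.compile(model, backend=backend)
    q, k, v = (torch.randn(64, 8, 256, 64, device="cuda",
                           dtype=torch.bfloat16, requires_grad=True)
               for _ in range(3))
    out = compiled(q, k, v)
    out.sum().backward()  # only the inductor path trips the IMA reported here


for backend in ("eager", "aot_eager_decomp_partition", "inductor"):
    run(backend)
    print(f"{backend}: ok")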

Smaller-ish repro

https://gist.github.com/drisspg/49d5bec4fdadeace206455267e1ef135
If you run this with PYTORCH_NO_CUDA_MEMORY_CACHING=1 CUDA_LAUNCH_BLOCKING=1, you should hit the same IMA.
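
The gist isn't reproduced here, but a standalone script of that shape would look roughly like the sketch below; the module and shapes are illustrative assumptions, not the gist's exact contents, with SDPA pinned to the memory-efficient backend so the backward routes through _efficient_attention_backward as in the trace above:

import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel


@torch.compile  # default inductor backend
def attn(q, k, v):
    return F.scaled_dot_product_attention(q, k, v)


q, k, v = (torch.randn(64, 8, 256, 64, device="cuda",
                       dtype=torch.bfloat16, requires_grad=True)
           for _ in range(3))

# Force the memory-efficient kernel so the backward dispatches to
# _efficient_attention_backward, matching the stack trace in the description.
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    out = attn(q, k, v)
    out.sum().backward()

Running it as PYTORCH_NO_CUDA_MEMORY_CACHING=1 CUDA_LAUNCH_BLOCKING=1 python repro.py disables the caching allocator and makes kernel launches synchronous, so the IMA surfaces at the offending launch rather than at a later sync point.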

Labels

module: inductor, oncall: pt2, triaged
