@atalman atalman commented May 24, 2022

This prevents import torch from accidentally crashing on machines with no Metal devices.

Should prevent crashes reported in #77662 (comment) and https://github.com/pytorch/functorch/runs/6560056366?check_suite_focus=true

Backtrace to the crash:

(lldb) bt
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff7202be57 libobjc.A.dylib`objc_msgSend + 23
    frame #1: 0x000000010fd9f524 libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl() + 436
    frame #2: 0x000000010fda011d libtorch_cpu.dylib`_GLOBAL__sub_I_MPSAllocator.mm + 125
    frame #3: 0x000000010ada81e3 dyld`ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) + 535
    frame #4: 0x000000010ada85ee dyld`ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) + 40
(lldb) up
frame #1: 0x000000010fd9f524 libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl() + 436
libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl:
->  0x10fd9f524 <+436>: movq   %rax, 0x1b0(%rbx)
    0x10fd9f52b <+443>: movw   $0x0, 0x1b8(%rbx)
    0x10fd9f534 <+452>: addq   $0x8, %rsp
    0x10fd9f538 <+456>: popq   %rbx
(lldb) disassemble
 ...
    0x10fd9f514 <+420>: movq   0xf19ad15(%rip), %rsi     ; "maxBufferLength"
    0x10fd9f51b <+427>: movq   %r14, %rdi
    0x10fd9f51e <+430>: callq  *0xeaa326c(%rip)          ; (void *)0x00007fff7202be40: objc_msgSend

This corresponds to the [m_device maxBufferLength] call in the member initializer below, which runs while m_device is not initialized:

m_total_allocated_memory(0), m_max_buffer_size([m_device maxBufferLength]),
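
The backtrace shows the constructor running from `_GLOBAL__sub_I_MPSAllocator.mm`, a static initializer that dyld executes at library load time, before any code can check whether a device exists. One common way to avoid this class of crash is to defer construction until first use and guard the device query. The following is a minimal sketch of that general pattern, not the actual change in this PR; `FakeDevice`, `find_default_device`, `LazyAllocator`, `get_allocator`, and `construction_count` are all hypothetical names introduced for illustration.

```cpp
#include <cstddef>

// Hypothetical stand-in for a GPU device handle (not a Metal or PyTorch type).
struct FakeDevice {
  std::size_t max_buffer_length;
};

// Hypothetical lookup that simulates a machine with no GPU: always nullptr.
inline const FakeDevice* find_default_device() { return nullptr; }

// Counts constructions so the laziness can be observed.
inline int& construction_count() {
  static int n = 0;
  return n;
}

class LazyAllocator {
 public:
  LazyAllocator()
      : device_(find_default_device()),
        // Guarded: only query the device if one was actually found.
        max_buffer_size_(device_ ? device_->max_buffer_length : 0) {
    ++construction_count();
  }

  bool has_device() const { return device_ != nullptr; }
  std::size_t max_buffer_size() const { return max_buffer_size_; }

 private:
  // Members are initialized in declaration order, so device_ is set
  // before max_buffer_size_ reads it.
  const FakeDevice* device_;
  std::size_t max_buffer_size_;
};

// Function-local static: constructed on the first call only, not at
// library load time (initialization is thread-safe since C++11).
inline LazyAllocator& get_allocator() {
  static LazyAllocator instance;
  return instance;
}
```

With this shape, merely loading the library constructs nothing; the allocator is built only when `get_allocator()` is first called, and a missing device yields a zero buffer size instead of a message sent to an invalid pointer.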

Pull Request resolved: #78136
Approved by: https://github.com/seemethere

…78136)

This prevents `import torch` from accidentally crashing on machines with no Metal devices.

Should prevent crashes reported in pytorch#77662 (comment) and https://github.com/pytorch/functorch/runs/6560056366?check_suite_focus=true

Backtrace to the crash:
```
(lldb) bt
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff7202be57 libobjc.A.dylib`objc_msgSend + 23
    frame #1: 0x000000010fd9f524 libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl() + 436
    frame #2: 0x000000010fda011d libtorch_cpu.dylib`_GLOBAL__sub_I_MPSAllocator.mm + 125
    frame #3: 0x000000010ada81e3 dyld`ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) + 535
    frame #4: 0x000000010ada85ee dyld`ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) + 40
(lldb) up
frame #1: 0x000000010fd9f524 libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl() + 436
libtorch_cpu.dylib`at::mps::HeapAllocator::MPSHeapAllocatorImpl::MPSHeapAllocatorImpl:
->  0x10fd9f524 <+436>: movq   %rax, 0x1b0(%rbx)
    0x10fd9f52b <+443>: movw   $0x0, 0x1b8(%rbx)
    0x10fd9f534 <+452>: addq   $0x8, %rsp
    0x10fd9f538 <+456>: popq   %rbx
(lldb) disassemble
 ...
    0x10fd9f514 <+420>: movq   0xf19ad15(%rip), %rsi     ; "maxBufferLength"
    0x10fd9f51b <+427>: movq   %r14, %rdi
    0x10fd9f51e <+430>: callq  *0xeaa326c(%rip)          ; (void *)0x00007fff7202be40: objc_msgSend
```

This corresponds to the `[m_device maxBufferLength]` call, which runs while `m_device` is not initialized, in
https://github.com/pytorch/pytorch/blob/2ae3c59e4bcb8e6e75b4a942cacc2d338c88e609/aten/src/ATen/mps/MPSAllocator.h#L171

Pull Request resolved: pytorch#78136
Approved by: https://github.com/seemethere

facebook-github-bot commented May 24, 2022

❌ 1 New Failures

As of commit affdb65 (more details on the Dr. CI page):

  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages

See GitHub Actions build pull / linux-bionic-rocm5.1-py3.7 / test (default, 2, 2, linux.rocm.gpu) (1/1)

Step: "Test"

2022-05-24T22:57:45.8325649Z RuntimeError: test_sparse_csr failed!
2022-05-24T22:57:42.3587604Z 
2022-05-24T22:57:42.3587865Z Generating XML reports...
2022-05-24T22:57:42.6221853Z Generated XML report: test-reports/python-unittest/test_sparse_csr/TEST-TestSparseCSRCUDA-20220524225709.xml
2022-05-24T22:57:42.6223824Z Generated XML report: test-reports/python-unittest/test_sparse_csr/TEST-TestSparseCSRSampler-20220524225709.xml
2022-05-24T22:57:42.6793143Z Generated XML report: test-reports/python-unittest/test_sparse_csr/TEST-TestSparseCompressedCUDA-20220524225709.xml
2022-05-24T22:57:45.8312019Z Traceback (most recent call last):
2022-05-24T22:57:45.8312826Z   File "test/run_test.py", line 1074, in <module>
2022-05-24T22:57:45.8318550Z     main()
2022-05-24T22:57:45.8319225Z   File "test/run_test.py", line 1052, in main
2022-05-24T22:57:45.8324897Z     raise RuntimeError(err_message)
2022-05-24T22:57:45.8325649Z RuntimeError: test_sparse_csr failed!
2022-05-24T22:57:47.8139845Z 
2022-05-24T22:57:47.8140864Z real	42m42.613s
2022-05-24T22:57:47.8141481Z user	77m7.536s
2022-05-24T22:57:47.8142019Z sys	46m11.644s
2022-05-24T22:57:47.8142571Z + cleanup
2022-05-24T22:57:47.8143108Z + retcode=1
2022-05-24T22:57:47.8143656Z + set +x
2022-05-24T22:57:47.8268985Z ##[error]Process completed with exit code 1.
2022-05-24T22:57:47.8346166Z ##[group]Run # copy test results back to the mounted workspace, needed sudo, resulting permissions were correct

This comment was automatically generated by Dr. CI (expand for details).

@atalman atalman merged commit 2ad18ab into pytorch:release/1.12 May 24, 2022