[caffe2] Fix signal handler deleting siginfo_t in resulting Coredump#174247
Jlalond wants to merge 1 commit into pytorch:main
Dr. CI: ✅ No failures as of commit dc95547 with merge base 91ee748.
…ytorch#174247)

Summary:
This patch fixes the loss of signal info in core dumps produced by caffe2 apps when they crash.

The culprit is the signal handler's call to `raise` after unregistering itself. Under the hood, `raise` calls `tgkill`, which replaces whatever data was in the `siginfo_t` with the uid and pid of the calling process. This means that when the signal is re-raised and the process dumps core, the reason for the core dump is something like `SEGV sent by=your pid, your user`, without the address info or the `si_code` from the original signal. We fix this by re-raising the signal directly with the original signal info.

This is a port of yfeldblum's change in [Folly Signal Handler](facebook/folly@79d7f8e) to caffe2.

Test Plan:
The diff above this one creates a small app that loads the caffe2 app and then SEGVs. Inspecting the core locally:

```
(lldb) thread siginfo
thread #1: tid = 1711969, 0x000000000024f76a, name = 'signal_handler_', stop reason = SIGSEGV: address not mapped to object (fault address=0x1000)
(__lldb_siginfo_t) __lldb_siginfo = {
  si_signo = 11
  si_errno = 0
  si_code = 1
  __pad0 = 0
  _sifields = {
    _kill = (si_pid = 4096, si_uid = 0)
    _timer = {
      si_tid = 4096
      si_overrun = 0
      si_sigval = (sival_int = 0, sival_ptr = 0x0000000000000000)
    }
    _rt = {
      si_pid = 4096
      si_uid = 0
      si_sigval = (sival_int = 0, sival_ptr = 0x0000000000000000)
    }
    _sigchld = (si_pid = 4096, si_uid = 0, si_status = 0, si_utime = 0, si_stime = 0)
    _sigfault = {
      si_addr = 0x0000000000001000
      si_addr_lsb = 0
      _bounds = {
        _addr_bnd = (_lower = 0x0000000000000000, _upper = 0x0000000000000000)
        _pkey = 0
      }
    }
    _sigpoll = (si_band = 4096, si_fd = 0)
    _sigsys = (_call_addr = 0x0000000000001000, _syscall = 0, _arch = 0)
  }
}
```

And we see that the siginfo contains the address which triggered the original SEGV.

Differential Revision: D92093984
Reviewed By: yfeldblum, qxy11
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge failed. Reason: the PR was missing a required label.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours).
Pull Request resolved: pytorch#174247
Approved by: https://github.com/Skylion007
Summary:
This patch fixes the loss of signal info in Coredumps produced by caffe2 apps when they crash.
The culprit is the signal handler's call to `raise` after unregistering itself. Under the hood, `raise` calls `tgkill`, which replaces whatever data was in the `siginfo_t` with the uid and pid of the calling process. This means that when the signal is re-raised and the process dumps core, the reason for the core dump is something like `SEGV sent by=your pid, your user`, without the address info or the `si_code` from the original signal. We fix this by re-raising the signal directly with the original signal info.
This is a port of yfeldblum's change in Folly Signal Handler to caffe2.
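The changed code is not reproduced in this conversation. As a rough illustration of the effect described above, here is a Linux-only sketch that compares re-raising with `raise` against re-queuing the original `siginfo_t` through the `rt_tgsigqueueinfo` system call. That the fix uses this system call is an assumption, and every function and variable name below is hypothetical. To keep the program from crashing, `SIGUSR1` with a queued payload stands in for a real `SIGSEGV`, and a second handler stands in for whatever observes the re-raised signal.

```cpp
#include <cassert>
#include <csignal>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical names throughout; this is not the caffe2 handler.
volatile sig_atomic_t g_seen_code = 0;   // si_code observed after the re-raise
volatile sig_atomic_t g_seen_value = 0;  // payload observed after the re-raise
volatile sig_atomic_t g_preserve = 0;    // 1: re-queue siginfo, 0: plain raise()

// Second stage: records what arrives after the re-raise. In a real crash this
// role is played by the default action and the resulting core dump.
void record(int, siginfo_t* info, void*) {
  g_seen_code = info->si_code;
  g_seen_value = info->si_value.sival_int;
}

// First stage: unregisters itself (here by swapping in `record`), then
// re-raises the signal in one of the two ways being compared.
void reraise(int signum, siginfo_t* info, void*) {
  struct sigaction sa = {};
  sa.sa_sigaction = record;
  sa.sa_flags = SA_SIGINFO;
  sigemptyset(&sa.sa_mask);
  sigaction(signum, &sa, nullptr);
  if (g_preserve) {
    // Assumed approach: queue the signal to this thread with the original info.
    syscall(SYS_rt_tgsigqueueinfo, static_cast<long>(getpid()),
            syscall(SYS_gettid), signum, info);
  } else {
    raise(signum);  // the kernel fills in a fresh siginfo for the sender
  }
  // The signal is blocked while this handler runs, so the re-raised instance
  // stays pending and is delivered to `record` once this handler returns.
}

// Sends SIGUSR1 with payload 42 through both stages and returns the si_code
// that the second stage observed.
int probe(bool preserve) {
  g_preserve = preserve ? 1 : 0;
  g_seen_code = 0;
  g_seen_value = 0;
  struct sigaction sa = {};
  sa.sa_sigaction = reraise;
  sa.sa_flags = SA_SIGINFO;
  sigemptyset(&sa.sa_mask);
  int rc = sigaction(SIGUSR1, &sa, nullptr);
  assert(rc == 0);
  union sigval payload;
  payload.sival_int = 42;
  rc = sigqueue(getpid(), SIGUSR1, payload);
  assert(rc == 0);
  (void)rc;
  return g_seen_code;
}
```

In a single-threaded program, `probe(false)` should report `SI_TKILL`, meaning the original sender information was replaced, while `probe(true)` should report `SI_QUEUE` with the payload intact.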
Test Plan:
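The diff above this one creates a small app that loads the caffe2 app and then SEGVs. Inspecting the core locally with `thread siginfo` in lldb reports `stop reason = SIGSEGV: address not mapped to object (fault address=0x1000)`, with `si_code = 1` and `si_addr = 0x0000000000001000`.
So the siginfo contains the address that triggered the original SEGV.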
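The test app lives in a separate diff and is not shown here. A minimal stand-in, assuming a Linux system and a handler that re-queues the original `siginfo_t` via `rt_tgsigqueueinfo`, might look like the sketch below. The handler, the helper name, and the use of `fork` to contain the crash are all hypothetical; the fault address `0x1000` is chosen only to match the lldb output in the test plan.

```cpp
#include <cassert>
#include <csignal>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical handler: restore the default action, then re-queue the signal
// with the siginfo_t that the kernel originally delivered.
static void crash_handler(int signum, siginfo_t* info, void*) {
  signal(signum, SIG_DFL);
  syscall(SYS_rt_tgsigqueueinfo, static_cast<long>(getpid()),
          syscall(SYS_gettid), signum, info);
  // On return, the pending signal is delivered with the default action. If the
  // re-queue failed, the faulting store runs again and faults with the default
  // action, so the process terminates either way.
}

// Forks a child that writes to the unmapped address 0x1000 and returns the
// number of the signal that terminated it, or -1 if it did not die that way.
int run_crashing_child() {
  pid_t pid = fork();
  assert(pid >= 0);
  if (pid == 0) {
    struct sigaction sa = {};
    sa.sa_sigaction = crash_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, nullptr);
    *reinterpret_cast<volatile int*>(0x1000) = 1;  // triggers SIGSEGV
    _exit(0);  // not reached
  }
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) {
    return -1;
  }
  return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

Calling `run_crashing_child()` should return `SIGSEGV`. If core dumps are enabled for the child, opening the core in lldb and running `thread siginfo` is how the fault address would be checked, as in the test plan.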
Differential Revision: D92093984