Make Ctrl-C handling thread-safe (#7603)#7617
Make Ctrl-C handling thread-safe (#7603)#7617mikulas-patocka wants to merge 3 commits intoZ3Prover:masterfrom
Conversation
The Ctrl-C handling is not thread safe, there's a global variable g_obj
that is being accessed without any locking. The signal handlers are
per-process, not per-thread, so that different threads step over each
other's handlers. It is unpredictable in which thread the signal handler
runs, so the handler may race with the scoped_ctrl_c destructor.
Fix this by introducing the functions signal_lock and signal_unlock.
signal_lock blocks the SIGINT signal and then takes a mutex (so that the
signal handler can't be called while the mutex is held). signal_unlock
drops the mutex and restores the signal mask.
We protect all the global variables with signal_lock and signal_unlock.
Note that on Windows, the SIGINT handler is being run in a separate
thread (and there is no way how to block it), so we can use a simple
mutex to synchronize the signal handler with the other threads.
In class cancel_eh, the operator () may be called concurrently by the
timer code and the Ctrl-C code, but the operator () accesses class'
members without any locking. Fix this race condition by using the
functions signal_lock() and signal_unlock().
There is this possible call trace:
SIGINT signal
on_sigint
a->m_cancel_eh()
cancel_eh::operator()
m_obj.inc_cancel
reslimit::inc_cancel
lock_guard lock(*g_rlimit_mux);
Here we take a mutex from a signal - this is subject to deadlock (if the
signal interrupted another piece of code where the mutex is already
held).
To fix this race, we remove g_rlimit_mux and replace it with
signal_lock() and signal_unlock(). signal_lock and signal_unlock block
the signal before grabbing the mutex, so the signal can't interrupt a
piece of code where the mutex is held and the deadlock won't happen.
Signed-off-by: Mikulas Patocka <[email protected]>
|
I am not following:
The g_rlimit_mux is used exclusively inside the rlimit class and for touching the m_cancel flag + updating children. |
The SIGINT signal could interrupt it. According to the POSIX specification, when the user presses Ctrl+C, the SIGINT signal may be delivered to any thread in the process (the only exception is that the signal couldn't be delivered to a thread that has the signal blocked). So, we grab a mutex, execute some code in the rlimit class with the mutex held, the user presses Ctrl+C, the signal handler is delivered to the thread that already holds the mutex, the signal handler attempts to grab the mutex again and it deadlocks. In order to avoid this deadlock possibility, we need to block the SIGINT signal before we grab the mutex. It's a similar pattern to spinlocks and interrupt masking inside the kernel. If we grab a spinlock and the spinlock may be grabbed by an interrupt, there is deadlock risk. In order to avoid this deadlock, we must block interrupts before grabbing the spinlock and re-enable interrupts after releasing the spinlock. |
| m_canceled = true; | ||
| m_obj.inc_cancel(); | ||
| } | ||
| signal_unlock(); |
There was a problem hiding this comment.
this is a bit peculiar. When m_canceled is true you force an interrupt by using signal_lock, but I don't see how it constrains behaviors.
|
Nuno, I am inclined to merge pending an understanding to my question on why there is a lock being taken in cancel_eh. |
|
Please give me a couple of days to measure perf & read the patch. I'm under a bit under water, sorry.. |
|
Doesn't compile: |
remove stale field
|
@nunoplopes - this would just be a stale field from a merge. Would you be able to check it? |
|
Sorry for the delay.
This protocol is more like a reference counting mechanism that sets a global handler when there's at least one pending handler, and gets rid of the global handler when no more handlers are active. |
|
see #7755 |
The Ctrl-C handling is not thread safe, there's a global variable g_obj that is being accessed without any locking. The signal handlers are per-process, not per-thread, so that different threads step over each other's handlers. It is unpredictable in which thread the signal handler runs, so the handler may race with the scoped_ctrl_c destructor.
Fix this by introducing the functions signal_lock and signal_unlock. signal_lock blocks the SIGINT signal and then takes a mutex (so that the signal handler can't be called while the mutex is held). signal_unlock drops the mutex and restores the signal mask.
We protect all the global variables with signal_lock and signal_unlock.
Note that on Windows, the SIGINT handler is being run in a separate thread (and there is no way how to block it), so we can use a simple mutex to synchronize the signal handler with the other threads.
In class cancel_eh, the operator () may be called concurrently by the timer code and the Ctrl-C code, but the operator () accesses class' members without any locking. Fix this race condition by using the functions signal_lock() and signal_unlock().
There is this possible call trace:
SIGINT signal
on_sigint
a->m_cancel_eh()
cancel_eh::operator()
m_obj.inc_cancel
reslimit::inc_cancel
lock_guard lock(*g_rlimit_mux);
Here we take a mutex from a signal - this is subject to deadlock (if the signal interrupted another piece of code where the mutex is already held).
To fix this race, we remove g_rlimit_mux and replace it with signal_lock() and signal_unlock(). signal_lock and signal_unlock block the signal before grabbing the mutex, so the signal can't interrupt a piece of code where the mutex is held and the deadlock won't happen.