Skip to content

Make Ctrl-C handling thread-safe (#7603)#7617

Closed
mikulas-patocka wants to merge 3 commits intoZ3Prover:masterfrom
mikulas-patocka:ctrl-c-races
Closed

Make Ctrl-C handling thread-safe (#7603)#7617
mikulas-patocka wants to merge 3 commits intoZ3Prover:masterfrom
mikulas-patocka:ctrl-c-races

Conversation

@mikulas-patocka
Copy link
Copy Markdown
Contributor

The Ctrl-C handling is not thread safe, there's a global variable g_obj that is being accessed without any locking. The signal handlers are per-process, not per-thread, so that different threads step over each other's handlers. It is unpredictable in which thread the signal handler runs, so the handler may race with the scoped_ctrl_c destructor.

Fix this by introducing the functions signal_lock and signal_unlock. signal_lock blocks the SIGINT signal and then takes a mutex (so that the signal handler can't be called while the mutex is held). signal_unlock drops the mutex and restores the signal mask.

We protect all the global variables with signal_lock and signal_unlock.

Note that on Windows, the SIGINT handler is being run in a separate thread (and there is no way how to block it), so we can use a simple mutex to synchronize the signal handler with the other threads.

In class cancel_eh, the operator () may be called concurrently by the timer code and the Ctrl-C code, but the operator () accesses class' members without any locking. Fix this race condition by using the functions signal_lock() and signal_unlock().

There is this possible call trace:
SIGINT signal
on_sigint
a->m_cancel_eh()
cancel_eh::operator()
m_obj.inc_cancel
reslimit::inc_cancel
lock_guard lock(*g_rlimit_mux);

Here we take a mutex from a signal - this is subject to deadlock (if the signal interrupted another piece of code where the mutex is already held).

To fix this race, we remove g_rlimit_mux and replace it with signal_lock() and signal_unlock(). signal_lock and signal_unlock block the signal before grabbing the mutex, so the signal can't interrupt a piece of code where the mutex is held and the deadlock won't happen.

The Ctrl-C handling is not thread safe, there's a global variable g_obj
that is being accessed without any locking. The signal handlers are
per-process, not per-thread, so that different threads step over each
other's handlers. It is unpredictable in which thread the signal handler
runs, so the handler may race with the scoped_ctrl_c destructor.

Fix this by introducing the functions signal_lock and signal_unlock.
signal_lock blocks the SIGINT signal and then takes a mutex (so that the
signal handler can't be called while the mutex is held). signal_unlock
drops the mutex and restores the signal mask.

We protect all the global variables with signal_lock and signal_unlock.

Note that on Windows, the SIGINT handler is being run in a separate
thread (and there is no way how to block it), so we can use a simple
mutex to synchronize the signal handler with the other threads.

In class cancel_eh, the operator () may be called concurrently by the
timer code and the Ctrl-C code, but the operator () accesses class'
members without any locking. Fix this race condition by using the
functions signal_lock() and signal_unlock().

There is this possible call trace:
        SIGINT signal
        on_sigint
        a->m_cancel_eh()
        cancel_eh::operator()
        m_obj.inc_cancel
        reslimit::inc_cancel
        lock_guard lock(*g_rlimit_mux);

Here we take a mutex from a signal - this is subject to deadlock (if the
signal interrupted another piece of code where the mutex is already
held).

To fix this race, we remove g_rlimit_mux and replace it with
signal_lock() and signal_unlock(). signal_lock and signal_unlock block
the signal before grabbing the mutex, so the signal can't interrupt a
piece of code where the mutex is held and the deadlock won't happen.

Signed-off-by: Mikulas Patocka <[email protected]>
@NikolajBjorner
Copy link
Copy Markdown
Contributor

NikolajBjorner commented Apr 18, 2025

I am not following:

(if the signal interrupted another piece of code where the mutex is already held).

The g_rlimit_mux is used exclusively inside the rlimit class and for touching the m_cancel flag + updating children.
What action within rlimit would cause a signal? The main unsafe operation I see is memory allocation which could page fault and even fail.

@mikulas-patocka
Copy link
Copy Markdown
Contributor Author

I am not following:

(if the signal interrupted another piece of code where the mutex is already held).

The g_rlimit_mux is used exclusively inside the rlimit class and for touching the m_cancel flag + updating children. What action within rlimit would cause a signal? The main unsafe operation I see is memory allocation which could page fault and even fail.

The SIGINT signal could interrupt it. According to the POSIX specification, when the user presses Ctrl+C, the SIGINT signal may be delivered to any thread in the process (the only exception is that the signal couldn't be delivered to a thread that has the signal blocked).

So, we grab a mutex, execute some code in the rlimit class with the mutex held, the user presses Ctrl+C, the signal handler is delivered to the thread that already holds the mutex, the signal handler attempts to grab the mutex again and it deadlocks.

In order to avoid this deadlock possibility, we need to block the SIGINT signal before we grab the mutex.

It's a similar pattern to spinlocks and interrupt masking inside the kernel. If we grab a spinlock and the spinlock may be grabbed by an interrupt, there is deadlock risk. In order to avoid this deadlock, we must block interrupts before grabbing the spinlock and re-enable interrupts after releasing the spinlock.

m_canceled = true;
m_obj.inc_cancel();
}
signal_unlock();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bit peculiar. When m_canceled is true you force an interrupt by using signal_lock, but I don't see how it constrains behaviors.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mikulas-patocka - I still have this question

@NikolajBjorner
Copy link
Copy Markdown
Contributor

Nuno, I am inclined to merge pending an understanding to my question on why there is a lock being taken in cancel_eh.
Perf-wise, I take this code isn't on any critical path so doesn't affect Alive2 or other use cases (other than fixing thread safety).

@nunoplopes
Copy link
Copy Markdown
Collaborator

Please give me a couple of days to measure perf & read the patch. I'm under a bit under water, sorry..

@nunoplopes
Copy link
Copy Markdown
Collaborator

Doesn't compile:

../src/util/scoped_ctrl_c.cpp: In constructor ‘scoped_ctrl_c::scoped_ctrl_c(event_handler&, bool, bool)’:
../src/util/scoped_ctrl_c.cpp:110:5: error: class ‘scoped_ctrl_c’ does not have any field named ‘m_old_scoped_ctrl_c’
  110 |     m_old_scoped_ctrl_c(g_obj) {
      |     ^~~~~~~~~~~~~~~~~~~
src/math/polynomial/rpolynomial.cpp
../src/util/scoped_ctrl_c.cpp:110:25: error: ‘g_obj’ was not declared in this scope
  110 |     m_old_scoped_ctrl_c(g_obj) {
      |                         ^~~~~

remove stale field
@NikolajBjorner
Copy link
Copy Markdown
Contributor

@nunoplopes - this would just be a stale field from a merge. Would you be able to check it?

@nunoplopes
Copy link
Copy Markdown
Collaborator

Sorry for the delay.
I was thinking about this, and the protocol seems overly complicated.
My suggestion:

  • Have a global old handler
  • When the the first ctrl-c constructor is called, it arranges a signal handler and records the old handler
  • When subsequent constructors are called, they add themselves to the list of handlers
  • On signal delivery, the global handler triggers all handlers
  • On destructor call, if it's the last one, reverse the global signal handler, otherwise just remove itself from the list

This protocol is more like a reference counting mechanism that sets a global handler when there's at least one pending handler, and gets rid of the global handler when no more handlers are active.

@NikolajBjorner
Copy link
Copy Markdown
Contributor

see #7755

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants