Implement async-signal-safe lock for DwarfFDECache#34

Merged
al13n321 merged 1 commit into ClickHouse:master from amosbird:deadlock-fix on Dec 21, 2024

Conversation

@amosbird

@amosbird amosbird commented Dec 15, 2024

There are at least three types of stack unwinding in ClickHouse:

  1. Signal-triggered unwinding
     - SIGUSR1: used by the CPU Profiler
     - SIGUSR2: used by the Real Profiler
     - SIGTRIM: used by system.stack_trace
  2. Jemalloc profile unwinding
  3. Exception unwinding

During stack unwinding, libunwind manipulates the DwarfFDECache, which acquires a read-write mutex (RWMutex). This mutex is not async-signal-safe, which can lead to serious issues. In particular, a signal handler might try to acquire the mutex while the current thread is already in the process of unwinding and acquiring the same mutex, potentially corrupting the mutex state and making debugging extremely difficult. A deadlock during stack unwinding typically results in a "stop-the-world" situation.

An intuitive way to solve this problem is to block the relevant signals before unwinding the stack and unblock them afterwards. However, it is almost impossible to unblock them reliably after exception unwinding, since a thrown exception transfers control to a distant catch handler and may never return to the code that blocked the signals.
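
The rejected signal-blocking approach would look roughly like the sketch below (a hypothetical illustration, not code from libunwind or this PR; the function name is made up):

```cpp
#include <csignal>
#include <pthread.h>

// Hypothetical sketch of the rejected approach: mask the profiling signals
// around an unwind. This works for a bounded critical section, but if the
// unwind happens because an exception is propagating, control jumps to a
// catch handler and the restore call below may never execute, leaving the
// signals blocked forever.
void unwind_with_signals_blocked() {
    sigset_t block_set, old_set;
    sigemptyset(&block_set);
    sigaddset(&block_set, SIGUSR1);  // CPU profiler signal
    sigaddset(&block_set, SIGUSR2);  // Real profiler signal
    pthread_sigmask(SIG_BLOCK, &block_set, &old_set);

    // ... perform stack unwinding here ...

    // Skipped entirely if an exception unwinds past this frame.
    pthread_sigmask(SIG_SETMASK, &old_set, nullptr);
}
```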

Another way is to remove the use of the pthread rwlock, which is not async-signal-safe. This PR implements a simple lock-free RWMutex that should be good enough for stack-unwinding purposes. It has been verified across a cluster of over 100 nodes: prior to this change, the system experienced node deadlocks every 3-5 days; after applying the PR, the cluster has been stable for two weeks without any reported issues.
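
A minimal sketch of the kind of lock-free RWMutex described above (illustrative only; the names, fields, and back-off policy here are assumptions, not the PR's actual code):

```cpp
#include <atomic>

// Sketch of an async-signal-safe read-write lock built purely on atomics,
// in the spirit of this PR: no pthread primitives, so re-entering from a
// signal handler cannot corrupt internal mutex state.
struct SpinRWMutex {
    // state == 0: free; state > 0: number of active readers;
    // state == -1: one active writer.
    std::atomic<int> state{0};
    std::atomic<int> waiting_writers{0};

    void lock_shared() {
        while (true) {
            // Let pending writers go first to avoid writer starvation.
            if (waiting_writers.load(std::memory_order_relaxed) == 0) {
                int s = state.load(std::memory_order_relaxed);
                if (s >= 0 && state.compare_exchange_weak(
                                  s, s + 1, std::memory_order_acquire))
                    return;
            }
        }
    }

    void unlock_shared() { state.fetch_sub(1, std::memory_order_release); }

    void lock() {
        waiting_writers.fetch_add(1, std::memory_order_relaxed);
        while (true) {
            int expected = 0;
            if (state.compare_exchange_weak(expected, -1,
                                            std::memory_order_acquire)) {
                waiting_writers.fetch_sub(1, std::memory_order_relaxed);
                return;
            }
        }
    }

    void unlock() { state.store(0, std::memory_order_release); }
};
```

Note that this sketch only spins; a real implementation would still deadlock if a signal handler tried to write-lock while the same thread already holds the lock, which is why only the read path is expected on the signal side.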

Similar issue: ClickHouse/ClickHouse#69904

@al13n321 al13n321 self-assigned this Dec 20, 2024
Member

@al13n321 al13n321 left a comment


Neat, thanks for the fix!

(I suspect the implementation would be slightly simpler and faster if the two counters were packed into one 64-bit atomic, but it probably doesn't matter.)
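
The packing idea floated here could look something like the following (a hypothetical sketch of the reviewer's suggestion, not anything that was actually written):

```cpp
#include <atomic>
#include <cstdint>

// Sketch: keep both counters in a single 64-bit atomic so every state
// transition is one CAS. Low 32 bits: reader count, with an all-ones
// sentinel meaning "writer holds the lock"; high 32 bits: waiting writers.
constexpr uint64_t READERS_MASK = 0xffffffffull;
constexpr uint64_t WRITER_HELD  = READERS_MASK;  // sentinel in the low bits

std::atomic<uint64_t> packed{0};

bool try_lock_shared() {
    uint64_t s = packed.load(std::memory_order_relaxed);
    // Refuse if a writer holds the lock or is waiting for it.
    if ((s & READERS_MASK) == WRITER_HELD || (s >> 32) != 0)
        return false;
    // One CAS observes and updates both counters atomically.
    return packed.compare_exchange_strong(s, s + 1, std::memory_order_acquire);
}

void unlock_shared() { packed.fetch_sub(1, std::memory_order_release); }
```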

while (true) {
    int expected = 0;
    if (atomic_compare_exchange_weak(&state, &expected, -1)) {
        atomic_fetch_sub(&waiting_writers, 1);
Member

@al13n321 al13n321 Dec 21, 2024


Nitpick (EDIT: meh, merging without it):

Suggested change
-    atomic_fetch_sub(&waiting_writers, 1);
+    int current_waiting_writers = atomic_fetch_sub(&waiting_writers, 1);
+    if (current_waiting_writers <= 0) {
+        abort();
+    }

@al13n321 al13n321 merged commit 72fb634 into ClickHouse:master Dec 21, 2024
@hanfei1991
Member

@al13n321 why is this PR not merged to master yet?

@al13n321
Member

Oops, ClickHouse/ClickHouse#76107

@nickitat
Member

> Neat, thanks for the fix!
>
> (I suspect the implementation would be slightly simpler and faster if the two counters were packed into one 64-bit atomic, but it probably doesn't matter.)

It'd probably make the correctness of the implementation more plausible if we kept all state (not just the one variable that happens to be called state :)) atomic and always changed it atomically.

@al13n321
Member

Turns out DwarfFDECache is supposed to be disabled in clickhouse (using _LIBUNWIND_NO_HEAP), so we shouldn't be using this mutex at all: #35

@amosbird
Author

> Turns out DwarfFDECache is supposed to be disabled in clickhouse (using _LIBUNWIND_NO_HEAP), so we shouldn't be using this mutex at all: #35

I originally wanted to delete this cache too, but I'm worried that it might have a performance impact on stack unwinding.
