Fix SIGSEGV due to CPU/Real profiler #63865
The problem was caused by incorrect unwinding from signal handlers,
which led to incorrect DWARF (FDE/CIE) interpretation.
After this patch I was not able to reproduce the crash for a couple of
hours, while before it reproduced very reliably (I've reduced the minimal
threshold for query_profiler_real_time_period_ns), using simply:
$ clickhouse-benchmark --port 19000 -q "SELECT * FROM remote('127.{1..10}', system, one)" --query_profiler_real_time_period_ns=1
Note that I'm using remote() here to get fibers, which have stacks with
guard pages; this helps to reproduce the crash faster.
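To illustrate why guard pages make the crash easier to hit, here is a minimal sketch of a stack allocation with a guard page. This is not ClickHouse's fiber stack allocator; the names `GuardedStack`, `allocateGuardedStack` and `freeGuardedStack` are hypothetical, and the sketch assumes a POSIX system with `mmap`/`mprotect` and a stack that grows downward. Any read below the usable region (for example by an unwinder walking past the real frames) faults immediately instead of silently reading neighbouring memory.

```cpp
#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical descriptor of a stack mapping with one guard page at the low end.
struct GuardedStack
{
    void * base;            // start of the whole mapping (this is the guard page)
    std::size_t total_size; // guard page + usable region, in bytes
    std::size_t guard_size; // size of the inaccessible guard page, in bytes
};

// Allocate at least `usable_size` bytes of stack preceded by a PROT_NONE page.
// Returns {nullptr, 0, 0} on failure.
GuardedStack allocateGuardedStack(std::size_t usable_size)
{
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    // Round the usable size up to a whole number of pages.
    const std::size_t usable = (usable_size + page - 1) / page * page;
    const std::size_t total = usable + page;

    void * mem = mmap(nullptr, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return {nullptr, 0, 0};

    // Stacks grow downward, so the lowest page becomes the guard:
    // touching it raises SIGSEGV.
    if (mprotect(mem, page, PROT_NONE) != 0)
    {
        munmap(mem, total);
        return {nullptr, 0, 0};
    }
    return {mem, total, page};
}

// Release a mapping obtained from allocateGuardedStack().
void freeGuardedStack(const GuardedStack & stack)
{
    if (stack.base)
        munmap(stack.base, stack.total_size);
}
```

The usable region starts at `base + guard_size`; only the first page of the mapping is inaccessible.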
P.S. I also have another implementation of this fix that does not patch
libunwind and uses the info from the signal context directly. Even though
it is better, because you don't need to trim extra frames and can use
all 45 frames for something useful, it is too complex, so let's go
with the simpler patch first; I think it could even be backported.
Signed-off-by: Azat Khuzhin <[email protected]>
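For readers unfamiliar with the "signal context" mentioned in the P.S., here is a minimal sketch of reading the interrupted program counter from it. This is not the alternative implementation the author refers to, and it is not code from this patch; the names `profilerSignalHandler` and `captureInterruptedPC` are hypothetical, and the sketch assumes Linux on x86_64 or aarch64 with glibc-style `ucontext_t` (on other platforms it simply reports 0). The idea is that a handler installed with `SA_SIGINFO` receives the saved register state of the interrupted code, so an unwinder could start from there rather than from inside the handler's own frames.

```cpp
#include <cassert>
#include <cstdint>
#include <signal.h>
#include <ucontext.h>

// Program counter of the code that was interrupted by the signal.
static volatile std::uintptr_t g_interrupted_pc = 0;

// SA_SIGINFO handler: the third argument points to the ucontext_t that
// holds the register state saved when the signal was delivered.
static void profilerSignalHandler(int, siginfo_t *, void * context)
{
    const ucontext_t * uc = static_cast<const ucontext_t *>(context);
#if defined(__linux__) && defined(__x86_64__)
    g_interrupted_pc = static_cast<std::uintptr_t>(uc->uc_mcontext.gregs[REG_RIP]);
#elif defined(__linux__) && defined(__aarch64__)
    g_interrupted_pc = static_cast<std::uintptr_t>(uc->uc_mcontext.pc);
#else
    (void)uc; // Platform not covered by this sketch.
    g_interrupted_pc = 0;
#endif
}

// Install the handler for SIGUSR1, deliver the signal to the calling thread,
// and return the program counter recorded by the handler.
std::uintptr_t captureInterruptedPC()
{
    struct sigaction sa {};
    sa.sa_sigaction = profilerSignalHandler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);

    const int rc = sigaction(SIGUSR1, &sa, nullptr);
    assert(rc == 0);
    (void)rc;

    // In a single-threaded program the signal is handled before raise() returns.
    raise(SIGUSR1);
    return g_interrupted_pc;
}
```

A real profiler would use a timer signal rather than `raise()`, and would have to restrict itself to async-signal-safe operations inside the handler.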
This is an automated comment for commit 94041d1 with a description of existing statuses. It's updated for the latest CI run.
@azat Let's fix all CI failures.
The failures do not look related; most of them are flaky.
Looks like some hang in - will take a look later
The error for the query is But it fails on the left server anyway (i.e. upstream/master), and sometimes its execution took ~15 seconds; will take a look later. The query is
Backports?
Agreed, makes sense (it should not make anything worse)
Backport #63865 to 24.3: Fix SIGSEGV due to CPU/Real profiler
Backport #63865 to 24.4: Fix SIGSEGV due to CPU/Real profiler
Backport #63865 to 24.2: Fix SIGSEGV due to CPU/Real profiler
…d update) (#11324)
* ClickHouse/ClickHouse#63865 from azat/fix-query-profiler-SIGSEGV Fix SIGSEGV due to CPU/Real profiler
* ClickHouse/ClickHouse#60468 from ClickHouse/libunwind-fix-crash Fix crash in libunwind while interpreting debug info
* ClickHouse/ClickHouse#65509 Update libunwind to 18.1.7
* ClickHouse/ClickHouse#66850 from ClickHouse/revert-libunwind-patch Revert libunwind patch
* ClickHouse/ClickHouse#66977 from ClickHouse/uwo Apply libunwind fix
* ClickHouse/ClickHouse#68312 from ClickHouse/muslwind Apply libunwind changes needed for musl
* ClickHouse/ClickHouse#76107 from ClickHouse/owo Apply libunwind fix for DwarfFDECache
* ClickHouse/ClickHouse#76136 from ClickHouse/revert-76107-owo Revert "Apply libunwind fix for DwarfFDECache"
* ClickHouse/ClickHouse#76178 from ClickHouse/unw Apply libunwind fix for DwarfFDECache, attempt 2
* Update libunwind to aec8e58
* ClickHouse/ClickHouse#67152 from ClickHouse/ohno Uncomment accidentally commented out code in QueryProfiler
* ClickHouse/ClickHouse#64058 aarch64 sigaltstack size fix
* ClickHouse/ClickHouse#58607 Added null guards to avoid potential demangle crashes
---------
Co-authored-by: Nikita Mikhaylov <[email protected]>
Co-authored-by: Alexey Milovidov <[email protected]>
Co-authored-by: Michael Kolupaev <[email protected]>
Co-authored-by: Antonio Andelic <[email protected]>
Co-authored-by: Max Kainov <[email protected]>
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix SIGSEGV due to the CPU/Real profiler (query_profiler_real_time_period_ns/query_profiler_cpu_time_period_ns). This has been an issue since 2022 and led to periodic server crashes, especially if you were using the Distributed engine.
Fixes: #60460
Fixes: #33531
Fixes: #60219
Refs: ClickHouse/libunwind#25
Refs: llvm/llvm-project#92291