Skip to content

Fix crash in libunwind while interpreting debug info#60468

Merged
alexey-milovidov merged 25 commits intomasterfrom
libunwind-fix-crash
May 21, 2024
Merged

Fix crash in libunwind while interpreting debug info#60468
alexey-milovidov merged 25 commits intomasterfrom
libunwind-fix-crash

Conversation

@alexey-milovidov
Copy link
Copy Markdown
Member

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix a crash in asynchronous stack unwinding (such as when using the sampling query profiler) while interpreting debug info. This closes #60460.

Interesting if this will help.

@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added pr-improvement Pull request with some product improvements submodule changed At least one submodule changed in this PR. labels Feb 27, 2024
@robot-clickhouse-ci-2
Copy link
Copy Markdown
Contributor

robot-clickhouse-ci-2 commented Feb 27, 2024

This is an automated comment for commit 29f7a76 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Check nameDescriptionStatus
CI runningA meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR⏳ pending
ClickHouse build checkBuilds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often has enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process❌ failure
Mergeable CheckChecks if all other necessary checks are successful❌ failure
Stateful testsRuns stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc❌ failure
Stateless testsRuns stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc❌ failure
Successful checks
Check nameDescriptionStatus
A SyncThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
AST fuzzerRuns randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help✅ success
ClickBenchRuns [ClickBench](https://github.com/ClickHouse/ClickBench/) with instant-attach table✅ success
Compatibility checkChecks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help✅ success
Docker keeper imageThe check to build and optionally push the mentioned image to docker hub✅ success
Docker server imageThe check to build and optionally push the mentioned image to docker hub✅ success
Docs checkBuilds and tests the documentation✅ success
Fast testNormally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here✅ success
Flaky testsChecks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integrational tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc✅ success
Install packagesChecks that the built packages are installable in a clear environment✅ success
Integration testsThe integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests✅ success
PR CheckThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
Performance ComparisonMeasure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests✅ success
Stress testRuns stateless functional tests concurrently from several clients to detect concurrency-related errors✅ success
Style checkRuns a set of checks to keep the code style clean. If some of tests failed, see the related log from the report✅ success
Unit testsRuns the unit tests for different release types✅ success
Upgrade checkRuns stress tests on server version from last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully startup without any errors, crashes or sanitizer asserts✅ success

@al13n321 al13n321 self-assigned this Feb 28, 2024
@azat
Copy link
Copy Markdown
Member

azat commented Mar 28, 2024

Stress test (debug) — Found signal in gdb.log

Looks like not all places are guarded:

2024-03-11 11:27:02 Thread 1392 "QueryPipelineEx" received signal SIGSEGV, Segmentation fault.
2024-03-11 11:27:02 [Switching to Thread 0x7f9c846e9640 (LWP 5197)]
2024-03-11 11:27:02 0x0000000026b22370 in libunwind::DwarfInstructions<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::getSavedRegister (addressSpace=..., registers=..., cfa=cfa@entry=48, savedReg=...) at /build/contrib/libunwind/src/DwarfInstructions.hpp:98
2024-03-11 11:27:02 #0  0x0000000026b22370 in libunwind::DwarfInstructions<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::getSavedRegister (addressSpace=..., registers=..., cfa=cfa@entry=48, savedReg=...) at /build/contrib/libunwind/src/DwarfInstructions.hpp:98
2024-03-11 11:27:02 No locals.
2024-03-11 11:27:02 #1  0x0000000026b2050d in libunwind::DwarfInstructions<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::stepWithDwarf (addressSpace=..., pc=<optimized out>, fdeStart=<optimized out>, registers=..., isSignalFrame=@0x7f9c846d66b1: true) at /build/contrib/libunwind/src/DwarfInstructions.hpp:246
2024-03-11 11:27:02         i = 6
2024-03-11 11:27:02         newRegisters = {_registers = {__rax = 1, __rbx = 140310213424425, __rcx = 140328309704872, __rdx = 0, __rdi = 140328309704872, __rsi = 140328309704872, __rbp = 32, __rsp = 48, __r8 = 388732109466758666, __r9 = 1442695040888963407, __r10 = 6364136223846793005, __r11 = 582, __r12 = 140310213465664, __r13 = 0, __r14 = 4865540595714422341, __r15 = 2549297995355413924, __rip = <optimized out>, __rflags = 0, __cs = 0, __fs = 0, __gs = 140310213390112}}
2024-03-11 11:27:02         cfa = 48
2024-03-11 11:27:02         returnAddress = 0
2024-03-11 11:27:02         lastReg = 32
2024-03-11 11:27:02         prolog = {cfaRegister = 6, cfaRegisterOffset = 16, cfaExpression = 0, spExtraArgSize = 0, savedRegisters = {{location = libunwind::CFI_Parser<libunwind::LocalAddressSpace>::kRegisterUnused, initialStateSaved = false, value = 0}, {location = libunwind::CFI_Parser<libunwind::LocalAddressSpace>::kRegisterUnused, initialStateSaved = false, value = 0}, {location = libunwind::CFI_Parser<libunwind::LocalAddressSpace>::kRegisterUnused, initialStateSaved = false, value = 0}, {location = libunwind::CFI_Parser<libunwind::LocalAddressSpace>::kRegisterUnused, initialStateSaved = false, value = 0}, {location = libunwind::CFI_Parser<libunwind::LocalAddressSpace>::kRegisterUnused, initialStateSaved = false, value = 0}, {location = libunwind::CFI_Parser<libunwind::LocalAddressSpace>::kRegisterUnused, initialStateSaved = false, value = 0}, {location = libunwind::CFI_Parser<libunwind::LocalAddressSpace>::kRegisterInCFA, initialStateSaved = true, value = -16}, {location = libunwind::CFI_Parser<libunwind::LocalAddressSpace>::kRegisterUnused, initialStateSaved = false, value = 0}, {location = libunwind::CFI_Parser<libunwind::LocalAddressSpace>::kRegisterUnused, initialStateSaved = false, value = 0}, {location = libunwind::CFI_Parser<libunwind::LocalAddressSpace>::kRegisterUnused, initialStateSaved = false, value = 0}, {location = libunwind::CFI_Parser<libunwind::LocalAddressSpace>::kRegisterUnused, initialStateSaved = false, value = 0}, {location = libunwind::CFI_Parser<libunwind::LocalAddressSpace>::kRegisterUnused, initialStateSaved = false, value = 0}, {location = libunwind::CFI_Parser<libunwind::LocalAddressSpace>::kRegisterUnused, initialStateSaved = false, value = 0}, {location = libunwind::CFI_Parser<libunwind::LocalAddressSpace>::kRegisterUnused, initialStateSaved = false, value = 0}, {location = libunwind::CFI_Parser<libunwind::LocalAddressSpace>::kRegisterUnused, initialStateSaved = false, value = 0}, {location = libunwind::CFI_Parser<libunwind::LocalAddressSpace>::kRegisterUnused, initialStateSaved = false, value = 0}, {location = libunwind::CFI_Parser<libunwind::LocalAddressSpace>::kRegisterInCFA, initialStateSaved = true, value = -8}, {location = libunwind::CFI_Parser<libunwind::LocalAddressSpace>::kRegisterUnused, initialStateSaved = false, value = 0} <repeats 16 times>}}
2024-03-11 11:27:02         fdeInfo = {fdeStart = 113127456, fdeLength = 32, fdeInstructions = 113127473, pcStart = 326253632, pcEnd = 326253668, lsda = 0}
2024-03-11 11:27:02         cieInfo = {cieStart = 96318352, cieLength = 24, cieInstructions = 96318369, pointerEncoding = 27 '\033', lsdaEncoding = 255 '\377', personalityEncoding = 0 '\000', personalityOffsetInCIE = 0 '\000', personality = 0, codeAlignFactor = 1, dataAlignFactor = -8, isSignalFrame = false, fdesHaveAugmentationData = true, returnAddressRegister = 16 '\020'}
2024-03-11 11:27:02 #2  0x0000000026b1fee4 in libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::stepWithDwarfFDE (this=0x7f9c846d65b0) at /build/contrib/libunwind/src/UnwindCursor.hpp:1002
2024-03-11 11:27:02 No locals.
2024-03-11 11:27:02 #3  libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::step (this=0x7f9c846d65b0) at /build/contrib/libunwind/src/UnwindCursor.hpp:2849
2024-03-11 11:27:02         result = <optimized out>
2024-03-11 11:27:02 #4  0x0000000026b1f5c7 in unw_backtrace (buffer=0x7f9c846d6870, size=45) at /build/contrib/libunwind/src/libunwind.cpp:352
2024-03-11 11:27:02         context = {data = {140310213388400, 140310213424425, 0, 140310213388736, 140310213387528, 45, 45, 140310213387520, 0, 2221775936, 8, 582, 140310213465664, 0, 140310213388400, 10273862115808062644, 649196864, 0, 0, 0, 140310213390112}}
2024-03-11 11:27:02         cursor = {data = {660527232, 667202496, 1, 140310213424425, 140328309704872, 0, 140328309704872, 140328309704872, 32, 140310213393272, 388732109466758666, 1442695040888963407, 6364136223846793005, 582, 140310213465664, 0, 4865540595714422341, 2549297995355413924, 326253667, 0, 0, 0, 140310213390112, 326253632, 326253668, 0, 0, 0, 0, 137506062336, 113127456, 178268096, 256}}
2024-03-11 11:27:02         ip = 326253667
2024-03-11 11:27:02         current = 15
2024-03-11 11:27:02 #5  0x00000000136db8aa in StackTrace::tryCapture (this=0x7f9c846d6860) at /build/src/Common/StackTrace.cpp:349
2024-03-11 11:27:02 No locals.
2024-03-11 11:27:02 #6  0x00000000136db7a0 in StackTrace::StackTrace (this=0x7f9c846d6860, signal_context=...) at /build/src/Common/StackTrace.cpp:318
2024-03-11 11:27:02         caller_address = 0x7f9c846d6860
2024-03-11 11:27:02 #7  0x0000000013713426 in std::__1::construct_at[abi:v15000]<StackTrace, ucontext_t const&, StackTrace*>(StackTrace*, ucontext_t const&) (__location=0x7f9c846d6860, __args=...) at /build/contrib/llvm-project/libcxx/include/__memory/construct_at.h:35
2024-03-11 11:27:02 No locals.
2024-03-11 11:27:02 #8  0x0000000013713367 in std::__1::__optional_storage_base<StackTrace, false>::__construct[abi:v15000]<ucontext_t const&>(ucontext_t const&) (this=0x7f9c846d6860, __args=...) at /build/contrib/llvm-project/libcxx/include/optional:374
2024-03-11 11:27:02 No locals.
2024-03-11 11:27:02 #9  0x000000001371314a in std::__1::optional<StackTrace>::emplace[abi:v15000]<ucontext_t const&, void>(ucontext_t const&) (this=0x7f9c846d6860, __args=...) at /build/contrib/llvm-project/libcxx/include/optional:898
2024-03-11 11:27:02 No locals.
2024-03-11 11:27:02 #10 0x00000000137118d0 in DB::(anonymous namespace)::writeTraceInfo (trace_type=DB::TraceType::Real, info=0x7f9c846d6f70, context=0x7f9c846d6e40) at /build/src/Common/QueryProfiler.cpp:97

@alexey-milovidov
Copy link
Copy Markdown
Member Author

@azat, this is just a report from gdb about a signal. After this change, we allow the segmentation of fault signals at runtime when they are produced inside libunwind.

@azat
Copy link
Copy Markdown
Member

azat commented May 14, 2024

FWIW this PR makes unwidning at least 2x faster (due to removing mincore I guess), the test involves too much extra stuff, but still it shows the improvement.

Details

Preparation

cat > cat config.d/overrides.yml <<<EOL
logger:
  level: none
EOL

root@bb17d67109a1:~/.server# clickhouse-server -- --trace_log=
Processing configuration file 'config.xml'.
There is no file 'config.xml', will use embedded config.
Merging configuration file 'config.d/overrides.yml'.
chc -q 'truncate table system.trace_log'; time clickhouse-benchmark -t 30 -c 300 -q 'select sleep(3)' --query_profiler_cpu_time_period_ns=0 --query_profiler_real_time_period_ns=1; chc -nm -q 'system flush logs; select count() from system.trace_log;'

Below I will mention period, this is sampling period -

period = std::max<UInt32>(period, 1000000);

upstream

period=1000000

Stopping launch of queries. Requested time limit 30 seconds is exhausted.
Queries executed: 2929.
localhost:9000, queries: 2929, QPS: 88.673, RPS: 88.673, MiB/s: 0.000, result RPS: 88.673, result MiB/s: 0.000.

0.000%          3.002 sec.
10.000%         3.009 sec.
20.000%         3.012 sec.
30.000%         3.015 sec.
40.000%         3.018 sec.
50.000%         3.023 sec.
60.000%         3.030 sec.
70.000%         3.043 sec.
80.000%         3.069 sec.
90.000%         3.286 sec.
95.000%         3.617 sec.
99.000%         5.013 sec.
99.900%         6.153 sec.
99.990%         6.477 sec.



real    0m33.106s
user    0m7.724s
sys     0m2.787s
1'331'387

period=100000

Stopping launch of queries. Requested time limit 30 seconds is exhausted.
Queries executed: 451.
localhost:9000, queries: 451, QPS: 12.055, RPS: 12.055, MiB/s: 0.000, result RPS: 12.055, result MiB/s: 0.000.

0.000%          3.058 sec.
10.000%         3.399 sec.
20.000%         7.653 sec.
30.000%         11.116 sec.
40.000%         17.415 sec.
50.000%         24.347 sec.
60.000%         27.344 sec.
70.000%         29.608 sec.
80.000%         31.865 sec.
90.000%         33.944 sec.
95.000%         35.551 sec.
99.000%         36.496 sec.
99.900%         36.995 sec.
99.990%         36.995 sec.



real    0m37.487s
user    0m3.245s
sys     0m1.738s
951637

Stopping launch of queries. Requested time limit 30 seconds is exhausted.
Queries executed: 541.
localhost:9000, queries: 541, QPS: 14.101, RPS: 14.101, MiB/s: 0.000, result RPS: 14.101, result MiB/s: 0.000.

0.000%          3.078 sec.
10.000%         3.553 sec.
20.000%         5.609 sec.
30.000%         7.059 sec.
40.000%         9.604 sec.
50.000%         14.985 sec.
60.000%         25.467 sec.
70.000%         27.233 sec.
80.000%         29.867 sec.
90.000%         34.375 sec.
95.000%         35.447 sec.
99.000%         37.696 sec.
99.900%         37.812 sec.
99.990%         37.909 sec.



real    0m38.445s
user    0m3.552s
sys     0m1.705s
1005665

SIGSEGV via signal

period=100000

Stopping launch of queries. Requested time limit 30 seconds is exhausted.
Queries executed: 2290.
localhost:9000, queries: 2290, QPS: 68.366, RPS: 68.366, MiB/s: 0.000, result RPS: 68.366, result MiB/s: 0.000.

0.000%          3.117 sec.
10.000%         3.298 sec.
20.000%         3.353 sec.
30.000%         3.407 sec.
40.000%         3.455 sec.
50.000%         3.499 sec.
60.000%         3.564 sec.
70.000%         3.651 sec.
80.000%         3.900 sec.
90.000%         4.625 sec.
95.000%         5.276 sec.
99.000%         6.288 sec.
99.900%         8.210 sec.
99.990%         8.317 sec.



real    0m33.570s
user    0m4.267s
sys     0m2.093s
2333543

I guess it is not merged yet, because it crashes CI? Maybe we should remap SIGSEGV into a different signal i.e. SIGBUS, inside clickhouse, and force gdb to ignore SIGSEGV, that way the CI will not crash, since this patch is really needed. @alexey-milovidov thoughts?

P.S. I've been looking into the original issue, tried lots of things, have lots of details but so far don't understood is it possible to fix it somehow properly.

@azat
Copy link
Copy Markdown
Member

azat commented May 15, 2024

Maybe we should remap SIGSEGV into a different signal i.e. SIGBUS, inside clickhouse, and force gdb to ignore SIGSEGV, that way the CI will not crash, since this patch is really needed.

@alexey-milovidov Or we can simply rebase after #63865, and the problem with CI should be gone.

@azat
Copy link
Copy Markdown
Member

azat commented May 16, 2024

Although maybe mincore (or any quirk for reading invalid addresses) will be still needed, if the original problem was due to wrong DWARF, but will see

@alexey-milovidov
Copy link
Copy Markdown
Member Author

FYI this pull request is primarily for speed - because mincore is extremely slow in some environments.
It was also an attempt to generalize the ways to protect is from crashes, but it no longer seems relevant.

@alexey-milovidov alexey-milovidov merged commit 3c0d35e into master May 21, 2024
@alexey-milovidov alexey-milovidov deleted the libunwind-fix-crash branch May 21, 2024 02:03
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label May 21, 2024
@CheSema
Copy link
Copy Markdown
Member

CheSema commented Jun 13, 2024

This commit breaks the Stateless tests!
Look at this: https://s3.amazonaws.com/clickhouse-test-reports/61047/07e746d5a4ef7f858872a622d4e13b38ac24133a/stateless_tests__aarch64_.html

Sigsev has happen in libunwind, all the program stopped by gdb. Gdb collected the traces and after that program just continued to executing. But for the time while it has been paused, some tests got timeouts.

@alexey-milovidov
Copy link
Copy Markdown
Member Author

#65247

@antonio2368 antonio2368 mentioned this pull request Jul 22, 2024
19 tasks
yokofly added a commit to timeplus-io/proton that referenced this pull request Jan 22, 2026
…d update)

* ClickHouse/ClickHouse#63865 from azat/fix-query-profiler-SIGSEGV

Fix SIGSEGV due to CPU/Real profiler

* ClickHouse/ClickHouse#60468 from ClickHouse/libunwind-fix-crash

Fix crash in libunwind while interpreting debug info

* ClickHouse/ClickHouse#65509 Update libunwind to 18.1.7

* ClickHouse/ClickHouse#66850 from ClickHouse/revert-libunwind-patch

Revert libunwind patch

* ClickHouse/ClickHouse#66977 from ClickHouse/uwo

Apply libunwind fix

* ClickHouse/ClickHouse#68312 from ClickHouse/muslwind

Apply libunwind changes needed for musl

* ClickHouse/ClickHouse#76107 from ClickHouse/owo

Apply libunwind fix for DwarfFDECache

* ClickHouse/ClickHouse#76136 from ClickHouse/revert-76107-owo

Revert "Apply libunwind fix for DwarfFDECache"

* ClickHouse/ClickHouse#76178 from ClickHouse/unw

Apply libunwind fix for DwarfFDECache, attempt 2

* Update libunwind to aec8e58

* ClickHouse/ClickHouse#67152 from ClickHouse/ohno

Uncomment accidentally commented out code in QueryProfiler

* ClickHouse/ClickHouse#64058  aarch64 sigaltstack size fix

* ClickHouse/ClickHouse#58607 Added null guards to avoid potential demangle crashes

---------

Co-authored-by: Nikita Mikhaylov <[email protected]>
Co-authored-by: Alexey Milovidov <[email protected]>
Co-authored-by: Michael Kolupaev <[email protected]>
Co-authored-by: Antonio Andelic <[email protected]>
Co-authored-by: Max Kainov <[email protected]>
yokofly added a commit to timeplus-io/proton that referenced this pull request Jan 23, 2026
…d update) (#11324)

* ClickHouse/ClickHouse#63865 from azat/fix-query-profiler-SIGSEGV

Fix SIGSEGV due to CPU/Real profiler

* ClickHouse/ClickHouse#60468 from ClickHouse/libunwind-fix-crash

Fix crash in libunwind while interpreting debug info

* ClickHouse/ClickHouse#65509 Update libunwind to 18.1.7

* ClickHouse/ClickHouse#66850 from ClickHouse/revert-libunwind-patch

Revert libunwind patch

* ClickHouse/ClickHouse#66977 from ClickHouse/uwo

Apply libunwind fix

* ClickHouse/ClickHouse#68312 from ClickHouse/muslwind

Apply libunwind changes needed for musl

* ClickHouse/ClickHouse#76107 from ClickHouse/owo

Apply libunwind fix for DwarfFDECache

* ClickHouse/ClickHouse#76136 from ClickHouse/revert-76107-owo

Revert "Apply libunwind fix for DwarfFDECache"

* ClickHouse/ClickHouse#76178 from ClickHouse/unw

Apply libunwind fix for DwarfFDECache, attempt 2

* Update libunwind to aec8e58

* ClickHouse/ClickHouse#67152 from ClickHouse/ohno

Uncomment accidentally commented out code in QueryProfiler

* ClickHouse/ClickHouse#64058  aarch64 sigaltstack size fix

* ClickHouse/ClickHouse#58607 Added null guards to avoid potential demangle crashes

---------

Co-authored-by: Nikita Mikhaylov <[email protected]>
Co-authored-by: Alexey Milovidov <[email protected]>
Co-authored-by: Michael Kolupaev <[email protected]>
Co-authored-by: Antonio Andelic <[email protected]>
Co-authored-by: Max Kainov <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-improvement Pull request with some product improvements pr-synced-to-cloud The PR is synced to the cloud repo submodule changed At least one submodule changed in this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

libunwind: getSavedRegister: invalid read from memory

6 participants