Skip to content

Do not silently ignore write errors#19451

Merged
alexey-milovidov merged 4 commits intoClickHouse:masterfrom
azat:safe-writes
Feb 11, 2021
Merged

Do not silently ignore write errors#19451
alexey-milovidov merged 4 commits intoClickHouse:masterfrom
azat:safe-writes

Conversation

@azat
Copy link
Copy Markdown
Member

@azat azat commented Jan 22, 2021

If this PR will the culprit of abnormal termination it is better to add proper next() call in the caller, instead of reverting the whole PR

NOTE: some connections resets in some badly covered bits may terminate the server (unlikely, but possible I guess), but you cannot find all such places manually anyway

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Do not silently ignore write errors

Detailed description / Documentation draft:
Since this hides real problems, since destructor does final flush and if it fails, then data will be lost.

One of such examples if MEMORY_LIMIT_EXCEEDED exception.
Here is an example, that leads to empty block in the distributed batch (#19449):

    2021.01.21 12:43:18.619739 [ 46468 ] {7bd60d75-ebcb-45d2-874d-260df9a4ddac} <Error> virtual DB::CompressedWriteBuffer::~CompressedWriteBuffer(): Code: 241, e.displayText() = DB::Exception: Memory limit (for user) exceeded: would use 332.07 GiB (attempt to allocate chunk of 4355342 bytes), maximum: 256.00 GiB, Stack trace (when copying this message, always include the lines below):

    0. DB::Exception::Exception<>() @ 0x86f7b88 in /usr/bin/clickhouse
    ...
    4. void DB::PODArrayBase<>::resize<>(unsigned long) @ 0xe9e878d in /usr/bin/clickhouse
    5. DB::CompressedWriteBuffer::nextImpl() @ 0xe9f0296 in /usr/bin/clickhouse
    6. DB::CompressedWriteBuffer::~CompressedWriteBuffer() @ 0xe9f0415 in /usr/bin/clickhouse
    7. DB::DistributedBlockOutputStream::writeToShard() @ 0xf6bed4a in /usr/bin/clickhouse

P.S. since it is pretty huge, marked as Improvement, not Bug fix.

@robot-clickhouse robot-clickhouse added the pr-improvement Pull request with some product improvements label Jan 22, 2021
@azat azat marked this pull request as draft January 22, 2021 19:11
@azat azat force-pushed the safe-writes branch 3 times, most recently from ece74d1 to 881e1a1 Compare January 22, 2021 21:36
Copy link
Copy Markdown
Member

@alexey-milovidov alexey-milovidov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alexey-milovidov alexey-milovidov self-assigned this Jan 22, 2021
@azat
Copy link
Copy Markdown
Member Author

azat commented Jan 22, 2021

And AFAICS MergeTree engine is safe

@azat azat force-pushed the safe-writes branch 5 times, most recently from bbe498a to 1e16bd6 Compare January 24, 2021 10:47
@azat
Copy link
Copy Markdown
Member Author

azat commented Jan 25, 2021

First part (do not ignore fatal write errors from destructors) - done

@alexey-milovidov
Copy link
Copy Markdown
Member

Stress test (address) — Killed by signal (in clickhouse-server.log)
Stress test (memory) — Killed by signal (in clickhouse-server.log)
Stress test (undefined) — Killed by signal (in clickhouse-server.log)

Cannot merge yet.

@azat
Copy link
Copy Markdown
Member Author

azat commented Feb 2, 2021

Stress test (address) — Killed by signal (in clickhouse-server.log)
Stress test (memory) — Killed by signal (in clickhouse-server.log)
Stress test (undefined) — Killed by signal (in clickhouse-server.log)

@alexey-milovidov this had been fixed, but what bothers me about this, is that some connections resets in some badly covered bits may terminate the server (unlikely, but possible I guess), but you cannot find all such places manually anyway...

Integration tests (thread) — fail: 1, passed: 1082, error: 0

Addressed here - ac510f9

azat added 4 commits February 5, 2021 01:29
Since this hides real problems, since destructor does final flush and if
it fails, then data will be lost.

One of such examples if MEMORY_LIMIT_EXCEEDED exception, so lock
exceptions from destructors, by using
MemoryTracker::LockExceptionInThread to block these exception, and allow
others (so std::terminate will be called, since this is c++11 with
noexcept for destructors by default).

Here is an example, that leads to empty block in the distributed batch:

    2021.01.21 12:43:18.619739 [ 46468 ] {7bd60d75-ebcb-45d2-874d-260df9a4ddac} <Error> virtual DB::CompressedWriteBuffer::~CompressedWriteBuffer(): Code: 241, e.displayText() = DB::Exception: Memory limit (for user) exceeded: would use 332.07 GiB (attempt to allocate chunk of 4355342 bytes), maximum: 256.00 GiB, Stack trace (when copying this message, always include the lines below):

    0. DB::Exception::Exception<>() @ 0x86f7b88 in /usr/bin/clickhouse
    ...
    4. void DB::PODArrayBase<>::resize<>(unsigned long) @ 0xe9e878d in /usr/bin/clickhouse
    5. DB::CompressedWriteBuffer::nextImpl() @ 0xe9f0296 in /usr/bin/clickhouse
    6. DB::CompressedWriteBuffer::~CompressedWriteBuffer() @ 0xe9f0415 in /usr/bin/clickhouse
    7. DB::DistributedBlockOutputStream::writeToShard() @ 0xf6bed4a in /usr/bin/clickhouse
For TCPHandler it is safe thing todo.

Otherwise *San will report [1]:

    2021.01.24 15:33:40.103996 [ 270 ] {} <Trace> BaseDaemon: Received signal -1
    2021.01.24 15:33:40.110693 [ 270 ] {} <Fatal> BaseDaemon: (version 21.2.1.5789, build id: FF421B087D1E2EAA19FA17B5AB3AE413832744E0) (from thread 48318) Terminate called for uncaught exception:
    2021.01.24 15:33:40.114845 [ 270 ] {} <Trace> BaseDaemon: Received signal 6
    2021.01.24 15:33:40.138738 [ 218027 ] {} <Fatal> BaseDaemon: ########################################
    2021.01.24 15:33:40.138838 [ 218027 ] {} <Fatal> BaseDaemon: (version 21.2.1.5789, build id: FF421B087D1E2EAA19FA17B5AB3AE413832744E0) (from thread 48318) (no query) Received signal Aborted (6)
    2021.01.24 15:33:40.138912 [ 218027 ] {} <Fatal> BaseDaemon:
    2021.01.24 15:33:40.139277 [ 218027 ] {} <Fatal> BaseDaemon: Stack trace: 0x7f185474118b 0x7f1854720859 0xaddc0cc 0x2af9fab8 0x2af9fa04 0xa91758b 0x1e418bb5 0x20725b4f 0x20725d9e 0x266b47a3 0x269772f5 0x26971847 0x7f18548f6609 0x7f185481d293
    2021.01.24 15:33:40.139637 [ 218027 ] {} <Fatal> BaseDaemon: 3. raise @ 0x4618b in /usr/lib/x86_64-linux-gnu/libc-2.31.so
    2021.01.24 15:33:40.140113 [ 218027 ] {} <Fatal> BaseDaemon: 4. abort @ 0x25859 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
    2021.01.24 15:33:40.144121 [ 218027 ] {} <Fatal> BaseDaemon: 5. ./obj-x86_64-linux-gnu/../base/daemon/BaseDaemon.cpp:0: terminate_handler() @ 0xaddc0cc in /usr/bin/clickhouse
    2021.01.24 15:33:40.151208 [ 218027 ] {} <Fatal> BaseDaemon: 6. ./obj-x86_64-linux-gnu/../contrib/libcxxabi/src/cxa_handlers.cpp:61: std::__terminate(void (*)()) @ 0x2af9fab8 in /usr/bin/clickhouse
    2021.01.24 15:33:40.153085 [ 218027 ] {} <Fatal> BaseDaemon: 7. ./obj-x86_64-linux-gnu/../contrib/libcxxabi/src/cxa_handlers.cpp:0: std::terminate() @ 0x2af9fa04 in /usr/bin/clickhouse
    2021.01.24 15:33:40.155209 [ 218027 ] {} <Fatal> BaseDaemon: 8. ? @ 0xa91758b in /usr/bin/clickhouse
    2021.01.24 15:33:40.156621 [ 218027 ] {} <Fatal> BaseDaemon: 9. ./obj-x86_64-linux-gnu/../src/IO/WriteBufferFromPocoSocket.cpp:0: DB::WriteBufferFromPocoSocket::~WriteBufferFromPocoSocket() @ 0x1e418bb5 in /usr/bin/clickhouse
    2021.01.24 15:33:40.161041 [ 218027 ] {} <Fatal> BaseDaemon: 10. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/memory:2518: DB::TCPHandler::~TCPHandler() @ 0x20725b4f in /usr/bin/clickhouse
    2021.01.24 15:33:40.164557 [ 218027 ] {} <Fatal> BaseDaemon: 11. ./obj-x86_64-linux-gnu/../src/Server/TCPHandler.h:101: DB::TCPHandler::~TCPHandler() @ 0x20725d9e in /usr/bin/clickhouse
    2021.01.24 15:33:40.165921 [ 218027 ] {} <Fatal> BaseDaemon: 12. ./obj-x86_64-linux-gnu/../contrib/poco/Foundation/include/Poco/AtomicCounter.h:314: Poco::Net::TCPServerDispatcher::run() @ 0x266b47a3 in /usr/bin/clickhouse
    2021.01.24 15:33:40.167347 [ 218027 ] {} <Fatal> BaseDaemon: 13. ./obj-x86_64-linux-gnu/../contrib/poco/Foundation/src/ThreadPool.cpp:0: Poco::PooledThread::run() @ 0x269772f5 in /usr/bin/clickhouse
    2021.01.24 15:33:40.169401 [ 218027 ] {} <Fatal> BaseDaemon: 14. ./obj-x86_64-linux-gnu/../contrib/poco/Foundation/src/Thread_POSIX.cpp:0: Poco::ThreadImpl::runnableEntry(void*) @ 0x26971847 in /usr/bin/clickhouse
    2021.01.24 15:33:40.169498 [ 218027 ] {} <Fatal> BaseDaemon: 15. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
    2021.01.24 15:33:40.169566 [ 218027 ] {} <Fatal> BaseDaemon: 16. __clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
    2021.01.24 15:33:41.027601 [ 218027 ] {} <Fatal> BaseDaemon: Calculated checksum of the binary: 63D7491B39260494BA0D785E1860B427. There is no information about the reference checksum.

  [1]: https://clickhouse-test-reports.s3.yandex.net/19451/1e16bd6f337985a82fbdf4eded695dc6e663af58/stress_test_(address).html#fail1

v2: Fix catching errors in WriteBufferFromPocoSocket destructor
@azat
Copy link
Copy Markdown
Member Author

azat commented Feb 4, 2021

Rebased (to fix conflicts and include #19886)

@azat azat marked this pull request as ready for review February 10, 2021 18:36
@azat
Copy link
Copy Markdown
Member Author

azat commented Feb 11, 2021

AST fuzzer (MSan) — Received signal 11

#20344

AST fuzzer (UBSan) — ../contrib/libcxx/include/vector:687:31: runtime error: applying non-zero offset 18446744073709551568 to null pointer

#20345

@alexey-milovidov alexey-milovidov merged commit dc3ffd3 into ClickHouse:master Feb 11, 2021
@amosbird
Copy link
Copy Markdown
Collaborator

My local build fails to start up after this PR.

2021.02.13 12:52:22.392174 [ 1499211 ] {} <Fatal> BaseDaemon: (version 21.3.1.1, build id: 4E2F3CF7C0210B16AB70B7FB87CFA04F66570550) (from thread 1499154) Terminate called for uncaught exception:
Code: 75, e.displayText() = DB::ErrnoException: Cannot write to file /proc/self/oom_score_adj, errno: 13, strerror: Permission denied, Stack trace (when copying this message, always include the lines below):

0. /tmp/gentoo/home/amos/git/ClickHouse/src/contrib/libcxx/include/exception:133: std::exception::capture() @ 0xe5e98 in /tmp/gentoo/home/amos/git/ClickHouse/build-dbg/programs/local/libclickhouse-local-libd.so
1. /tmp/gentoo/home/amos/git/ClickHouse/src/contrib/libcxx/include/exception:111: std::exception::exception() @ 0x1cec32 in /tmp/gentoo/home/amos/git/ClickHouse/build-dbg/contrib/poco-cmake/Foundation/lib_poco_foundationd.so
2. /tmp/gentoo/home/amos/git/ClickHouse/src/contrib/poco/Foundation/src/Exception.cpp:28: Poco::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 0x1c3963 in /tmp/gentoo/home/amos/git/ClickHouse/build-dbg/contrib/poco-cmake/Foundation/lib_poco_founda
2021.02.13 12:52:22.392562 [ 1499211 ] {} <Trace> BaseDaemon: Received signal 6
2021.02.13 12:52:22.393367 [ 1499224 ] {} <Fatal> BaseDaemon: ########################################
2021.02.13 12:52:22.393744 [ 1499224 ] {} <Fatal> BaseDaemon: (version 21.3.1.1, build id: 4E2F3CF7C0210B16AB70B7FB87CFA04F66570550) (from thread 1499154) (no query) Received signal Aborted (6)
2021.02.13 12:52:22.393952 [ 1499224 ] {} <Fatal> BaseDaemon:
2021.02.13 12:52:22.394146 [ 1499224 ] {} <Fatal> BaseDaemon: Stack trace: 0x7f545e585cf1 0x7f545e56f537 0x7f54603b1842 0x7f545e771a52 0x7f545e771992 0x7f545fdbc37f 0x7f545ff02a04 0x7f54603b0689 0x7f54603ae96a 0x7f5460692a94 0x7f545f3db31d 0x7f5460692a40 0x7f545f40f83c 0x7f5460691156 0x285fef 0x7f545e570cca 0x285c4a
2021.02.13 12:52:22.394482 [ 1499224 ] {} <Fatal> BaseDaemon: 4. gsignal @ 0x38cf1 in /tmp/gentoo/lib64/libc-2.31.so
2021.02.13 12:52:22.394680 [ 1499224 ] {} <Fatal> BaseDaemon: 5. abort @ 0x22537 in /tmp/gentoo/lib64/libc-2.31.so
2021.02.13 12:52:22.709132 [ 1499224 ] {} <Fatal> BaseDaemon: 6. /tmp/gentoo/home/amos/git/ClickHouse/src/base/daemon/BaseDaemon.cpp:433: terminate_handler() @ 0x98842 in /tmp/gentoo/home/amos/git/ClickHouse/build-dbg/base/daemon/libdaemond.so
2021.02.13 12:52:22.713202 [ 1499224 ] {} <Fatal> BaseDaemon: 7. /tmp/gentoo/home/amos/git/ClickHouse/src/contrib/libcxxabi/src/cxa_handlers.cpp:59: std::__terminate(void (*)()) @ 0x55a52 in /tmp/gentoo/home/amos/git/ClickHouse/build-dbg/contrib/libcxxabi-cmake/libcxxabid.so
2021.02.13 12:52:22.717054 [ 1499224 ] {} <Fatal> BaseDaemon: 8. /tmp/gentoo/home/amos/git/ClickHouse/src/contrib/libcxxabi/src/cxa_handlers.cpp:89: std::terminate() @ 0x55992 in /tmp/gentoo/home/amos/git/ClickHouse/build-dbg/contrib/libcxxabi-cmake/libcxxabid.so
2021.02.13 12:52:22.748560 [ 1499224 ] {} <Fatal> BaseDaemon: 9. ? @ 0x22037f in /tmp/gentoo/home/amos/git/ClickHouse/build-dbg/src/libclickhouse_common_iod.so
2021.02.13 12:52:22.782618 [ 1499224 ] {} <Fatal> BaseDaemon: 10. /tmp/gentoo/home/amos/git/ClickHouse/src/src/IO/WriteBufferFromFile.cpp:87: ? @ 0x366a04 in /tmp/gentoo/home/amos/git/ClickHouse/build-dbg/src/libclickhouse_common_iod.so
2021.02.13 12:52:23.153415 [ 1499224 ] {} <Fatal> BaseDaemon: 11. /tmp/gentoo/home/amos/git/ClickHouse/src/base/daemon/BaseDaemon.cpp:565:   <----

https://github.com/ClickHouse/ClickHouse/blob/master/base/daemon/BaseDaemon.cpp#L565

@azat
Copy link
Copy Markdown
Member Author

azat commented Feb 13, 2021

My local build fails to start up after this PR.

@amosbird #20465

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-improvement Pull request with some product improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants