
Persistent nukeeper snapshot storage#21425

Merged
alesapin merged 64 commits into master from persistent_nukeeper_snapshot_storage
Mar 9, 2021

Conversation

@alesapin
Member

@alesapin alesapin commented Mar 3, 2021

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

@alesapin
Member Author

alesapin commented Mar 5, 2021

Shutdown bug :(

@alesapin alesapin added the do not test disable testing on pull request label Mar 5, 2021
@alesapin alesapin removed the do not test disable testing on pull request label Mar 5, 2021
@alesapin
Member Author

alesapin commented Mar 5, 2021

Seems like after this query:

SELECT
    [257],
    k,
    [NULL, NULL, 2., 100.0000991821289, NULL, NULL, NULL],
    9223372036854775807,
    [NULL, 1., 0, NULL, NULL, NULL],
    256,
    groupArrayForEach(arr)
FROM
(
    SELECT
        number % NULL AS k,
        [-2, 1024],
        [-1, 1048576],
        range(number) AS arr
    FROM system.numbers
    LIMIT 1023
)
GROUP BY
    arrayJoin([NULL, 100]),
    [1048575, NULL, NULL, NULL, NULL, NULL],
    k
ORDER BY arrayJoin([1048575]) ASC

OOM kills the server.
2021.03.05 18:53:50.435299 [ 268 ] {698bb0eb-77c6-44b8-a3b5-df429a8e0793} MemoryTracker: Current memory usage (total): 37.00 GiB.

@alesapin
Member Author

alesapin commented Mar 6, 2021

Hmmm, looks like a bug... but I don't understand where, and I cannot reproduce it. For some reason, when a follower node is down, the leader tries to send a 0-level (fake) snapshot with each append_entries call. Possibly that's OK, but it definitely shouldn't affect append_entries performance.
Also, for some reason the kazoo client doesn't send any heartbeats for 30 seconds:

2021.03.06 11:33:15.147396 [ 13 ] {} <Information> NuKeeperTCPHandler: Received session ID 1
2021.03.06 11:33:45.334357 [ 13 ] {} <Debug> NuKeeperTCPHandler: Session #1 expired

@alesapin
Member Author

alesapin commented Mar 6, 2021

I've added sleeps, and indeed kazoo just ignores heartbeats:

2021.03.06 16:28:44.745204 [ 13 ] {} <Trace> NuKeeperTCPHandlerFactory: NuKeeper request. Address: 192.168.144.1:35340
2021.03.06 16:28:44.745912 [ 13 ] {} <Information> NuKeeperTCPHandler: Requesting session ID for the new client
2021.03.06 16:28:44.750093 [ 13 ] {} <Information> NuKeeperTCPHandler: Received session ID 1
2021.03.06 16:29:15.164276 [ 13 ] {} <Debug> NuKeeperTCPHandler: Session #1 expired

@alesapin
Member Author

alesapin commented Mar 6, 2021

Wow, it seems our ZooKeeper client sends redundant heartbeats. In ZooKeeper, each operation is considered a heartbeat, so we don't need to send them on a timer if there are other ops in flight (https://github.com/apache/zookeeper/blob/master/zookeeper-server/src/main/java/org/apache/zookeeper/server/ZooKeeperServer.java#L1150). Need to change the NuKeeper server logic!
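The required change can be sketched like this (a minimal illustration, not the actual NuKeeper code; the `Session` class and its method names are hypothetical): any operation on a session should push the expiry deadline forward, so explicit heartbeats only matter while the client is idle.

```python
import time


class Session:
    """Sketch of ZooKeeper-style session expiry tracking.

    Every request (read, write, or heartbeat alike) calls touch(),
    which refreshes the deadline; a session expires only after a full
    timeout with no activity at all.
    """

    def __init__(self, timeout_s, now=time.monotonic):
        self.timeout_s = timeout_s
        self.now = now  # injectable clock, so the logic is testable
        self.deadline = now() + timeout_s

    def touch(self):
        # Called for *every* operation, not only explicit heartbeats.
        self.deadline = self.now() + self.timeout_s

    def is_expired(self):
        return self.now() > self.deadline
```

With this scheme a kazoo client that keeps issuing regular operations never expires, even if it sends no standalone heartbeat packets for 30 seconds.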

@alesapin alesapin marked this pull request as ready for review March 6, 2021 20:15
@alesapin
Member Author

alesapin commented Mar 7, 2021

test_materialize_mysql_database failure is not related.

@alesapin alesapin changed the title [WIP] Persistent nukeeper snapshot storage Persistent nukeeper snapshot storage Mar 7, 2021
@alesapin
Member Author

alesapin commented Mar 7, 2021

Stress test:

2021.03.07 12:53:55.398258 [ 296 ] {} <Debug> Application: Received termination signal.
2021.03.07 12:53:55.398366 [ 296 ] {} <Debug> Application: Waiting for current connections to close.
<...>
2021.03.07 12:54:01.288076 [ 296 ] {} <Information> Application: Closed connections. But 9 remain. Tip: To increase wait time add to config: <shutdown_wait_unfinished>60</shutdown_wait_unfinished>
2021.03.07 12:54:01.288381 [ 296 ] {} <Information> Application: Will shutdown forcefully.

This is a separate problem with hanging queries, but after that, we cannot start the server because of:

2021.03.07 12:56:30.780362 [ 937894 ] {} <Error> Application: Caught exception while loading metadata: Code: 999, e.displayText() = Coordination::Exception: No node, path: /clickhouse/tables/test_01513/sequence_consistency/replicas/1/metadata: Cannot attach table `test_25`.`quorum1` from metadata file /var/lib/clickhouse/metadata/test_25/quorum1.sql from query ATTACH TABLE test_25.quorum1 (`x` UInt32, `y` Date) ENGINE = ReplicatedMergeTree('/clickhouse/tables/test_01513/sequence_consistency', '1') PARTITION BY y ORDER BY x SETTINGS index_granularity = 8192: while loading database `test_25` from path /var/lib/clickhouse/metadata/test_25, Stack trace (when copying this message, always include the lines below):

@alesapin
Member Author

alesapin commented Mar 7, 2021

This replica was created 5 minutes before the shutdown.

2021.03.07 12:50:50.216462 [ 466338 ] {85c2dd98-5b7d-4cee-a65b-36a7a777f362} <Trace> ContextAccess (default): Access granted: CREATE TABLE ON test_25.quorum1
2021.03.07 12:50:50.243636 [ 466338 ] {85c2dd98-5b7d-4cee-a65b-36a7a777f362} <Debug> test_25.quorum1: Loading data parts
2021.03.07 12:50:50.244103 [ 466338 ] {85c2dd98-5b7d-4cee-a65b-36a7a777f362} <Debug> test_25.quorum1: Loaded data parts (0 items)
2021.03.07 12:50:50.245940 [ 466338 ] {85c2dd98-5b7d-4cee-a65b-36a7a777f362} <Debug> test_25.quorum1: This table /clickhouse/tables/test_01513/sequence_consistency is already created, will add new replica
2021.03.07 12:50:50.247979 [ 466338 ] {85c2dd98-5b7d-4cee-a65b-36a7a777f362} <Debug> test_25.quorum1: Creating replica /clickhouse/tables/test_01513/sequence_consistency/replicas/1
2021.03.07 12:50:50.268746 [ 372 ] {} <Debug> test_25.quorum1 (ReplicatedMergeTreeRestartingThread): Activating replica.
2021.03.07 12:50:50.276419 [ 372 ] {} <Information> test_25.quorum1: Replica 3 has log pointer '0', approximate 1 queue lag and 0 queue size
2021.03.07 12:50:50.276577 [ 372 ] {} <Information> test_25.quorum1: Will mimic 3
2021.03.07 12:50:50.281992 [ 372 ] {} <Debug> test_25.quorum1: Queued 0 parts to be fetched
2021.03.07 12:50:50.282109 [ 372 ] {} <Debug> test_25.quorum1: Copied 0 queue entries
2021.03.07 12:50:50.282905 [ 372 ] {} <Debug> test_25.quorum1 (ReplicatedMergeTreeQueue): Loading queue from /clickhouse/tables/test_01513/sequence_consistency/replicas/1/queue
2021.03.07 12:50:50.283877 [ 372 ] {} <Debug> test_25.quorum1 (ReplicatedMergeTreeQueue): Having 0 queue entries to load, 0 entries already loaded.
2021.03.07 12:50:50.284342 [ 372 ] {} <Trace> test_25.quorum1 (ReplicatedMergeTreeQueue): Loaded queue
2021.03.07 12:50:50.292629 [ 383 ] {} <Information> test_25.quorum1: Became leader
2021.03.07 12:50:53.670068 [ 466001 ] {f01dffd2-5ecb-4790-acc9-035c69506d30} <Trace> ContextAccess (default): Access granted: SYSTEM MERGES ON test_25.quorum1
2021.03.07 12:50:56.832574 [ 466001 ] {64c769cc-cd12-4c49-88d2-1ffaf8cbec60} <Trace> ContextAccess (default): Access granted: SYSTEM MERGES ON test_25.quorum1
2021.03.07 12:51:19.932018 [ 466000 ] {b66073f5-b290-41a9-babf-6bb2223d6fc1} <Trace> ContextAccess (default): Access granted: SYSTEM MERGES ON test_25.quorum1
<...>
2021.03.07 12:54:01.288381 [ 296 ] {} <Information> Application: Will shutdown forcefully.

@alesapin
Member Author

alesapin commented Mar 7, 2021

No idea how this is possible. I'll upload the coordination dump after the stress test.

@alesapin
Member Author

alesapin commented Mar 7, 2021

I have some bug with snapshots:

2021.03.07 21:45:48.326377 [ 72262 ] {} <Fatal> BaseDaemon: ########################################
2021.03.07 21:45:48.326433 [ 72262 ] {} <Fatal> BaseDaemon: (version 21.4.1.6189, build id: 9CA33DF56446D8AFAA017C202F16256573DB58C1) (from thread 72184) (no query) Received signal Unknown signal (-3)
2021.03.07 21:45:48.326458 [ 72262 ] {} <Fatal> BaseDaemon: Sanitizer trap.
2021.03.07 21:45:48.326547 [ 72262 ] {} <Fatal> BaseDaemon: Stack trace: 0xdc01d35 0xde2e996 0xdbc60d6 0xdbd18ff 0x1a9fb7b3 0x1a9f87f7 0x1a9f5965 0x1a9f7bb0 0x1a9e703a 0x1a9c7fde 0x1aa241e6 0x19198656 0xdc7d2c3 0x1b4d8aba 0xdc71722 0x1b4eabdb 0xdc6fb3b 0xdbd7c14 0x7f77ff2df0b3 0xdbb5d2e
2021.03.07 21:45:48.344817 [ 72262 ] {} <Fatal> BaseDaemon: 0.1. inlined from ./obj-x86_64-linux-gnu/../src/Common/StackTrace.cpp:298: StackTrace::tryCapture()
2021.03.07 21:45:48.344847 [ 72262 ] {} <Fatal> BaseDaemon: 0. ../src/Common/StackTrace.cpp:259: StackTrace::StackTrace() @ 0xdc01d35 in /usr/bin/clickhouse
2021.03.07 21:45:48.401057 [ 72262 ] {} <Fatal> BaseDaemon: 1. ./obj-x86_64-linux-gnu/../base/daemon/BaseDaemon.cpp:381: sanitizerDeathCallback() @ 0xde2e996 in /usr/bin/clickhouse
2021.03.07 21:45:49.290881 [ 72262 ] {} <Fatal> BaseDaemon: 2. __sanitizer::Die() @ 0xdbc60d6 in /usr/bin/clickhouse
2021.03.07 21:45:50.166361 [ 72262 ] {} <Fatal> BaseDaemon: 3. ? @ 0xdbd18ff in /usr/bin/clickhouse
2021.03.07 21:45:50.203446 [ 72262 ] {} <Fatal> BaseDaemon: 4. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/__hash_table:333: std::__1::__hash_iterator<std::__1::__hash_node<std::__1::__hash_value_type<StringRef, std::__1::__list_iterator<DB::ListNode<DB::NuKeeperStorage::Node>, void*> >, void*>*>::operator->() const @ 0x1a9fb7b3 in /usr/bin/clickhouse
2021.03.07 21:45:50.238919 [ 72262 ] {} <Fatal> BaseDaemon: 5.1. inlined from ./obj-x86_64-linux-gnu/../contrib/libcxx/include/unordered_map:833: std::__1::__hash_map_iterator<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::__hash_value_type<StringRef, std::__1::__list_iterator<DB::ListNode<DB::NuKeeperStorage::Node>, void*> >, void*>*> >::operator->() const
2021.03.07 21:45:50.238957 [ 72262 ] {} <Fatal> BaseDaemon: 5. ../src/Coordination/SnapshotableHashTable.h:124: DB::SnapshotableHashTable<DB::NuKeeperStorage::Node>::updateValue(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::function<void (DB::NuKeeperStorage::Node&)>) @ 0x1a9f87f7 in /usr/bin/clickhouse
2021.03.07 21:45:50.270801 [ 72262 ] {} <Fatal> BaseDaemon: 6.1. inlined from ./obj-x86_64-linux-gnu/../contrib/libcxx/include/functional:2547: ~function
2021.03.07 21:45:50.270825 [ 72262 ] {} <Fatal> BaseDaemon: 6. ../src/Coordination/NuKeeperSnapshotManager.cpp:197: DB::NuKeeperStorageSnapshot::deserialize(DB::NuKeeperStorage&, DB::ReadBuffer&) @ 0x1a9f5965 in /usr/bin/clickhouse
2021.03.07 21:45:50.307055 [ 72262 ] {} <Fatal> BaseDaemon: 7. ./obj-x86_64-linux-gnu/../src/Coordination/NuKeeperSnapshotManager.cpp:0: DB::NuKeeperSnapshotManager::deserializeSnapshotFromBuffer(DB::NuKeeperStorage*, std::__1::shared_ptr<nuraft::buffer>) @ 0x1a9f7bb0 in /usr/bin/clickhouse
2021.03.07 21:45:50.342970 [ 72262 ] {} <Fatal> BaseDaemon: 8. ./obj-x86_64-linux-gnu/../src/Coordination/NuKeeperStateMachine.cpp:0: DB::NuKeeperStateMachine::init() @ 0x1a9e703a in /usr/bin/clickhouse
2021.03.07 21:45:50.377669 [ 72262 ] {} <Fatal> BaseDaemon: 9. ./obj-x86_64-linux-gnu/../src/Coordination/NuKeeperServer.cpp:40: DB::NuKeeperServer::startup() @ 0x1a9c7fde in /usr/bin/clickhouse
2021.03.07 21:45:50.411125 [ 72262 ] {} <Fatal> BaseDaemon: 10. ./obj-x86_64-linux-gnu/../src/Coordination/NuKeeperStorageDispatcher.cpp:140: DB::NuKeeperStorageDispatcher::initialize(Poco::Util::AbstractConfiguration const&) @ 0x1aa241e6 in /usr/bin/clickhouse
2021.03.07 21:45:50.581705 [ 72262 ] {} <Fatal> BaseDaemon: 11.1. inlined from ./obj-x86_64-linux-gnu/../contrib/libcxx/include/__mutex_base:97: ~lock_guard
2021.03.07 21:45:50.581751 [ 72262 ] {} <Fatal> BaseDaemon: 11. ../src/Interpreters/Context.cpp:1605: DB::Context::initializeNuKeeperStorageDispatcher() const @ 0x19198656 in /usr/bin/clickhouse
2021.03.07 21:45:50.675764 [ 72262 ] {} <Fatal> BaseDaemon: 12.1. inlined from ./obj-x86_64-linux-gnu/../contrib/libcxx/include/vector:1517: std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > >::begin()
2021.03.07 21:45:50.675791 [ 72262 ] {} <Fatal> BaseDaemon: 12. ../programs/server/Server.cpp:865: DB::Server::main(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) @ 0xdc7d2c3 in /usr/bin/clickhouse
2021.03.07 21:45:50.694075 [ 72262 ] {} <Fatal> BaseDaemon: 13. ./obj-x86_64-linux-gnu/../contrib/poco/Util/src/Application.cpp:0: Poco::Util::Application::run() @ 0x1b4d8aba in /usr/bin/clickhouse
2021.03.07 21:45:50.807178 [ 72262 ] {} <Fatal> BaseDaemon: 14. ./obj-x86_64-linux-gnu/../programs/server/Server.cpp:343: DB::Server::run() @ 0xdc71722 in /usr/bin/clickhouse
2021.03.07 21:45:50.815062 [ 72262 ] {} <Fatal> BaseDaemon: 15. ./obj-x86_64-linux-gnu/../contrib/poco/Util/src/ServerApplication.cpp:612: Poco::Util::ServerApplication::run(int, char**) @ 0x1b4eabdb in /usr/bin/clickhouse
2021.03.07 21:45:50.927598 [ 72262 ] {} <Fatal> BaseDaemon: 16. ./obj-x86_64-linux-gnu/../programs/server/Server.cpp:0: mainEntryClickHouseServer(int, char**) @ 0xdc6fb3b in /usr/bin/clickhouse
2021.03.07 21:45:50.931362 [ 72262 ] {} <Fatal> BaseDaemon: 17. ./obj-x86_64-linux-gnu/../programs/main.cpp:0: main @ 0xdbd7c14 in /usr/bin/clickhouse
2021.03.07 21:45:50.931402 [ 72262 ] {} <Fatal> BaseDaemon: 18. __libc_start_main @ 0x270b3 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
2021.03.07 21:45:51.804285 [ 72262 ] {} <Fatal> BaseDaemon: 19. _start @ 0xdbb5d2e in /usr/bin/clickhouse
2021.03.07 21:45:52.015648 [ 72262 ] {} <Fatal> BaseDaemon: Calculated checksum of the binary: F0DE70ADD84C5628E61189C36A588595. There is no information about the reference checksum.

A child node exists without a parent.
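The crash above dereferences a missing hash-table entry while deserializing the snapshot. A defensive way to handle this class of corruption can be sketched as follows (an illustration only, not the NuKeeperSnapshotManager code; `parent_path` and `attach_children` are hypothetical helpers): validate that every deserialized node's parent exists before touching the parent's child list.

```python
def parent_path(path: str) -> str:
    # "/a/b/c" -> "/a/b"; the root's parent is the root itself.
    return path.rsplit("/", 1)[0] or "/"


def attach_children(nodes: dict) -> list:
    """Link each node into its parent's child set; collect orphans
    instead of blindly dereferencing a missing parent entry."""
    orphans = []
    for path in nodes:
        if path == "/":
            continue
        parent = parent_path(path)
        if parent not in nodes:
            # child without a parent: the snapshot is corrupted
            orphans.append(path)
            continue
        nodes[parent].setdefault("children", set()).add(path)
    return orphans
```

Reporting the orphaned paths instead of crashing makes the "child node without parent" condition diagnosable from the log rather than from a sanitizer trap.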

@alesapin
Member Author

alesapin commented Mar 8, 2021

Ok, will make one more CI run just to be sure.

@alesapin
Member Author

alesapin commented Mar 9, 2021

test_testkeeper_restore_from_snapshot/test.py::test_recover_from_snapshot is flaky: with TSan the server takes 7 seconds to start, but we wait for only 5.
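The usual fix for this kind of flakiness is to replace the fixed sleep with polling under a generous deadline; a minimal sketch (the `wait_until` helper is hypothetical, not the actual integration-test code):

```python
import time


def wait_until(predicate, timeout_s=30.0, poll_s=0.5,
               now=time.monotonic, sleep=time.sleep):
    """Poll `predicate` until it returns True or `timeout_s` elapses.

    Returns True on success, False on timeout. Unlike a fixed sleep,
    this succeeds quickly on a fast build and still tolerates a slow
    sanitizer build that needs several extra seconds to start.
    """
    deadline = now() + timeout_s
    while now() < deadline:
        if predicate():
            return True
        sleep(poll_s)
    return predicate()  # one last check at the deadline
```

In the test, `predicate` would be something like "the server accepts a four-letter-word command", and `timeout_s` comfortably above the slowest observed TSan startup.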

@alesapin
Member Author

alesapin commented Mar 9, 2021

No related failures. Let's merge and see what will happen.
fuzzer:

$ grep -a 'Fatal' server.log
2021.03.09 12:48:09.071765 [ 82 ] {} <Fatal> Application: Child process was terminated by signal 6.

And that is all....

@alesapin alesapin merged commit bcd0732 into master Mar 9, 2021
@alesapin alesapin deleted the persistent_nukeeper_snapshot_storage branch March 9, 2021 15:37

Labels

pr-not-for-changelog: This PR should not be mentioned in the changelog
submodule changed: At least one submodule changed in this PR
