
Jepsen for nukeeper #21677

Merged
alesapin merged 48 commits into master from jepsen_for_nukeeper
Mar 27, 2021
Conversation

@alesapin
Member

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Build/Testing/Packaging Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add jepsen tests for NuKeeper.

sorry for my clojure skillz.....

Add some java magic

Allow to connect with old session id

More angry nemesis and fixes

Angry

Fix style

Split to files

Better wrappers

Better structure

Add set test and split to separate files (I think something broken now)

Better

Missed files
@robot-clickhouse robot-clickhouse added the pr-build Pull request with build/testing/packaging improvement label Mar 12, 2021
@alesapin alesapin added the do not test disable testing on pull request label Mar 12, 2021
@alesapin alesapin marked this pull request as draft March 12, 2021 19:09
:write (try
(do (zk-set conn zk-k v)
(assoc op :type :ok))
(catch Exception _ (assoc op :type :info, :error :connect-error)))
Member Author

@alesapin Mar 12, 2021

Funny, but an :info error means "we don't know whether the operation happened or not", while :error strictly means that the operation didn't happen.
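This distinction matters for the checker: an operation whose outcome is unknown must stay in the set of possibly-completed writes. A minimal Python sketch of the same classification logic (hypothetical helper names, not the actual Jepsen implementation, which is in Clojure):

```python
# Hypothetical sketch of Jepsen-style outcome classification:
#   "ok"   -> the write definitely happened
#   "fail" -> the write definitely did not happen
#   "info" -> we cannot tell (e.g. the connection dropped mid-request)

class ConnectionLost(Exception):
    """Stand-in for a network error raised mid-request."""

def classify_write(do_write):
    """Run a write and return an outcome tag a linearizability checker can trust."""
    try:
        do_write()
        return "ok"
    except ConnectionLost:
        # The request may have reached the server before the link died,
        # so we must NOT report "fail" here -- the write might still be visible.
        return "info"
    except ValueError:
        # A definite rejection (e.g. a bad version) means the write did not apply.
        return "fail"
```

Reporting a connection error as a definite failure would let the checker wrongly flag a legal history as non-linearizable.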

@@ -0,0 +1,13 @@
(defproject jepsen.nukeeper "0.1.0-SNAPSHOT"
:injections [(.. System (setProperty "zookeeper.request.timeout" "10000"))]
Member Author

This is the only way to specify an operation timeout for the ZooKeeper client.

@alesapin
Member Author

alesapin commented Mar 17, 2021

Current workloads:

  • CAS register workload -- read/write/compare-and-swap on a single node. Jepsen validates that the history was linearizable. The heaviest test.
  • Set workload on a single node -- store a string representation of a Set in a single node. We try to add new elements: read the node + deserialize the set + add a new number + serialize the set back to a string + write to the node with a version check. Validates that all successfully added elements exist in the set and nothing was lost.
  • Unique-IDs workload through sequential nodes -- generate unique numbers by creating sequential nodes. Validates that all successfully returned numbers are unique.
  • Counter workload through multiple sequential nodes -- increment the counter by N and read the current counter value. Implemented via a multi-transaction that creates N sequential nodes, plus getChildren to get the current counter value.
  • Queue workload -- the most complex one. Jepsen generates ordered string values, and we enqueue them by creating nodes under "/path". Jepsen also generates dequeue ops: we list all nodes under "/", get the version of "/", and delete the first node in a multi-transaction: check the "/" version + set an empty value on "/" to increment its version + delete the first node. In the last operation, we drain the queue by deleting all existing nodes, and Jepsen checks that nothing was lost. This workload found a bug with snapshot restore.
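The Set workload above is essentially a compare-and-swap retry loop over a single versioned node. A rough Python sketch against an in-memory stand-in for the store (class and function names are made up for illustration; the real test talks to NuKeeper through the ZooKeeper client in Clojure):

```python
class BadVersion(Exception):
    """Raised when a version-checked write loses a race with a concurrent writer."""

class VersionedNode:
    """Toy stand-in for a single ZooKeeper-style node with a version counter."""
    def __init__(self, value=""):
        self.value, self.version = value, 0

    def read(self):
        return self.value, self.version

    def write(self, value, expected_version):
        if expected_version != self.version:
            raise BadVersion  # a concurrent writer won; the caller must retry
        self.value, self.version = value, self.version + 1

def add_to_set(node, element):
    """read -> deserialize -> add -> serialize -> version-checked write, with retry."""
    while True:
        raw, version = node.read()
        current = set(raw.split(",")) - {""}
        current.add(str(element))
        try:
            node.write(",".join(sorted(current)), version)
            return
        except BadVersion:
            continue  # someone else updated the node; retry from a fresh read
```

The version check is what makes lost updates detectable: if an acknowledged `add_to_set` is later missing from the final set, the store (not the client) dropped it.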

Current nemeses:

  • Single network partition + sleep + network restore
  • Single node kill -9 + sleep 5 + server start
  • Single node kill -SIGSTOP + sleep 5 + SIGCONT
  • Single node corrupt last log file + restart server
  • Single node corrupt last snapshot + restart server
  • Single node corrupt both last log and last snapshot + restart
  • Single node remove all coordination data + restart
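The corruption nemeses boil down to flipping bytes at the tail of the newest log or snapshot file before restarting the server. A hedged sketch of that idea (the directory layout and helper name are illustrative, not the actual nemesis code):

```python
import os

def corrupt_last_file(directory, nbytes=10):
    """Flip the last `nbytes` of the most recently modified file in `directory`.

    Returns the path of the corrupted file, or None if the directory is empty.
    """
    files = [os.path.join(directory, f) for f in os.listdir(directory)]
    files = [f for f in files if os.path.isfile(f)]
    if not files:
        return None
    target = max(files, key=os.path.getmtime)  # newest log/snapshot file
    size = os.path.getsize(target)
    with open(target, "r+b") as fh:
        fh.seek(max(0, size - nbytes))
        tail = fh.read()
        fh.seek(max(0, size - nbytes))
        fh.write(bytes(b ^ 0xFF for b in tail))  # invert the tail bytes in place
    return target
```

Corrupting only the tail mimics a torn write during a crash: the server should detect the damage on restart and recover from the remaining intact prefix (or from a snapshot) rather than crash or silently serve bad data.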

What to add:

  • One more workload with a multi-transaction (maybe also Set, but with an exists + create node transaction)
  • Try a more sophisticated network-failure nemesis
  • All-nodes killer and all-nodes SIGSTOP
  • --Ability to run with docker-- (don't want to do it)
  • --Docs-- (moved to the next todo list)
  • Read checker.clj to find more interesting workloads (something to validate the getChildren request and ephemeral nodes)
  • Get rid of the tons of messages from the Java ZooKeeper client (log4j configuration).


SnapshotMetadataPtr NuKeeperStorageSnapshot::deserialize(NuKeeperStorage & storage, ReadBuffer & in)
{
storage.clearData();
Member Author

Caught by Jepsen. Need to fix the interface.

@alesapin
Member Author

alesapin commented Mar 18, 2021

The queue test is awesome. It found a bug with snapshot apply, and something more interesting (not yet investigated):

INFO [2021-03-18 23:50:16,269] jepsen test runner - jepsen.core {:perf {:latency-graph {:valid? true},
        :rate-graph {:valid? true},
        :valid? true},
 :workload {:ok-count 696,
            :duplicated-count 0,
            :valid? false,
            :lost-count 2,
            :lost #{"1102" "148"},
            :acknowledged-count 698,
            :recovered #{},
            :attempt-count 718,
            :unexpected #{},
            :unexpected-count 0,
            :recovered-count 0, 
            :duplicated #{}},
 :valid? false}

Analysis invalid! (ノಥ益ಥ)ノ ┻━┻ 

@alesapin
Member Author

Wow, something new

INFO [2021-03-23 17:36:27,930] jepsen test runner - jepsen.core {:perf {:latency-graph {:valid? true},
        :rate-graph {:valid? true},
        :valid? true},
 :workload {:ok-count 156,
            :valid? false,
            :lost-count 3,
            :lost "#{231 233 242}",
            :acknowledged-count 159,
            :recovered "#{}",
            :ok "#{0..1 4 8 12 14 18..20 23..24 30 32 42 45..46 48 55 60 67 72 77 80 82 84 86..87 89 97..98 101 106 109 111 118 122..124 126 128 130 132 134 136 138 142 146 148..149 151 153..154 156 159 163..164 167 171 173 176 178..179 182 185..186 188 191 193 196 198 201..205 209..214 216..217 219 222..223 227 229 287 305 308 345 358 362 369 375 378 386 396 405 417 425 434 445 454 462 470 480 491 498 505 516 526 534 542 552 563 571 578 587 598 607 615 670..671 674 685 704 718 734 749 765 785 807 821 841 858 874 887 907 929 943 963 979 1000 1024..1030 1032 1035 1037 1039}",
            :attempt-count 1041,
            :unexpected "#{}",
            :unexpected-count 0,
            :recovered-count 0},
 :valid? false}


Analysis invalid! (ノಥ益ಥ)ノ ┻━┻

@alesapin
Member Author

2021.03.23 18:35:24.119499 [ 30503 ] {} <Fatal> Application: Child process was terminated by signal 11.
2021.03.23 18:35:32.056719 [ 30743 ] {} <Fatal> BaseDaemon: ########################################
2021.03.23 18:35:32.056885 [ 30743 ] {} <Fatal> BaseDaemon: (version 21.4.1.6321, build id: 275B1BB4F3D101935B90D191AC51FFE2E670361F) (from thread 30740) (no query) Received signal Segmentation fault (11)
2021.03.23 18:35:32.057053 [ 30743 ] {} <Fatal> BaseDaemon: Address: 0x20 Access: read. Address not mapped to object.
2021.03.23 18:35:32.057242 [ 30743 ] {} <Fatal> BaseDaemon: Stack trace: 0x138e6f06 0x138fe3b0 0x138daf46 0x138dee13 0x160ad2dc 0x160ac7f0 0x160abc7b 0x160a997a 0x8c53eed 0x7eff18bea609 0x7eff18b11293
2021.03.23 18:35:32.426708 [ 30743 ] {} <Fatal> BaseDaemon: 5. ./obj-x86_64-linux-gnu/../src/Coordination/SnapshotableHashTable.h:0: DB::SnapshotableHashTable<DB::NuKeeperStorage::Node>::updateValue(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::function<void (DB::NuKeeperStorage::Node&)>) @ 0x138e6f06 in /usr/bin/clickhouse
2021.03.23 18:35:32.848127 [ 30743 ] {} <Fatal> BaseDaemon: 6.1. inlined from ./obj-x86_64-linux-gnu/../contrib/libcxx/include/functional:2191: ~__policy_func
2021.03.23 18:35:32.848282 [ 30743 ] {} <Fatal> BaseDaemon: 6.2. inlined from ../contrib/libcxx/include/functional:2547: ~function
2021.03.23 18:35:32.848368 [ 30743 ] {} <Fatal> BaseDaemon: 6. ../src/Coordination/NuKeeperStorage.cpp:644: DB::NuKeeperStorage::processRequest(std::__1::shared_ptr<Coordination::ZooKeeperRequest> const&, long, std::__1::optional<long>) @ 0x138fe3b0 in /usr/bin/clickhouse
2021.03.23 18:35:33.194269 [ 30743 ] {} <Fatal> BaseDaemon: 7. ./obj-x86_64-linux-gnu/../src/Coordination/NuKeeperStateMachine.cpp:0: DB::NuKeeperStateMachine::commit(unsigned long, nuraft::buffer&) @ 0x138daf46 in /usr/bin/clickhouse
2021.03.23 18:35:33.553673 [ 30743 ] {} <Fatal> BaseDaemon: 8. ./obj-x86_64-linux-gnu/../contrib/NuRaft/include/libnuraft/state_machine.hxx:0: nuraft::state_machine::commit_ext(nuraft::state_machine::ext_op_params const&) @ 0x138dee13 in /usr/bin/clickhouse
2021.03.23 18:35:33.833943 [ 30743 ] {} <Fatal> BaseDaemon: 9.1. inlined from ./obj-x86_64-linux-gnu/../contrib/libcxx/include/type_traits:3935: std::__1::enable_if<(is_move_constructible<nuraft::buffer*>::value) && (is_move_assignable<nuraft::buffer*>::value), void>::type std::__1::swap<nuraft::buffer*>(nuraft::buffer*&, nuraft::buffer*&)
2021.03.23 18:35:33.834155 [ 30743 ] {} <Fatal> BaseDaemon: 9.2. inlined from ../contrib/libcxx/include/memory:3299: std::__1::shared_ptr<nuraft::buffer>::swap(std::__1::shared_ptr<nuraft::buffer>&)
2021.03.23 18:35:33.834291 [ 30743 ] {} <Fatal> BaseDaemon: 9.3. inlined from ../contrib/libcxx/include/memory:3243: std::__1::shared_ptr<nuraft::buffer>::operator=(std::__1::shared_ptr<nuraft::buffer>&&)
2021.03.23 18:35:33.834377 [ 30743 ] {} <Fatal> BaseDaemon: 9. ../contrib/NuRaft/src/handle_commit.cxx:283: nuraft::raft_server::commit_app_log(unsigned long, std::__1::shared_ptr<nuraft::log_entry>&, bool) @ 0x160ad2dc in /usr/bin/clickhouse
2021.03.23 18:35:34.104006 [ 30743 ] {} <Fatal> BaseDaemon: 10. ./obj-x86_64-linux-gnu/../contrib/NuRaft/src/handle_commit.cxx:214: nuraft::raft_server::commit_in_bg_exec(unsigned long) @ 0x160ac7f0 in /usr/bin/clickhouse
2021.03.23 18:35:34.367291 [ 30743 ] {} <Fatal> BaseDaemon: 11. ./obj-x86_64-linux-gnu/../contrib/NuRaft/src/handle_commit.cxx:0: nuraft::raft_server::commit_in_bg() @ 0x160abc7b in /usr/bin/clickhouse
2021.03.23 18:35:34.830652 [ 30743 ] {} <Fatal> BaseDaemon: 12.1. inlined from ./obj-x86_64-linux-gnu/../contrib/libcxx/include/memory:1655: std::__1::unique_ptr<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::__bind<void (nuraft::raft_server::*)(), nuraft::raft_server*> >, std::__1::default_delete<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::__bind<void (nuraft::raft_server::*)(), nuraft::raft_server*> > > >::reset(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::__bind<void (nuraft::raft_server::*)(), nuraft::raft_server*> >*)
2021.03.23 18:35:34.830854 [ 30743 ] {} <Fatal> BaseDaemon: 12.2. inlined from ../contrib/libcxx/include/memory:1612: ~unique_ptr
2021.03.23 18:35:34.830946 [ 30743 ] {} <Fatal> BaseDaemon: 12. ../contrib/libcxx/include/thread:293: void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, std::__1::__bind<void (nuraft::raft_server::*)(), nuraft::raft_server*> > >(void*) @ 0x160a997a in /usr/bin/clickhouse
2021.03.23 18:35:44.223535 [ 30743 ] {} <Fatal> BaseDaemon: 13. __tsan_thread_start_func @ 0x8c53eed in /usr/bin/clickhouse
2021.03.23 18:35:44.223743 [ 30743 ] {} <Fatal> BaseDaemon: 14. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
2021.03.23 18:35:44.223919 [ 30743 ] {} <Fatal> BaseDaemon: 15. clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
2021.03.23 18:35:45.454948 [ 30743 ] {} <Fatal> BaseDaemon: Calculated checksum of the binary: 340B233185B8E13144E14D5551125751. There is no information about the reference checksum.
2021.03.23 18:35:53.688012 [ 30677 ] {} <Fatal> Application: Child process was terminated by signal 11.

I've introduced some bug....

@alesapin
Member Author

Fixed

@alesapin
Member Author


This happens when a rollback occurred:

2021.03.23 17:35:31.305508 [ 1274 ] {} <Information> RaftInstance: rollback logs: 517 - 540, commit idx req 515, quick 539, sm 539, num log entries 1, current count 0
2021.03.23 17:35:31.305515 [ 1274 ] {} <Warning> RaftInstance: rollback quick commit index from 539 to 516
2021.03.23 17:35:31.305529 [ 1274 ] {} <Warning> RaftInstance: rollback sm commit index from 539 to 516
2021.03.23 17:35:31.305540 [ 1274 ] {} <Information> RaftInstance: rollback log 540
2021.03.23 17:35:31.305547 [ 1274 ] {} <Information> RaftInstance: rollback log 539
2021.03.23 17:35:31.305553 [ 1274 ] {} <Information> RaftInstance: rollback log 538
2021.03.23 17:35:31.305559 [ 1274 ] {} <Information> RaftInstance: rollback log 537
2021.03.23 17:35:31.305564 [ 1274 ] {} <Information> RaftInstance: rollback log 536
2021.03.23 17:35:31.305571 [ 1274 ] {} <Information> RaftInstance: rollback log 535
2021.03.23 17:35:31.305576 [ 1274 ] {} <Information> RaftInstance: rollback log 534
2021.03.23 17:35:31.305582 [ 1274 ] {} <Information> RaftInstance: rollback log 533
2021.03.23 17:35:31.305587 [ 1274 ] {} <Information> RaftInstance: rollback log 532
2021.03.23 17:35:31.305593 [ 1274 ] {} <Information> RaftInstance: rollback log 531
2021.03.23 17:35:31.305599 [ 1274 ] {} <Information> RaftInstance: rollback log 530
2021.03.23 17:35:31.305604 [ 1274 ] {} <Information> RaftInstance: rollback log 529
2021.03.23 17:35:31.305618 [ 1274 ] {} <Information> RaftInstance: rollback log 528
2021.03.23 17:35:31.305624 [ 1274 ] {} <Information> RaftInstance: rollback log 527
2021.03.23 17:35:31.305630 [ 1274 ] {} <Information> RaftInstance: rollback log 526
2021.03.23 17:35:31.305636 [ 1274 ] {} <Information> RaftInstance: rollback log 525
2021.03.23 17:35:31.305642 [ 1274 ] {} <Information> RaftInstance: rollback log 524
2021.03.23 17:35:31.305648 [ 1274 ] {} <Information> RaftInstance: rollback log 523
2021.03.23 17:35:31.305654 [ 1274 ] {} <Information> RaftInstance: rollback log 522
2021.03.23 17:35:31.305660 [ 1274 ] {} <Information> RaftInstance: rollback log 521
2021.03.23 17:35:31.305666 [ 1274 ] {} <Information> RaftInstance: rollback log 520
2021.03.23 17:35:31.305671 [ 1274 ] {} <Information> RaftInstance: rollback log 519
2021.03.23 17:35:31.305677 [ 1274 ] {} <Information> RaftInstance: rollback log 518
2021.03.23 17:35:31.305682 [ 1274 ] {} <Information> RaftInstance: rollback log 517
2021.03.23 17:35:31.305740 [ 1274 ] {} <Information> RaftInstance: overwrite at 517
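The log above shows entries 517..540 being rolled back and overwritten by a new leader. If the client was already told those writes succeeded, the checker sees them as lost. A toy Python illustration of that failure class (this is not NuRaft code, just a sketch of why acknowledging an entry before it is durably committed can surface as "lost" writes after a rollback):

```python
class ToyLog:
    """Toy replicated log that (incorrectly) acks entries before quorum commit."""
    def __init__(self):
        self.entries = []   # appended entries, not yet guaranteed committed
        self.acked = set()  # entries the client was told succeeded

    def append_and_ack_early(self, entry):
        self.entries.append(entry)
        self.acked.add(entry)  # bug pattern: ack before the quorum commit

    def rollback(self, keep):
        # A new leader overwrites everything past index `keep`.
        self.entries = self.entries[:keep]

    def lost_writes(self):
        """Acknowledged entries that no longer exist in the log."""
        return self.acked - set(self.entries)
```

In a correct implementation only committed entries are acknowledged, so a rollback can only discard entries whose clients received an indeterminate (:info) result, never an :ok.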

@robot-clickhouse robot-clickhouse added the submodule changed At least one submodule changed in this PR. label Mar 25, 2021
@alesapin alesapin marked this pull request as ready for review March 25, 2021 10:24
@alesapin alesapin changed the title [WIP] Jepsen for nukeeper Jepsen for nukeeper Mar 25, 2021
@alesapin
Member Author

Ok, I'm waiting for tests, and then I'll merge this. Several useful bugfixes here.

@alesapin
Member Author

2021.03.25 14:53:51.650180 [ 295 ] {} Application: Child process was terminated by signal 9 (KILL). If it is not done by 'forcestop' command or manually, the possible cause is OOM Killer (see 'dmesg' and look at the '/var/log/kern.log' for the details).

@alesapin
Member Author

All failures are known flaky tests.

@alesapin alesapin merged commit ced6d8e into master Mar 27, 2021
@alesapin alesapin deleted the jepsen_for_nukeeper branch March 27, 2021 07:18