Add tool for [Zoo]Keeper benchmark by alesapin · Pull Request #23038 · ClickHouse/ClickHouse

alesapin · 2021-04-13T12:01:17Z

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

Build/Testing/Packaging Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add simple tool for benchmarking [Zoo]Keeper.

alesapin · 2021-04-13T12:02:52Z

The code is quite dirty and almost copy-pasted from clickhouse-benchmark.

alesapin · 2021-04-13T12:36:00Z

Single node test, cloud SSD, 30 threads, each thread has own session:

Create requests with 32 bytes payload:

ZooKeeper + fsync:

read requests 0, write requests 376262, Write RPS: 25295.5, Write MiB/s: 3.69092. 

Write sampler:
0%              0.000366999 sec.
10%             0.0010325 sec.
20%             0.00106087 sec.
30%             0.00108079 sec.
40%             0.00109854 sec.
50%             0.00111696 sec.
60%             0.00113803 sec.
70%             0.00116273 sec.
80%             0.00119683 sec.
90%             0.00126652 sec.
95%             0.00142181 sec.
99%             0.00175927 sec.
99.9%           0.0134262 sec.
99.99%          0.0481346 sec.
Requests executed: 399608.

Keeper + fsync:

read requests 0, write requests 46329, Write RPS: 2573.95, Write MiB/s: 0.375571. 

Write sampler:
0%              0.00088921 sec.
10%             0.0105441 sec.
20%             0.0108688 sec.
30%             0.0111468 sec.
40%             0.0113682 sec.
50%             0.0115706 sec.
60%             0.0117689 sec.
70%             0.0119973 sec.
80%             0.0123088 sec.
90%             0.0127822 sec.
95%             0.0132028 sec.
99%             0.0154352 sec.
99.9%           0.0221107 sec.
99.99%          0.0273679 sec.
Requests executed: 49032.

ZooKeeper + no fsync:

read requests 0, write requests 851945, Write RPS: 41469.5, Write MiB/s: 6.05083. 

Write sampler:
0%              0.000157542 sec.
10%             0.000570049 sec.
20%             0.000620664 sec.
30%             0.000639222 sec.
40%             0.000650876 sec.
50%             0.0006617 sec.
60%             0.000673258 sec.
70%             0.000688155 sec.
80%             0.000713563 sec.
90%             0.000776199 sec.
95%             0.000837894 sec.
99%             0.00106101 sec.
99.9%           0.00979552 sec.
99.99%          0.0982639 sec.
Requests executed: 889751.

Keeper + no fsync:

read requests 0, write requests 723333, Write RPS: 21345.3, Write MiB/s: 3.11455. 

Write sampler:
0%              0.000135434 sec.
10%             0.000196433 sec.
20%             0.000216412 sec.
30%             0.000251833 sec.
40%             0.00111033 sec.
50%             0.00128273 sec.
60%             0.00139264 sec.
70%             0.00144637 sec.
80%             0.00197698 sec.
90%             0.00262189 sec.
95%             0.00308307 sec.
99%             0.0049305 sec.
99.9%           0.00741379 sec.
99.99%          0.509156 sec.
Requests executed: 749106.

Get requests with 1KB payload:

ZooKeeper:

read requests 2395135, write requests 0, Read RPS: 65409.3, Read MiB/s: 70.8627
Read sampler:
0%              0.000112252 sec.
10%             0.00040733 sec.
20%             0.000419205 sec.
30%             0.000428029 sec.
40%             0.00043577 sec.
50%             0.000443233 sec.
60%             0.000451186 sec.
70%             0.000461187 sec.
80%             0.000477324 sec.
90%             0.000513454 sec.
95%             0.000552133 sec.
99%             0.000681763 sec.
99.9%           0.0015579 sec.
99.99%          0.00431767 sec.
Requests executed: 2458896.

Keeper:

read requests 2342011, write requests 0, Read RPS: 72108.8, Read MiB/s: 78.1208                                                                                                                                                                
Read sampler:                                                                                                                                                                                                                                  
0%              7.902e-05 sec.                                                                                                                                                                                                                 
10%             0.000124344 sec.                                                                                                                                                                                                               
20%             0.000140124 sec.                                                                                                                                                                                                               
30%             0.000158635 sec.                                                                                                                                                                                                               
40%             0.000193901 sec.                                                                                                                                                                                                               
50%             0.000389395 sec.                                                                                                                                                                                                               
60%             0.000454391 sec.                                                                                                                                                                                                               
70%             0.000508788 sec.                                                                                                                                                                                                               
80%             0.00060724 sec.                                                                                                                                                                                                                
90%             0.000833423 sec.                                                                                                                                                                                                               
95%             0.0010403 sec.                                                                                                                                                                                                                 
99%             0.00151391 sec.                                                                                                                                                                                                                
99.9%           0.0022005 sec.                                                                                                                                                                                                                 
99.99%          0.0028232 sec.                                                                                                                                                                                                                 
Requests executed: 2411798.

List requests with 1k nodes:

ZooKeeper:

read requests 446690, write requests 0, Read RPS: 9514.02, Read MiB/s: 46.31
Read sampler:
0%              0.000362101 sec.
10%             0.00301438 sec.
20%             0.00304879 sec.
30%             0.00307286 sec.
40%             0.0030926 sec.
50%             0.0031106 sec.
60%             0.00312926 sec.
70%             0.00315037 sec.
80%             0.00317789 sec.
90%             0.00322692 sec.
95%             0.00330075 sec.
99%             0.00430313 sec.
99.9%           0.00654918 sec.
99.99%          0.00983435 sec.
Requests executed: 456491.

Keeper:

read requests 626747, write requests 0, Read RPS: 31543.2, Read MiB/s: 153.538
Read sampler:
0%              0.000165994 sec.
10%             0.000228541 sec.
20%             0.000248667 sec.
30%             0.00027718 sec.
40%             0.000497763 sec.
50%             0.000922916 sec.
60%             0.000972462 sec.
70%             0.00104285 sec.
80%             0.00158383 sec.
90%             0.00185349 sec.
95%             0.0024493 sec.
99%             0.00376689 sec.
99.9%           0.00562348 sec.
99.99%          0.00716784 sec.
Requests executed: 658655.

alesapin · 2021-04-13T12:44:16Z

So, for a single-node case:

Writes with fsync more than 10 times slower.
Writes without fsync two times slower.
Reads have comparable performance.
List performs 3 times better (improved directly in this PR).

alesapin · 2021-04-15T11:41:06Z

Two days and I'm able to run bench via Jepsen on three nodes.

Create requests with 32 bytes payload:

ZooKeeper + fsync

read requests 0, write requests 25802, Write RPS: 1430.95, Write MiB/s: 0.208793. 

Write sampler:
0%              0.014457 sec.
10%             0.0163485 sec.
20%             0.0169457 sec.
30%             0.0177653 sec.
40%             0.0191583 sec.
50%             0.0209016 sec.
60%             0.0224936 sec.
70%             0.0231979 sec.
80%             0.0240418 sec.
90%             0.0256752 sec.
95%             0.0276367 sec.
99%             0.0327407 sec.
99.9%           0.0397141 sec.
99.99%          0.041323 sec.
Requests executed: 27407.

Keeper + fsync:

read requests 0, write requests 509, Write RPS: 109.2, Write MiB/s: 0.0159335.                                                                                                                                                                 
                                                                                                                                                                                                                                               
Write sampler:                                                                                                                                                                                                                                 
0%              0.0239873 sec.                                                                                                                                                                                                                 
10%             0.0573686 sec.                                                                                                                                                                                                                 
20%             0.0931894 sec.                                                                                                                                                                                                                 
30%             0.130548 sec.                                                                                                                                                                                                                  
40%             0.163599 sec.                                                                                                                                                                                                                  
50%             0.200042 sec.                                                                                                                                                                                                                  
60%             0.255879 sec.                                                                                                                                                                                                                  
70%             0.332659 sec.                                                                                                                                                                                                                  
80%             0.466651 sec.                                                                                                                                                                                                                  
90%             0.649365 sec.                                                                                                                                                                                                                  
95%             0.709061 sec.                                                                                                                                                                                                                  
99%             0.831386 sec.                                                                                                                                                                                                                  
99.9%           0.950798 sec.                                                                                                                                                                                                                  
99.99%          0.951092 sec.                                                                                                                                                                                                                  
Requests executed: 606.

ZooKeeper no fsync:

read requests 0, write requests 47223, Write RPS: 1967.69, Write MiB/s: 0.287109. 

Write sampler:
0%              0.0136945 sec.
10%             0.0142327 sec.
20%             0.0146209 sec.
30%             0.0149374 sec.
40%             0.0150878 sec.
50%             0.0152056 sec.
60%             0.0153296 sec.
70%             0.0155184 sec.
80%             0.0159007 sec.
90%             0.0161236 sec.
95%             0.0162863 sec.
99%             0.0166911 sec.
99.9%           0.0246752 sec.
99.99%          0.0374032 sec.
Requests executed: 49153.

Keeper no fsync:

read requests 0, write requests 1414, Write RPS: 162.589, Write MiB/s: 0.0237237. 

Write sampler:
0%              0.0153122 sec.
10%             0.0195866 sec.
20%             0.020578 sec.
30%             0.0214698 sec.
40%             0.0238601 sec.
50%             0.0300539 sec.
60%             0.0371975 sec.
70%             0.308709 sec.
80%             0.525133 sec.
90%             0.550743 sec.
95%             0.562532 sec.
99%             0.589284 sec.
99.9%           1.03105 sec.
99.99%          1.05615 sec.
Requests executed: 1539.

Get requests with 1KB payload:

ZooKeeper:

read requests 89890, write requests 0, Read RPS: 4996.3, Read MiB/s: 5.39857                                                                                                                                                                   
Read sampler:                                                                                                                                                                                                                                  
0%              0.000224726 sec.                                                                                                                                                                                                               
10%             0.000304072 sec.                                                                                                                                                                                                               
20%             0.000360596 sec.                                                                                                                                                                                                               
30%             0.000561115 sec.                                                                                                                                                                                                               
40%             0.00698616 sec.                                                                                                                                                                                                                
50%             0.00739363 sec.
60%             0.00777195 sec.
70%             0.00963526 sec.
80%             0.0102356 sec.
90%             0.0104873 sec.
95%             0.010618 sec.
99%             0.0108182 sec.
99.9%           0.0120942 sec.
99.99%          0.0253129 sec.
Requests executed: 94957.

Keeper:

read requests 92966, write requests 0, Read RPS: 4896.13, Read MiB/s: 5.29033
Read sampler:
0%              0.00024687 sec.
10%             0.0003466 sec.
20%             0.0004636 sec.
30%             0.000715148 sec.
40%             0.00707749 sec.
50%             0.00731223 sec.
60%             0.00780329 sec.
70%             0.0096735 sec.
80%             0.0101066 sec.
90%             0.0105194 sec.
95%             0.0107303 sec.
99%             0.0114548 sec.
99.9%           0.0157331 sec.
99.99%          0.0168914 sec.
Requests executed: 97439.

List requests with 1k nodes:

ZooKeeper:

read requests 149195, write requests 0, Read RPS: 4814.23, Read MiB/s: 23.4335
Read sampler:
0%              0.000350123 sec.
10%             0.000464042 sec.
20%             0.000534659 sec.
30%             0.00072968 sec.
40%             0.00722674 sec.
50%             0.00756885 sec.
60%             0.00796418 sec.
70%             0.00983504 sec.
80%             0.0104267 sec.
90%             0.0107893 sec.
95%             0.0109151 sec.
99%             0.0111642 sec.
99.9%           0.0154421 sec.
99.99%          0.025005 sec.
Requests executed: 154062.

Keeper:

read requests 100000, write requests 0, Read RPS: 4515.85, Read MiB/s: 21.9812
Read sampler:
0%              0.000328294 sec.
10%             0.00047172 sec.
20%             0.000589997 sec.
30%             0.000825524 sec.
40%             0.00734495 sec.
50%             0.00790825 sec.
60%             0.00811202 sec.
70%             0.00998036 sec.
80%             0.0105075 sec.
90%             0.0108722 sec.
95%             0.0110669 sec.
99%             0.0235671 sec.
99.9%           0.0286841 sec.
99.99%          0.0299338 sec.

Get requests with 32B payload:

ZooKeeper

read requests 95316, write requests 0, Read RPS: 5018.7, Read MiB/s: 0.674855
Read sampler:
0%              0.000205932 sec.
10%             0.000300756 sec.
20%             0.000352693 sec.
30%             0.000530309 sec.
40%             0.00702609 sec.
50%             0.00746019 sec.
60%             0.00777183 sec.
70%             0.00959763 sec.
80%             0.00991684 sec.
90%             0.0104829 sec.
95%             0.0106375 sec.
99%             0.0108431 sec.
99.9%           0.0116626 sec.
99.99%          0.0222432 sec.
Requests executed: 100000.

Keeper:

read requests 97176, write requests 0, Read RPS: 4628.93, Read MiB/s: 0.622443
Read sampler:
0%              0.000233582 sec.
10%             0.000342032 sec.
20%             0.000450943 sec.
30%             0.000675288 sec.
40%             0.00725061 sec.
50%             0.00778561 sec.
60%             0.00798117 sec.
70%             0.00977069 sec.
80%             0.0103331 sec.
90%             0.0106636 sec.
95%             0.0108493 sec.
99%             0.0225459 sec.
99.9%           0.0293692 sec.
99.99%          0.030455 sec.
Requests executed: 100000.

alesapin · 2021-04-16T13:51:35Z

the code became so optimal, that starts losing data:

INFO [2021-04-16 16:49:22,261] jepsen results - jepsen.store Wrote /home/alesap/code/cpp/ClickHouse/tests/jepsen.clickhouse-keeper/store/clickhouse-keeper-quorum=false-set-bridge-partitioner/20210416T164635.000+0300/results.edn
INFO [2021-04-16 16:49:22,305] jepsen test runner - jepsen.core {:perf {:latency-graph {:valid? true},
        :rate-graph {:valid? true},
        :valid? true},
 :workload {:ok-count 440,
            :valid? false,
            :lost-count 1,
            :lost "#{0}",
            :acknowledged-count 440,
            :recovered "#{70}",
            :ok "#{2..3 6..7 13..14 16 21 26..27 30 34..35 40 43 46 55 57 61 63 66 70 75 88..89 91 94 97 102..103 106 117 119 121 126 130..131 136 139 141 147 149 157 159 163 167..168 170..171 179 183 191 194 197 203 206 209 213 220 223 226 229 233 236 241 244 247 252 255 261 265 268..269 274..275 280 283 285 289 291 293 297 300..302 304 307 310 312 318 323 328 330 333 336 338 340 342 344 354..355 361 365 369 376 381 388 392 394 400 404..405 407 413 416 419 424..425 429 434..435 440 442 445 449 451 455..456 460 465 470 472..473 479 482 485 488 491 496 501 505 507 510 513 516 523..524 527..528 531 533 538 540 546 548 551 555 562 566 568 574 578 583 585 588 590 594..595 601..602 607 612 616 620..621 623 626..627 629 631 640 642 647..648 653 656 658 661..662 666 669 675 680..681 684 687 690 696 699 703 706 710 713 716 718 721 724 728 730 735 739 741 745 748..749 753 758 760 763 768 771 773 776 779 785 789 791 793 797 801 804 811..812 816 820 822 826 829 831 834 838 840 846 852 856 863..864 868 870 872 876 879 883 887 891 895 897 900 906 911 914 918 920 922 925 929 932 934 940 942 951 954 957..958 962 966 970 972 976 980 984 987 989 994 999 1002 1008 1012 1014 1018 1020 1023 1025 1028 1031 1033 1041 1045 1047 1049 1052 1054 1058 1061 1064..1065 1067 1074 1077 1081..1082 1087 1090 1093 1097..1098 1102 1105..1106 1109 1112 1116..1117 1119..1120 1125..1126 1132 1135 1138 1142 1145 1150 1152 1154 1159 1164 1167 1172..1173 1175 1179 1182 1185..1186 1192 1196 1199..1200 1202 1207 1210 1213 1215 1218 1221 1224 1228 1232..1233 1236 1238 1242 1245 1249 1252 1257 1261..1262 1266 1268 1273 1275 1277 1279 1284 1287 1289 1291 1295 1297 1300 1304 1309..1310 1314..1315 1319 1323 1325 1329 1333 1337 1340 1345 1350 1353 1356 1360 1364 1370 1373 1375 1378 1382 1386 1390 1392 1395 1397 1403 1406 1409 1413 1418 1424 1427 1429 1433 1435 1438 1441 1444 1449 1451 1454 1456 1458 1464 1466 1472 1475 1477 1483 1489 1492 1495 1500 1503 1506}",
            :attempt-count 1507,
            :unexpected "#{}",
            :unexpected-count 0,
            :recovered-count 1},
 :valid? false}


Analysis invalid! (ﾉಥ益ಥ）ﾉ ┻━┻

alesapin · 2021-04-16T14:02:44Z

Really, it's not possible to write such systems without jepsen.

alesapin · 2021-04-16T19:19:11Z

src/Coordination/KeeperServer.cpp

    params.auto_forwarding_req_timeout_ = coordination_settings->operation_timeout_ms.totalMilliseconds() * 2;

-    params.return_method_ = nuraft::raft_params::blocking;
+    params.return_method_ = nuraft::raft_params::async_handler;


Main change. In this mode, NuRaft doesn't return the result from append_entries. But we haven't used it already for normal requests. Only session-id request relied on append_entries result. To avoid this I've added internal ZooKeeperSessionIDRequest and ZooKeeperSessionIDResponse.
And now session-id requests processed like all other types of requests.

alesapin · 2021-04-16T19:21:56Z

src/Coordination/KeeperLogStore.h


    uint64_t size() const;

+    void end_of_append_batch(uint64_t start_index, uint64_t count) override;


This function called after each append batch.

alesapin · 2021-04-16T19:25:07Z

src/Coordination/KeeperStorageDispatcher.cpp

+    /// If we get some errors, than send them to clients
+    if (!result->get_accepted() || result->get_result_code() == nuraft::cmd_result_code::TIMEOUT)
+        addErrorResponses(std::move(requests_for_sessions), Coordination::Error::ZOPERATIONTIMEOUT);
+    else if (result->get_result_code() != nuraft::cmd_result_code::OK)


Actually, I'm not sure about this. We will return the error to the client, but this request can be processed later in the background and we will return two responses (error and normal stale response) with the same XID. It shouldn't lead to any bugs but can be inconvenient behavior.

alesapin · 2021-04-16T19:26:32Z

tests/jepsen.clickhouse-keeper/src/jepsen/clickhouse_keeper/db.clj

-           (info node "Collecting traces")
-           (collect-traces test node))
-         (info node "Pid files doesn't exists"))
+       ;(if (cu/exists? pid-file-path)


Need to make it optional

alesapin · 2021-04-17T13:01:49Z

Wow, nothing is broken.

alesapin · 2021-04-17T13:51:18Z

Three nodes:

Create requests with 32 bytes payload (30 threads):

Keeper + fsync:

read requests 0, write requests 10443, Write RPS: 613.018, Write MiB/s: 0.0894468. 

Write sampler:
0%              0.0136349 sec.
10%             0.031579 sec.
20%             0.0368288 sec.
30%             0.0417666 sec.
40%             0.0445854 sec.
50%             0.0477322 sec.
60%             0.0522781 sec.
70%             0.0562037 sec.
80%             0.0604629 sec.
90%             0.0672504 sec.
95%             0.0724094 sec.
99%             0.0854521 sec.
99.9%           0.101955 sec.
99.99%          0.109814 sec.
Requests executed: 10863.

ZooKeeper + fsync:

read requests 0, write requests 36848, Write RPS: 1599.3, Write MiB/s: 0.233357. 

Write sampler:
0%              0.0141493 sec.
10%             0.0156919 sec.
20%             0.0160489 sec.
30%             0.0163825 sec.
40%             0.0168012 sec.
50%             0.017558 sec.
60%             0.018532 sec.
70%             0.0198072 sec.
80%             0.0219909 sec.
90%             0.0237886 sec.
95%             0.0251934 sec.
99%             0.0272491 sec.
99.9%           0.0316109 sec.
99.99%          0.0353354 sec.
Requests executed: 38592.

========================================================
Keeper + no fsync

read requests 0, write requests 16501, Write RPS: 969.541, Write MiB/s: 0.141468. 

Write sampler:
0%              0.0083458 sec.
10%             0.0201698 sec.
20%             0.0227277 sec.
30%             0.027644 sec.
40%             0.0299196 sec.
50%             0.0306739 sec.
60%             0.032674 sec.
70%             0.0364063 sec.
80%             0.0393274 sec.
90%             0.0416355 sec.
95%             0.0422553 sec.
99%             0.0465217 sec.
99.9%           0.0522805 sec.
99.99%          0.052803 sec.
Requests executed: 17461.

ZooKeeper + no fsync:

read requests 0, write requests 43790, Write RPS: 1990.21, Write MiB/s: 0.290396. 

Write sampler:
0%              0.0134352 sec.
10%             0.0142786 sec.
20%             0.0146583 sec.
30%             0.0148276 sec.
40%             0.0149362 sec.
50%             0.0150342 sec.
60%             0.0151419 sec.
70%             0.0152859 sec.
80%             0.0155621 sec.
90%             0.0158273 sec.
95%             0.0159994 sec.
99%             0.0164191 sec.
99.9%           0.0238023 sec.
99.99%          0.0304281 sec.
Requests executed: 45778.

Create requests with 1KB bytes payload (30 threads):

Keeper + fsync:

read requests 0, write requests 14695, Write RPS: 609.84, Write MiB/s: 0.088983. 

Write sampler:
0%              0.0210272 sec.
10%             0.0313804 sec.
20%             0.0348638 sec.
30%             0.0382743 sec.
40%             0.0415507 sec.
50%             0.0448579 sec.
60%             0.0486332 sec.
70%             0.0538877 sec.
80%             0.0607805 sec.
90%             0.0728562 sec.
95%             0.0844131 sec.
99%             0.109676 sec.
99.9%           0.140522 sec.
99.99%          0.171038 sec.
Requests executed: 15334.

ZooKeeper + fsync:

read requests 0, write requests 21151, Write RPS: 1242.68, Write MiB/s: 1.35696. 

Write sampler:
0%              0.0148176 sec.
10%             0.018527 sec.
20%             0.0197312 sec.
30%             0.0210381 sec.
40%             0.0225738 sec.
50%             0.0240889 sec.
60%             0.0251214 sec.
70%             0.0261969 sec.
80%             0.027617 sec.
90%             0.02981 sec.
95%             0.0323914 sec.
99%             0.0382187 sec.
99.9%           0.047426 sec.
99.99%          0.0493344 sec.
Requests executed: 22203.

========================================================
Keeper + no fsync:

read requests 0, write requests 20626, Write RPS: 1375.03, Write MiB/s: 1.50148. 

Write sampler:
0%              0.0121681 sec.
10%             0.0163623 sec.
20%             0.0174695 sec.
30%             0.0198829 sec.
40%             0.0210281 sec.
50%             0.0226018 sec.
60%             0.0236479 sec.
70%             0.02416 sec.
80%             0.0246148 sec.
90%             0.0256651 sec.
95%             0.027279 sec.
99%             0.0301931 sec.
99.9%           0.0351075 sec.
99.99%          0.044221 sec.
Requests executed: 22012.

ZooKeeper + no fsync:

read requests 0, write requests 36540, Write RPS: 2029.77, Write MiB/s: 2.21642. 

Write sampler:
0%              0.0133006 sec.
10%             0.0138801 sec.
20%             0.0142744 sec.
30%             0.0144637 sec.
40%             0.0145677 sec.
50%             0.0146666 sec.
60%             0.0147718 sec.
70%             0.0149063 sec.
80%             0.0151614 sec.
90%             0.015505 sec.
95%             0.0157144 sec.
99%             0.0168099 sec.
99.9%           0.0331699 sec.
99.99%          0.0511356 sec.
Requests executed: 38594.

Get requests with 32B payload:

Keeper:

read requests 82594, write requests 0, Read RPS: 4861.98, Read MiB/s: 0.653781
Read sampler:
0%              0.000233267 sec.
10%             0.000338061 sec.
20%             0.000442006 sec.
30%             0.000647172 sec.
40%             0.00706847 sec.
50%             0.00761671 sec.
60%             0.0078359 sec.
70%             0.00974856 sec.
80%             0.0104101 sec.
90%             0.0107367 sec.
95%             0.0108721 sec.
99%             0.0113342 sec.
99.9%           0.0192329 sec.
99.99%          0.0205943 sec.
Requests executed: 87190.

ZooKeeper:

read requests 85167, write requests 0, Read RPS: 5013.23, Read MiB/s: 0.674119                                                                                                                                       
Read sampler:
0%              0.000216212 sec.
10%             0.000300252 sec.
20%             0.000350153 sec.
30%             0.00051137 sec.
40%             0.00696042 sec.
50%             0.00718053 sec.
60%             0.00770792 sec.
70%             0.00979858 sec.
80%             0.01004 sec.
90%             0.0104524 sec.
95%             0.0106109 sec.
99%             0.0108332 sec.
99.9%           0.0119236 sec.
99.99%          0.0204775 sec.
Requests executed: 90205.

List requests with 1k nodes:

Keeper:

read requests 81861, write requests 0, Read RPS: 4817.56, Read MiB/s: 23.4497                                                                                                                                        
Read sampler:                                                                                                                                                                                                        
0%              0.000329031 sec.                                                                                                                                                                                     
10%             0.000464794 sec.                                                                                                                                                                                     
20%             0.000581207 sec.                                                                                                                                                                                     
30%             0.000795608 sec.                                                                                                                                                                                     
40%             0.00731821 sec.                                                                                                                                                                                      
50%             0.00784553 sec.                                                                                                                                                                                      
60%             0.00802399 sec.                                                                                                                                                                                      
70%             0.00983363 sec.                                                                                                                                                                                      
80%             0.0100954 sec.                                                                                                                                                                                       
90%             0.0106058 sec.                                                                                                                                                                                       
95%             0.0108672 sec.                                                                                                                                                                                       
99%             0.0111464 sec.                                                                                                                                                                                       
99.9%           0.0128877 sec.                                                                                                                                                                                       
99.99%          0.0148837 sec.                                                                                                                                                                                       
Requests executed: 86553.

ZooKeeper:

read requests 81329, write requests 0, Read RPS: 4787.56, Read MiB/s: 23.3037                                                                                                                                        
Read sampler:                                                                                                                                                                                                        
0%              0.00034377 sec.                                                                                                                                                                                      
10%             0.000467976 sec.                                                                                                                                                                                     
20%             0.000551229 sec.                                                                                                                                                                                     
30%             0.000752686 sec.                                                                                                                                                                                     
40%             0.00725132 sec.                                                                                                                                                                                      
50%             0.00757969 sec.                                                                                                                                                                                      
60%             0.00796923 sec.                                                                                                                                                                                      
70%             0.00998262 sec.                                                                                                                                                                                      
80%             0.0105258 sec.                                                                                                                                                                                       
90%             0.0108046 sec.                                                                                                                                                                                       
95%             0.0109385 sec.                                                                                                                                                                                       
99%             0.0112254 sec.                                                                                                                                                                                       
99.9%           0.0181218 sec.                                                                                                                                                                                       
99.99%          0.0284515 sec.                                                                                                                                                                                       
Requests executed: 86184.

alesapin · 2021-04-17T13:51:54Z

So ZooKeeper still 2-3 times better for write requests.

BTW results with three nodes look quite strange to me (all requests too slow).

alesapin · 2021-04-17T20:00:37Z

01305_replica_create_drop_zookeeper failures.... it can be related to changes...

alesapin · 2021-04-17T21:04:50Z

01305_replica_create_drop_zookeeper failures.... it can be related to changes...

Got it, we have a bug in the atomicity of drop/create for ReplicatedMergeTree.

alesapin · 2021-04-18T18:46:03Z

I'll merge master one more time

alesapin · 2021-04-19T07:47:25Z

No related failures. I'll merge this PR and will continue performance improvements in the next one.

zhanglistar · 2021-11-01T04:15:16Z

So ZooKeeper still 2-3 times better for write requests.
Do you guys see why?

zhanglistar · 2021-11-01T10:23:05Z

So ZooKeeper still 2-3 times better for write requests.

BTW results with three nodes look quite strange to me (all requests too slow).

Need to note that in zookeeper, there is only to sync data to disk not including meta, but in keeper, the system call fsync syncs both the data and meta to disk.
See zookeeper source file: commit()@FileTxnLog.java

zhanglistar · 2021-11-09T01:18:45Z

@alesapin I have tested in my local env. to see that setting <compress_logs>false</compress_logs> and using fsyncdata() insteadof fsync(), the writing performance is comparable to zookeeper.

alexey-milovidov · 2021-11-09T01:25:46Z

fdatasync makes sense but we are appending to the file, so it's not very safe.

How compression affects write performance needs more investigation. It should not...

alexey-milovidov · 2021-11-09T01:28:02Z

fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled. For example, changes to st_atime or st_mtime (respec‐
tively, time of last access and time of last modification; see inode(7)) do not require flushing because they are not necessary for a subsequent data read to be handled correctly. On the other hand, a change to the file size
(st_size, as made by say ftruncate(2)), would require a metadata flush.

Should be alright 🚀
Why don't we use it everywhere?

zhanglistar · 2021-11-09T01:49:15Z

fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled. For example, changes to st_atime or st_mtime (respec‐
tively, time of last access and time of last modification; see inode(7)) do not require flushing because they are not necessary for a subsequent data read to be handled correctly. On the other hand, a change to the file size
(st_size, as made by say ftruncate(2)), would require a metadata flush.

Should be alright 🚀 Why don't we use it everywhere?

Yes you are right, should use everywhere.

zhanglistar · 2021-11-09T01:56:06Z

fdatasync makes sense but we are appending to the file, so it's not very safe.

How compression affects write performance needs more investigation. It should not...

From the perf top, I see the zstd compress function is on the top, so I try to close the compression and it is better.
The logic maybe like this, compression needs more cpu and time, and the thread is holding lock, then the competition for lock is greater, it should do the least thing in a lock.

zhanglistar · 2021-11-09T04:27:17Z

Another finding, <compress_snapshots_with_zstd_format>false</compress_snapshots_with_zstd_format>
using the default lz4 compressing config is better than zstd(3).

zhanglistar · 2021-11-09T06:23:15Z

Totally, the optimization is as following:

using fdatasync() instead of fsync
do not compress log
using lz4(3) instead of zstd(3)
increase batch of raft request to 1k which equals to batch size in zookeeper.
Then, the writing performance is comparable with zookeeper using one process.

Complete config is:

<clickhouse>
<keeper_server>
    <tcp_port>12183</tcp_port>
    <server_id>1</server_id>
    <log_storage_path>/data3/clickhouse/coordination/log</log_storage_path>
    <snapshot_storage_path>/data4/clickhouse/coordination/snapshots</snapshot_storage_path>

    <coordination_settings>
        <operation_timeout_ms>10000</operation_timeout_ms>
        <session_timeout_ms>30000</session_timeout_ms>
        <raft_logs_level>warning</raft_logs_level>
	<compress_logs>false</compress_logs>
	<max_requests_batch_size>1000</max_requests_batch_size>
	<compress_snapshots_with_zstd_format>false</compress_snapshots_with_zstd_format>
    </coordination_settings>

    <raft_configuration>
        <server>
            <id>1</id>
            <hostname>zoo1</hostname>
            <port>9444</port>
        </server>
    </raft_configuration>
</keeper_server>
</clickhouse>

alexey-milovidov · 2021-11-09T23:52:16Z

#31181

alesapin added 4 commits April 12, 2021 11:10

Some code

fea0ec5

Better

05959d4

More fixes

13d3e59

Something working

b115b8a

robot-clickhouse added the pr-build Pull request with build/testing/packaging improvement label Apr 13, 2021

alesapin marked this pull request as draft April 13, 2021 12:01

Less garbage

6079284

Some missed files

8653975

alesapin mentioned this pull request Apr 13, 2021

Add NuRaft to TestKeeper #15090

Closed

alesapin added the jepsen-test Need to test this PR with jepsen tests label Apr 13, 2021

alesapin added 6 commits April 13, 2021 16:26

Fix some stupid things

9a59c12

Integrating into jepsen

a893cd3

Teaching jepsen to run external programs

c506e32

Upload binary via scp

e9e9a7d

Merge branch 'master' into keeper_bench_mark

50103ff

Now jepsen knows how to run zookeeper and perform same tests

e5ac97a

alesapin added 3 commits April 15, 2021 14:46

ZooKeeper files

977ad74

Fix style

2857712

More optimal

f168bfa

alesapin added 5 commits April 16, 2021 16:53

Fix obvious bug

e724952

Followup fix

aecab0a

Followup fix

539f6cd

Followup fix

7f69910

More checks

7ed32dd

alesapin commented Apr 16, 2021

View reviewed changes

alesapin added 2 commits April 16, 2021 23:07

Fix unit tests with new API

1b4e3ea

Merge branch 'master' into keeper_bench_mark

9fb3784

Fix tidy

ce848ff

Merge branch 'master' into keeper_bench_mark

2035fb3

alesapin marked this pull request as ready for review April 18, 2021 15:30

Merge branch 'master' into keeper_bench_mark

912db13

alesapin merged commit 14800f0 into master Apr 19, 2021

alesapin deleted the keeper_bench_mark branch April 19, 2021 07:47

Algunenano mentioned this pull request Apr 19, 2021

Link keeper-bench to clickhouse_common_zookeeper #23302

Merged

This was referenced Nov 11, 2021

Add intel oneTBB library and improve clickhouse-keeper performance #31302

Closed

Change all fsync to fdatasync #31229

Merged

alexey-milovidov mentioned this pull request Nov 24, 2021

Roadmap 2021 (discussion) #17623

Closed


		uint64_t size() const;

		void end_of_append_batch(uint64_t start_index, uint64_t count) override;

Conversation

alesapin commented Apr 13, 2021

Uh oh!

alesapin commented Apr 13, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alesapin commented Apr 13, 2021

Create requests with 32 bytes payload:

Get requests with 1KB payload:

List requests with 1k nodes:

Uh oh!

alesapin commented Apr 13, 2021

Uh oh!

alesapin commented Apr 15, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Create requests with 32 bytes payload:

Get requests with 1KB payload:

List requests with 1k nodes:

Get requests with 32B payload:

Uh oh!

alesapin commented Apr 16, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alesapin commented Apr 16, 2021

Uh oh!

alesapin Apr 16, 2021

Choose a reason for hiding this comment

Uh oh!

alesapin Apr 16, 2021

Choose a reason for hiding this comment

Uh oh!

alesapin Apr 16, 2021

Choose a reason for hiding this comment

Uh oh!

alesapin Apr 16, 2021

Choose a reason for hiding this comment

Uh oh!

alesapin commented Apr 17, 2021

Uh oh!

alesapin commented Apr 17, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Create requests with 32 bytes payload (30 threads):

Create requests with 1KB bytes payload (30 threads):

Get requests with 32B payload:

List requests with 1k nodes:

Uh oh!

alesapin commented Apr 17, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alesapin commented Apr 17, 2021

Uh oh!

alesapin commented Apr 17, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alesapin commented Apr 18, 2021

Uh oh!

alesapin commented Apr 19, 2021

Uh oh!

zhanglistar commented Nov 1, 2021

Uh oh!

zhanglistar commented Nov 1, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zhanglistar commented Nov 9, 2021

Uh oh!

alexey-milovidov commented Nov 9, 2021

Uh oh!

alexey-milovidov commented Nov 9, 2021

Uh oh!

zhanglistar commented Nov 9, 2021

Uh oh!

zhanglistar commented Nov 9, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zhanglistar commented Nov 9, 2021

Uh oh!

zhanglistar commented Nov 9, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexey-milovidov commented Nov 9, 2021

alesapin commented Apr 13, 2021 •

edited

Loading

alesapin commented Apr 15, 2021 •

edited

Loading

alesapin commented Apr 16, 2021 •

edited

Loading

alesapin commented Apr 17, 2021 •

edited

Loading

alesapin commented Apr 17, 2021 •

edited

Loading

alesapin commented Apr 17, 2021 •

edited

Loading

zhanglistar commented Nov 1, 2021 •

edited

Loading

zhanglistar commented Nov 9, 2021 •

edited

Loading

zhanglistar commented Nov 9, 2021 •

edited

Loading