Conversation
The code is quite dirty and almost copy-pasted from …
Single-node test, cloud SSD, 30 threads, each thread with its own session.

Create requests with 32-byte payload:
- ZooKeeper + fsync: …
- Keeper + fsync: …
- ZooKeeper + no fsync: …
- Keeper + no fsync: …

Get requests with 1 KB payload:
- ZooKeeper: …
- Keeper: …

List requests with 1k nodes:
- ZooKeeper: …
- Keeper: …
So, for the single-node case: …
Two days later, and I'm able to run the benchmark via Jepsen on three nodes.

Create requests with 32-byte payload:
- ZooKeeper + fsync: …
- Keeper + fsync: …
- ZooKeeper + no fsync: …
- Keeper + no fsync: …

Get requests with 1 KB payload:
- ZooKeeper: …
- Keeper: …

List requests with 1k nodes:
- ZooKeeper: …
- Keeper: …

Get requests with 32 B payload:
- ZooKeeper: …
- Keeper: …
The code became so optimized that it started losing data: …
Really, it's not possible to write such systems without Jepsen.
    params.auto_forwarding_req_timeout_ = coordination_settings->operation_timeout_ms.totalMilliseconds() * 2;
    - params.return_method_ = nuraft::raft_params::blocking;
    + params.return_method_ = nuraft::raft_params::async_handler;
Main change. In this mode, NuRaft doesn't return the result from append_entries. But we weren't using that result for normal requests anyway; only the session-id request relied on it. To avoid this, I've added internal ZooKeeperSessionIDRequest and ZooKeeperSessionIDResponse, and now session-id requests are processed like all other request types.
    uint64_t size() const;
    void end_of_append_batch(uint64_t start_index, uint64_t count) override;
This function is called after each append batch.
    /// If we get some errors, then send them to clients
    if (!result->get_accepted() || result->get_result_code() == nuraft::cmd_result_code::TIMEOUT)
        addErrorResponses(std::move(requests_for_sessions), Coordination::Error::ZOPERATIONTIMEOUT);
    else if (result->get_result_code() != nuraft::cmd_result_code::OK)
Actually, I'm not sure about this. We will return the error to the client, but the request can still be processed later in the background, and then we would return two responses (the error and a normal stale response) with the same XID. It shouldn't lead to any bugs, but it can be inconvenient behavior.
    (info node "Collecting traces")
    (collect-traces test node))
    (info node "Pid files don't exist"))
    ;(if (cu/exists? pid-file-path)
Need to make this optional.
Wow, nothing is broken.
Three nodes:

Create requests with 32-byte payload (30 threads):
- Keeper + fsync: …
- ZooKeeper + fsync: …
- ZooKeeper + no fsync: …

Create requests with 1 KB payload (30 threads):
- Keeper + fsync: …
- ZooKeeper + fsync: …
- ZooKeeper + no fsync: …

Get requests with 32 B payload:
- Keeper: …
- ZooKeeper: …

List requests with 1k nodes:
- Keeper: …
- ZooKeeper: …
So ZooKeeper is still 2-3 times better for write requests. BTW, the three-node results look quite strange to me (all requests are too slow).
Got it, we have a bug in the atomicity of drop/create for ReplicatedMergeTree.
I'll merge master one more time.
No related failures. I'll merge this PR and will continue performance improvements in the next one.
Note that ZooKeeper only syncs file data to disk, not metadata, while in Keeper the fsync system call syncs both data and metadata.
@alesapin I have tested in my local env: with <compress_logs>false</compress_logs> and fdatasync() instead of fsync(), the write performance is comparable to ZooKeeper.
How compression affects write performance needs more investigation. It should not...
Should be alright 🚀
Yes, you are right, we should use it everywhere.
In perf top, I see the zstd compress function at the top, so I tried disabling compression, and it is better.
Another finding: <compress_snapshots_with_zstd_format>false</compress_snapshots_with_zstd_format>
In total, the optimizations are as follows: …
The complete config is: …
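As an illustration only (not the author's complete config, and the exact nesting under coordination_settings is an assumption), the two compression settings mentioned in this thread might look like this in the Keeper server config:

```xml
<!-- Illustrative fragment: only the two settings discussed above. -->
<keeper_server>
    <coordination_settings>
        <compress_logs>false</compress_logs>
        <compress_snapshots_with_zstd_format>false</compress_snapshots_with_zstd_format>
    </coordination_settings>
</keeper_server>
```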
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add simple tool for benchmarking [Zoo]Keeper.