
Speed up parts loading process of MergeTree by metadata cache #32928

Merged
CurtizJ merged 71 commits into ClickHouse:master from bigo-sg:rocksdb_metacache
Mar 29, 2022
Conversation

@taiyang-li
Contributor

@taiyang-li taiyang-li commented Dec 18, 2021

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Speed up the parts loading process of MergeTree to accelerate startup of clickhouse-server.
With this improvement, clickhouse-server startup time decreased from 75 minutes to 20 seconds with 700k MergeTree parts.

Detailed description / Documentation draft:
We use rocksdb to cache the metadata files of MergeTree parts. When clickhouse-server starts up and loads parts in parallel, it first fetches metadata from rocksdb using the file path as the key; if a cache miss happens, it reads the metadata file from IDisk and updates the cache.

The system table merge_tree_meta_cache and the function checkPartMetaCache were implemented to make checking cache consistency convenient.

Thanks to @sundy-li for the initial work.

@robot-clickhouse robot-clickhouse added the pr-performance Pull request with some performance improvements label Dec 18, 2021
@taiyang-li taiyang-li changed the title Speed up parts loading process of MergeTree to accelerate starting up of clickhouse-server Speed up parts loading process of MergeTree by metadata cache Dec 18, 2021
@alesapin alesapin added the can be tested Allows running workflows for external contributors label Dec 18, 2021
@alesapin
Member

@Mergifyio update

@mergify
Contributor

mergify bot commented Dec 18, 2021

update

❌ Pull request can't be updated with latest base branch changes

Details

Mergify needs the author permission to update the base branch of the pull request.
bigo-sg needs to authorize modification on its head branch.
err-code: 4DA30

@CurtizJ CurtizJ self-assigned this Dec 19, 2021
@taiyang-li
Contributor Author

update

❌ Pull request can't be updated with latest base branch changes

Mergify needs the author permission to update the base branch of the pull request. bigo-sg needs to authorize modification on its head branch. err-code: 4DA30

@CurtizJ I have no idea why this happened, do you know why?

@Algunenano
Member

@CurtizJ I have no idea why this happened, do you know why?

You need to enable edits by maintainers (on the right menu of the PR):
[screenshot]

@alesapin
Member

@CurtizJ I have no idea why this happened, do you know why?

You need to enable edits by maintainers (on the right menu of the PR): [screenshot]

Sometimes it can be impossible if you have a very old fork.

@taiyang-li
Contributor Author

@CurtizJ I have no idea why this happened, do you know why?

You need to enable edits by maintainers (on the right menu of the PR): [screenshot]

Sometimes it can be impossible if you have a very old fork.

I can't find this option in the PR. Indeed, I have a very old fork, from 2019.

@azat
Member

azat commented Dec 22, 2021

I can't find this option in the PR. Indeed, I have a very old fork, from 2019.

AFAIR it is not about the fork's age (and I had an even older fork), but about forks in some orgs.

@CurtizJ
Member

CurtizJ commented Dec 23, 2021

I have no idea why this happened, do you know why?

@taiyang-li you can commit as usual and it should work well.

@taiyang-li
Contributor Author

@taiyang-li you can commit as usual and it should work well.

ok

@taiyang-li
Contributor Author

@CurtizJ If I want to wrap the function DB::MergeTreePartition::load with USE_ROCKSDB, which style do you advise?

First style:

#if USE_ROCKSDB
    void load(const MergeTreeData & storage, const PartMetaCachePtr & meta_cache, const DiskPtr & disk, const String & part_path);
#else
    void load(const MergeTreeData & storage, const DiskPtr & disk, const String & part_path);
#endif

Second style:

    void load(const MergeTreeData & storage,
#if USE_ROCKSDB
              const PartMetaCachePtr & meta_cache,
#endif
              const DiskPtr & disk,
              const String & part_path);

@CurtizJ
Member

CurtizJ commented Dec 27, 2021

I prefer the first variant.

@taiyang-li
Contributor Author

[screenshot]

It's weird

@taiyang-li
Contributor Author

The failed test 00062_replicated_merge_tree_alter_zookeeper_long can't be reproduced either.

@taiyang-li
Contributor Author

All checks have passed @CurtizJ

@CurtizJ
Member

CurtizJ commented Mar 29, 2022

@taiyang-li Some tests have failed because of the randomization of settings. The CI has been unstable for the last several days, sorry for that.

The checks weren't started at all for the last commit, but the previous commit (before the merge of master) was ok, so let's merge.

@CurtizJ CurtizJ merged commit d42632d into ClickHouse:master Mar 29, 2022
@taiyang-li
Contributor Author

@CurtizJ Thanks!

azat referenced this pull request in azat/ClickHouse Jul 18, 2022
getauxval() from glibc-compatibility did not always work correctly:

- It does not work after setenv(), and this breaks vsyscalls,
  like sched_getcpu() [1] (and BaseDaemon.cpp always set TZ if timezone
  is defined, which is true for CI [2]).

  Also note that fixing setenv() will not fix LSan,
  since the culprit is getauxval().

  [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1163404
  [2]: ClickHouse#32928 (comment)

- Another thing that is definitely broken is LSan (Leak Sanitizer): it
  relies on a working getauxval(), but that does not work if __environ is
  not initialized yet (there is even a commit about this).

  And because of this, at least one leak had been introduced [3]:

    [3]: ClickHouse#33840

Fix this by using /proc/self/auxv with a fallback to the environ-based
solution, to make it compatible with environments that do not allow
reading from auxv (or have no procfs).

v2: add fallback to environ solution
v3: fix return value for __auxv_init_procfs()
Refs: ClickHouse#33957
Signed-off-by: Azat Khuzhin <[email protected]>
azat referenced this pull request in azat/ClickHouse Jul 24, 2022
getauxval() from glibc-compatibility did not always work correctly:

- It does not work after setenv(), and this breaks vsyscalls,
  like sched_getcpu() [1] (and BaseDaemon.cpp always set TZ if timezone
  is defined, which is true for CI [2]).

  Also note that fixing setenv() will not fix LSan,
  since the culprit is getauxval().

  [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1163404
  [2]: ClickHouse#32928 (comment)

- Another thing that is definitely broken is LSan (Leak Sanitizer): it
  relies on a working getauxval(), but that does not work if __environ is
  not initialized yet (there is even a commit about this).

  And because of this, at least one leak had been introduced [3]:

    [3]: ClickHouse#33840

Fix this by using /proc/self/auxv with a fallback to the environ-based
solution, to make it compatible with environments that do not allow
reading from auxv (or have no procfs).

v2: add fallback to environ solution
v3: fix return value for __auxv_init_procfs()
(cherry picked from commit f187c34)
v4: more verbose message on errors; CI found [1]:
    AUXV already has value (529267711)
    [1]: https://s3.amazonaws.com/clickhouse-test-reports/39103/2325f7e8442d1672ce5fb43b11039b6a8937e298/stress_test__memory__actions_.html
v5: break at AT_NULL
v6: ignore AT_IGNORE
v7: suppress TSan and remove superior check to avoid abort() in case of race
v8: proper suppressions (not inner function but itself)
Refs: ClickHouse#33957
Signed-off-by: Azat Khuzhin <[email protected]>
@ardenwick
Contributor

ardenwick commented Sep 6, 2022

@taiyang-li Just out of curiosity, how did you notice or observe the performance bottleneck of metadata loading? Are there any tools that help? Thanks.

@taiyang-li
Contributor Author

@cangyin You can see that disk IO utilization was very high before the optimization, mainly because of the many read operations on small metadata files. So using rocksdb to speed up loading was a natural idea.

@liby888

liby888 commented Jan 30, 2026

We found that using this cache capability indeed resolved the startup time issue, but we also discovered a memory leak. Especially under high concurrency, the host would hit the memory limit and restart about once a week. Although this MR was later removed, the memory leak issue was never resolved, and we would like to continue using this feature. Here is the leaked stack:

==3135795==
==3135795== 8 bytes in 1 blocks are definitely lost in loss record 111 of 14,204
==3135795==    at 0x1B600816: operator new[](unsigned long) (vg_replace_malloc.c:579)
==3135795==    by 0x197CAE76: rocksdb::VersionStorageInfo::VersionStorageInfo(rocksdb::InternalKeyComparator const*, rocksdb::Comparator const*, int, rocksdb::CompactionStyle, rocksdb::VersionStorageInfo*, bool) (in /bin/clickhouse)
==3135795==    by 0x197CB5FD: rocksdb::Version::Version(rocksdb::ColumnFamilyData*, rocksdb::VersionSet*, rocksdb::FileOptions const&, rocksdb::MutableCFOptions, std::__1::shared_ptr<rocksdb::IOTracer> const&, unsigned long) (in /bin/clickhouse)
==3135795==    by 0x197DC70B: rocksdb::VersionSet::CreateColumnFamily(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit const*) (in /bin/clickhouse)
==3135795==    by 0x19804C27: rocksdb::VersionEditHandler::CreateCfAndInit(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit const&) (in /bin/clickhouse)
==3135795==    by 0x19804B7D: rocksdb::VersionEditHandler::Initialize() (in /bin/clickhouse)
==3135795==    by 0x19803A32: rocksdb::VersionEditHandlerBase::Iterate(rocksdb::log::Reader&, rocksdb::Status*) (in /bin/clickhouse)
==3135795==    by 0x197DDA74: rocksdb::VersionSet::Recover(std::__1::vector<rocksdb::ColumnFamilyDescriptor, std::__1::allocator<rocksdb::ColumnFamilyDescriptor>> const&, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>) (in /bin/clickhouse)
==3135795==    by 0x19675E2A: rocksdb::DBImpl::Recover(std::__1::vector<rocksdb::ColumnFamilyDescriptor, std::__1::allocator<rocksdb::ColumnFamilyDescriptor>> const&, bool, bool, bool, unsigned long) (in /bin/clickhouse)
==3135795==    by 0x1967EA83: rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::vector<rocksdb::ColumnFamilyDescriptor, std::__1::allocator<rocksdb::ColumnFamilyDescriptor>> const&, std::__1::vector<rocksdb::ColumnFamilyHandle*, std::__1::allocator<rocksdb::ColumnFamilyHandle*>>*, rocksdb::DB**, bool, bool) (in /bin/clickhouse)
==3135795==    by 0x1967DFFE: rocksdb::DB::Open(rocksdb::Options const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, rocksdb::DB**) (in /bin/clickhouse)
==3135795==    by 0x153D2B81: DB::MergeTreeMetadataCache::create(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, unsigned long) (in /bin/clickhouse)

@alexey-milovidov
Member

Yes, that's why we removed this feature a few years ago (and burned it).
