Add metrics for part number in MergeTree in ClickHouse#17838
Add metrics for part number in MergeTree in ClickHouse#17838alexey-milovidov merged 17 commits intoClickHouse:masterfrom
Conversation
|
BTW, it's a summary metric, why not use |
When we are building the monitoring system of clickhouse, we find the number of parts is an important indicator of the overall situation of a cluster. We recommend adding this information to the default metrics just like It is worth mentioning that it is indeed possible to collect monitoring data from
|
We can also get information about |
|
@akuzm I have added functional tests for the code. Under normal circumstances, the number of Parts is the same as the data aggregated by |
|
@weeds085490 The code looks Ok but the tests deadlocked. I don't see the reason. |
Thanks for reviewing. I have merged with master. PTAL |
@alexey-milovidov |
|
| LOG_TRACE(log, "dropAllData: removing data from memory."); | ||
|
|
||
| DataPartsVector all_parts(data_parts_by_info.begin(), data_parts_by_info.end()); | ||
| DataPartsVector committed_parts = getDataPartsVector({DataPartState::Committed}); |
There was a problem hiding this comment.
Doesn't it involve recursive mutex locking (see a few lines above)?
There was a problem hiding this comment.
Doesn't it involve recursive mutex locking (see a few lines above)?
That did cause a deadlock. If I use an Ordinary database, the deadlock will definitely occur when dropping table.
@alexey-milovidov I have fixed it. PTAL.
|
@alexey-milovidov I have modified the code to make the monitored data consistent with the data obtained by the
It becomes consistent after the first query possibly caused by |
|
@weeds085490 |
During the SQL execution of clickhouse, the input data is not based on the same snapshot. For the same filter, we may get two different data in the same SQL when other threads are inserting new data. Let‘s look at execution log of the failed test: when executing the SQL: SELECT COUNT(1) FROM (SELECT SUM(IF(metric = 'Parts', value, 0)) AS Parts, SUM(IF(metric = 'PartsActive', value, 0)) AS PartsActive, SUM(IF(metric = 'PartsInactive', value, 0)) AS PartsInactive FROM system.metrics) as a INNER JOIN (SELECT toInt64(SUM(1)) AS Parts, toInt64(SUM(IF(active = 1, 1, 0))) AS PartsActive, toInt64(SUM(IF(active = 0, 1, 0))) AS PartsInactive FROM system.parts ) as b USING (Parts,PartsActive,PartsInactive);A new part: Then I do another test: Reducing the If I close metric_log, then this short-term inconsistency will never reproduce. @alexey-milovidov PTAL. |
@alexey-milovidov Hi, PTAL |
|
Ok but it means we need to rewrite the test in another way... |
|
@alexey-milovidov I have changed the test method. All functional tests have passed. PTAL |
|
Sorry, the test is still flaky, I will revert this PR and we need to resubmit it. |
| ) as b USING (Parts,PartsActive,PartsInactive)" | ||
|
|
||
| verify(){ | ||
| for _ in $(seq 1 10) |
There was a problem hiding this comment.
The test does not passed after merge - https://clickhouse-test-reports.s3.yandex.net/0/f91626e7ff352d87283e7b8dafed94fd6ef38c8d/fast_test.html#fail1
@weeds085490 can you please take a look? Maybe 10 retries is not enough?
|
It failed 4 out of 16 times. |
Ok. Let me take a look |
|
I will try to revive changes in #18955 |
|
@weeds085490 See #19122. |




I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add metrics(Parts, PartsActive, PartsInactive) for part number in MergeTree in clickhouse
Detailed description / Documentation draft:
PartsActivecorresponds to the commited parts.PartsInactivecorresponds to the parts which are not in commited status.Partscorresponds to the total number of parts in clickhouse.We need to deal with three scenarios to calculate
Parts、PartsInactive、PartsInactive.