Additional primary key scan for skip index FINAL queries #70210
shiyer7474 wants to merge 27 commits into ClickHouse:master from …for_skip_index_final
Conversation
A couple of points to review in the solution:
Test results (incorrect) without the patch, using the secondary skip index scan:
Correct test results with the patch (more rows scanned, but far fewer than a full table scan):
src/Core/Settings.cpp (outdated)
- 0 — Disabled.
- 1 — Enabled.
- >= 2 — Limit on the number of granules returned by the skip index, beyond which the feature is disabled.
Seems a bit unusual. Is there any other setting that has multiple special values like this together with the simple semantics of a numerical value?
Maybe reconsider this one?
Sure, good point, I just went for a quick-fix solution.
A knob is needed to turn the fix ON/OFF. It is perfectly safe to set the existing use_skip_indexes_if_final to 1 if it is known that the secondary index column value never mutates after insert (even if other column values are updated). Thus enabling this new fix by default would cause a performance regression, as more rows could be scanned than at present.
Some additional protection may be required if the first-step secondary index lookup returns hundreds of granules. The KeyCondition for the second step could then contain hundreds of RPN nodes and consume memory/CPU during the repeated searches. Hence a setting is needed for bailing out of the fix if too many ranges are returned by the first step. An example of the KeyCondition where the primary key has 3 columns and 12 granules are returned in the first pass:
(column 0 in ['303ed4c698', '32b30a250a']), (column 1 in [80, 99]), and, (column 2 in [647080, 809299]), and,
(column 0 in ['5ea1649a31', '5fd0b37cd7']), (column 1 in [64, 83]), and, (column 2 in [537064, 114283]), and,
or, (column 0 in ['6da37dd313', '6f2268bd1d']), (column 1 in [17, 36]), and, (column 2 in [331617, 753836]), and,
or, (column 0 in ['7143d7fbad', '7380ad8a67']), (column 1 in [56, 75]), and, (column 2 in [781056, 497275]), and,
or, (column 0 in ['8f468c873a', '912d2b1c7b']), (column 1 in [5, 24]), and, (column 2 in [968605, 751824]), and,
or, (column 0 in ['c15da1f2b5', 'c32d9bf27a']), (column 1 in [47, 66]), and, (column 2 in [788247, 979466]), and,
or, (column 0 in ['5b8add2a5d', '5ea1649a31']), (column 1 in [44, 64]), and, (column 2 in [317844, 537064]), and,
or, (column 0 in ['a1d0c6e83f', 'a4a042cf4f']), (column 1 in [78, 97]), and, (column 2 in [42478, 171697]), and,
or, (column 0 in ['a64c94baaf', 'a87ff679a2']), (column 1 in [16, 36]), and, (column 2 in [898816, 4036]), and,
or, (column 0 in ['05049e90fa', '0640966322']), (column 1 in [57, 76]), and, (column 2 in [367657, 158876]), and,
or, (column 0 in ['7b13b22030', '7eabe3a164']), (column 1 in [13, 32]), and, (column 2 in [993613, 206832]), and,
or, (column 0 in ['f4be00279e', 'f7e9050c92']), (column 1 in [27, 65]), and, (column 2 in [528227, 854665]), and, or
I will rework this setting stuff.
This is an automated comment for commit 56fc3dc with a description of existing statuses, updated for the latest CI run. ❌
Successful checks
first_part_for_next_pass = part_index; /// we only need to check in newer parts
}
static constexpr size_t ranges_threshold = 1000;
if (range_count > ranges_threshold)
@nickitat Thank you for picking up this PR for review! Let me know your overall feedback and whether you have any queries. Specifically here: do you think this thresholding is required? I added this protection, but now feel that even if there are more ranges, the slight additional cost of the second "PK search" pass should be okay. For tables with billions of rows, this limit of 1000 ranges returned by the secondary index looks really low.
I think the main challenge here is to make it efficient for a wider set of tables and queries. The number of parts could be big (10^3-10^4), and the number of ranges even bigger by a couple of orders of magnitude.
We likely need a threshold to abandon the optimisation at some point. But before that we could:
- try to "compact" PK conditions. E.g. if we have conditions like in [1; 3], [5; 8], we could merge those into a single one: in [1; 8]. This way we keep the number of conditions limited. It is also possible to decide at some point that we have gone too far, the condition won't filter anything, and give up on the optimisation.
- what I was talking about in the issue: we extract non-intersecting parts of the PK and don't apply the merging algorithm to them at runtime. It seems we could also apply skipping indices to them as if they were parts of ordinary MergeTree tables.
ClickHouse/src/Processors/QueryPlan/ReadFromMergeTree.cpp
Lines 1361 to 1375 in 380d55b
My understanding is that compaction/merge of adjacent ranges is already done by existing code in
Hmmm...
The code in
This PR's current implementation of using a
@nickitat Let me know your opinion! I am testing the above code implementation.
The failures in the 5 fuzzer suites are the same: the new code checks that the PK columns have a strict weak ordering, while the fuzzer changes the PK column to Nullable. The exception tells the user to rerun the query with use_skip_indexes_if_final_exact_mode=0.
@nickitat Please review the latest upload! I have reworked the implementation to use a simple sort + intersect to find the expanded set of PK ranges. As a data point, the earlier
Another alternative that I analyzed was:
Step 1: Do not use skip indexes in the first pass
The above is good if the proportion of non-intersecting ranges is high, but all the intersecting ranges will have to be read. Comparing the above with what is implemented: Step 2 and Step 3 in the first solution roughly correspond to Step 2 and Step 1 in the second solution, at the same cost. Step 3 in the second solution looks extra, but should be minimal cost if the skip index was effective. Solution 1 could read a lot of ranges unnecessarily (the original intersecting set). The current solution can be improved by avoiding Step 3: Step 2 itself can emit the intersecting and non-intersecting ranges while performing the merge!
The only test failure, in the TSAN suite, is in
I am closing this PR; will reopen if any review activity comes up.
@nickitat ping, could you please take a look?
Thanks for including this in the 2025 roadmap! Let me complete the other approach and get some numbers. #70210 (comment)
I had to refresh and merge to master, and also move my git setup to a new VM. The latest CI run shows 2 job failures; the failing tests are
Both failures have been seen in other CI runs in the last week.
I completed the current implementation (the earlier version did not cover all cases of range intersection) and performed extensive benchmarking of the current implementation against an alternate implementation based on sorting (similar to the algorithm in
The alternate implementation is slower, with close to 50% higher elapsed time than the current one. The additional time goes into sorting the single combined list of range starts and range ends. In the current implementation, sorting the separate "selected" and "rejected" range lists (range starts only) is faster because range starts within a part are already ordered. The code change to coalesce ranges in
For some absolute numbers from sub-optimal use cases with a random primary key:
There has been a regression in the last 2 weeks. The same statement used to finish in the range of 3.4 - 3.8 seconds, so that is a regression of more than 50%. The following commit, introducing a skipping index cache, looks to have caused this:
Disabling the skip index cache degrades the performance even more:
But running against a 2-week-old repo without the skip index cache commit shows a much better elapsed time than with the skip index cache either enabled or disabled:
The above numbers are from an 8-CPU machine (model name: Intel(R) Xeon(R) CPU E5-1630 v3 @ 3.70GHz). With 2.7 million granules (2747434 exactly), the overhead of the feature is about 3.4 seconds (without the skip index cache feature), and there is a gain of close to a second further down when splitting part ranges into layers. That's just about 2.5 seconds in the worst case!
These 3 functions are at the top compared to a report from 2 weeks ago.
On a 48-CPU VM with max_threads=48, the skipping index cache causes a significant increase in execution time:
That's just 2 seconds for looking up the minmax index and then performing the "PrimaryKeyExpand" procedure from this PR over close to 3M granules.
Table structure:
Dear @nickitat, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself. |
DECLARE(Bool, use_skip_indexes_if_final_exact_mode, 0, R"(
Controls whether granules returned by a skipping index are expanded in newer parts to return correct results when executing a query with the FINAL modifier.

Using skip indexes may exclude rows (granules) containing the latest data which could lead to incorrect results. This setting can ensure that correct results are returned by scanning newer parts that have overlap with the ranges returned by the skip index. (Experimental)
Suggested change (drop the trailing "(Experimental)"):
Using skip indexes may exclude rows (granules) containing the latest data which could lead to incorrect results. This setting can ensure that correct results are returned by scanning newer parts that have overlap with the ranges returned by the skip index.
ack, will correct in the next upload.
if (add_index_stat_row_for_pk_expand)
{
    result.index_stats.emplace_back(ReadFromMergeTree::IndexStat{
        .type = ReadFromMergeTree::IndexType::PrimaryKeyExpand,
IMO, it will confuse more than help. Let's use the Skip type and elaborate only in the description.
MarkRange range;
size_t part_index;
EventType event;
bool selected;
It would be nice to have a comment here
Good point, will add in the next upload.
for (auto & all_part_range : all_part_ranges)
{
    const auto & index_granularity = all_part_range.data_part->index_granularity;
    all_part_range.ranges = MarkRanges{{MarkRange{0, index_granularity->getMarksCountWithoutFinal()}}};
I'm a little confused by this code. If we filtered smth by PK condition, will we discard these filtering results and fall back to reading the whole part?
If a part has been completely filtered out / rejected before skip index filtering, then we will not check ranges in that part for intersection. All parts which have 1 or more ranges shortlisted will undergo this new range intersection procedure. I will show an example in the next upload.
earliest_part_found = true;
}
if (!earliest_part_found)
I'm not getting the purpose of earliest_part_found and this condition. Could you elaborate, pls?
It was an optimization (incorrect) for which I was looking for feedback. I have removed it.
    }
}

LOG_TRACE(logger, "findPKRangesForFinalAfterSkipIndex : processed {} parts, selected ranges {}, rejected ranges {}", ranges_in_data_parts.size(), selected_ranges.size(), rejected_ranges.size());
For consistency:
LOG_TRACE(logger, "findPKRangesForFinalAfterSkipIndex : processed {} parts, selected {} ranges, rejected {} ranges", ranges_in_data_parts.size(), selected_ranges.size(), rejected_ranges.size());
And we should probably log this at the very end, when we know the actual numbers; otherwise it will be very misleading.
ack, will correct in the next upload.
std::vector<PartsRangesIterator>::iterator selected_ranges_iter = selected_ranges.begin();
std::vector<PartsRangesIterator>::iterator rejected_ranges_iter = rejected_ranges.begin();

while (selected_ranges_iter != selected_ranges.end() && rejected_ranges_iter != rejected_ranges.end())
We need to devise a way to test this logic more thoroughly.
Sure, let's discuss.
while (selected_ranges_iter != selected_ranges.end() && rejected_ranges_iter != rejected_ranges.end())
{
    auto start_value1 = selected_ranges_iter->value;
selected and rejected instead of 1 and 2 would really improve readability
Yes, perfect, the code looks better with variables named selected_range_start, rejected_range_start, etc.
Thank you @nickitat, taking a look.
I am closing this PR as I am unable to push commits into the PR branch from my new git account. I will shortly add the new PR number here, thanks!
Background in #34243 and #31411. The patch makes sure that correct results are returned by a FINAL query using a skip index, and should significantly improve the performance of FINAL queries. The solution takes the initial set of PK ranges returned by the skip index and then scans those PK ranges across all unmerged parts.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
New setting introduced: use_skip_indexes_if_final_exact_mode. If a query has the FINAL clause, skipping data based on indexes may produce incorrect results. This setting ensures that correct results are returned by scanning newer parts that overlap with the ranges returned by the skip index. Set to 0 to disable, 1 to enable.
CI Settings (Only check the boxes if you know what you are doing):