Additional primary key scan for skip index FINAL queries #70210
shiyer7474 wants to merge 27 commits into ClickHouse:master from …for_skip_index_final
Conversation
A couple of points to review in the solution:
Test results (incorrect) without the patch, using the secondary skip index scan:
Correct test results with the patch (more rows scanned, but far fewer than a full table scan):
src/Core/Settings.cpp (outdated)
- 0 — Disabled.
- 1 — Enabled.
- >= 2 — Limit on the number of granules returned by the skip index, beyond which the feature is disabled.
Seems a bit unusual. Is there any other setting that has multiple special values like this together with the simple semantics of a numerical value?
Maybe reconsider this one?
Sure, good point, I just went for a quick-fix solution.
A knob is needed to turn the fix ON/OFF. It is perfectly safe to set the existing use_skip_indexes_if_final to 1 if it is known that the secondary index column value never mutates after insert (even if other column values are updated). Thus enabling this new fix by default would cause a performance regression, as more rows could be scanned than at present.
Some additional protection may be required if the first-step secondary index lookup returns hundreds of granules. The KeyCondition for the second step could then contain hundreds of RPN nodes and consume memory/CPU during the repeated searches. Hence a setting is needed for bailing out of the fix if too many ranges are returned by the first step. An example of the KeyCondition where the primary key has 3 columns and 12 granules are returned in the first pass:
(column 0 in ['303ed4c698', '32b30a250a']), (column 1 in [80, 99]), and, (column 2 in [647080, 809299]), and,
(column 0 in ['5ea1649a31', '5fd0b37cd7']), (column 1 in [64, 83]), and, (column 2 in [537064, 114283]), and,
or, (column 0 in ['6da37dd313', '6f2268bd1d']), (column 1 in [17, 36]), and, (column 2 in [331617, 753836]), and,
or, (column 0 in ['7143d7fbad', '7380ad8a67']), (column 1 in [56, 75]), and, (column 2 in [781056, 497275]), and,
or, (column 0 in ['8f468c873a', '912d2b1c7b']), (column 1 in [5, 24]), and, (column 2 in [968605, 751824]), and,
or, (column 0 in ['c15da1f2b5', 'c32d9bf27a']), (column 1 in [47, 66]), and, (column 2 in [788247, 979466]), and,
or, (column 0 in ['5b8add2a5d', '5ea1649a31']), (column 1 in [44, 64]), and, (column 2 in [317844, 537064]), and,
or, (column 0 in ['a1d0c6e83f', 'a4a042cf4f']), (column 1 in [78, 97]), and, (column 2 in [42478, 171697]), and,
or, (column 0 in ['a64c94baaf', 'a87ff679a2']), (column 1 in [16, 36]), and, (column 2 in [898816, 4036]), and,
or, (column 0 in ['05049e90fa', '0640966322']), (column 1 in [57, 76]), and, (column 2 in [367657, 158876]), and,
or, (column 0 in ['7b13b22030', '7eabe3a164']), (column 1 in [13, 32]), and, (column 2 in [993613, 206832]), and,
or, (column 0 in ['f4be00279e', 'f7e9050c92']), (column 1 in [27, 65]), and, (column 2 in [528227, 854665]), and, or
I will rework this setting stuff.
This is an automated comment for commit 56fc3dc with a description of existing statuses, updated for the latest CI run. ❌
Successful checks
first_part_for_next_pass = part_index; /// we only need to check in newer parts
}
static constexpr size_t ranges_threshold = 1000;
if (range_count > ranges_threshold)
@nickitat Thank you for picking up this PR for review! Let me know your overall feedback and whether you have any queries. Specifically here: do you think this thresholding is required? I added this protection, but now feel that even if there are more ranges, the slight additional cost of the second "PK search" pass should be okay. For tables with billions of rows, this limit of 1000 ranges returned by the secondary index looks really low.
I think the main challenge here is to make it efficient for a wider set of tables and queries. The number of parts could be big (10^3-10^4), and the number of ranges even bigger by a couple of orders of magnitude.
We likely need a threshold to abandon the optimisation at some point. But before that we could:
- try to "compact" PK conditions. E.g. if we have conditions like in [1; 3], [5; 8], we could merge those into a single one: in [1; 8]. This way we keep the number of conditions limited. It is also possible to decide at some point that we have gone too far, the condition won't filter anything, and give up on the optimisation.
- what I was talking about in the issue: we extract non-intersecting parts of the PK and don't apply the merging algorithm to them at runtime. It seems we could also apply skipping indices to them as if they were parts of ordinary MergeTree tables.
ClickHouse/src/Processors/QueryPlan/ReadFromMergeTree.cpp
Lines 1361 to 1375 in 380d55b
My understanding is that compaction/merge of adjacent ranges is already done by existing code in
Hmmm...
The code in
This PR's current implementation of using a
@nickitat Let me know your opinion! I am testing the above code implementation.
The failures in the 5 fuzzer suites are the same: the new code checks that the PK columns have a strict weak ordering, while the fuzzer changes the PK column to Nullable. The exception tells the user to rerun the query with use_skip_indexes_if_final_exact_mode=0.
@nickitat Please review the latest upload! I have reworked the implementation to use a simple sort + intersect to find the expanded set of PK ranges. As a data point, the earlier
Another alternative that I analyzed was:
Step 1: Do not use skip indexes in the first pass
The above is good if the proportion of non-intersecting ranges is high, but all the intersecting ranges will have to be read. Comparing the above with what is implemented: Step 2 and Step 3 in the first solution roughly correspond to Step 2 and Step 1 in the second solution, at the same cost. Step 3 in the second solution looks extra, but should be minimal cost if the skip index was effective. Solution 1 could read a lot of ranges unnecessarily (the original intersecting set). The current solution can be improved by avoiding Step 3: Step 2 itself can emit the intersecting and non-intersecting ranges while performing the merge!
The only test failure, in the TSAN suite, is in
I am closing this PR; will reopen if any review activity comes up.
@nickitat ping, could you please take a look?
Thanks for including this in the 2025 roadmap! Let me complete the other approach and get some numbers. #70210 (comment)
I had to refresh and merge to master, and also move my git setup to a new VM. The latest CI run shows 2 job failures; the failing tests are
Both failures have been seen in other CI runs in the last week.
I completed the current implementation (the earlier version did not cover all cases of range intersection) and performed extensive benchmarking of the current implementation against an alternate implementation based on sorting (similar to the algorithm in
The alternate implementation is slower, with close to 50% higher elapsed time than the current one. The additional time goes into sorting the single combined list of range starts and range ends. In the current implementation, sorting the separate "selected" and "rejected" range lists (range starts only) is faster because range starts within a part are already ordered. The code change to coalesce ranges in
For some absolute numbers from sub-optimal use cases with a random primary key:
There has been a regression in the last 2 weeks. The same statement used to finish in the range of 3.4 - 3.8 seconds, so that is a regression of more than 50%. The following commit, introducing a skipping index cache, looks to have caused this:
Disabling the skip index cache degrades the performance even more:
But running against a 2-week-old repo without the skip index cache commit shows a much better elapsed time than with the skip index cache either enabled or disabled:
The above numbers are from an 8-CPU machine (model name: Intel(R) Xeon(R) CPU E5-1630 v3 @ 3.70GHz). With 2.7 million granules (2747434 exactly), the overhead of the feature is about 3.4 seconds (without the skip index cache feature), and there is a gain of close to a second further down when splitting part ranges into layers. That's just about 2.5 seconds in the worst case!
These 3 functions are at the top compared to a report from 2 weeks ago.
On a 48-CPU VM with max_threads=48, the skipping index cache causes a significant increase in execution time:
That's just 2 seconds for looking up the minmax index and then performing the "PrimaryKeyExpand" procedure from this PR over close to 3M granules.
Table structure:
Dear @nickitat, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself. |
DECLARE(Bool, use_skip_indexes_if_final_exact_mode, 0, R"(
Controls whether granules returned by a skipping index are expanded in newer parts to return correct results when executing a query with the FINAL modifier.

Using skip indexes may exclude rows (granules) containing the latest data which could lead to incorrect results. This setting can ensure that correct results are returned by scanning newer parts that have overlap with the ranges returned by the skip index. (Experimental)
Suggested change (drop the trailing "(Experimental)"):
Using skip indexes may exclude rows (granules) containing the latest data which could lead to incorrect results. This setting can ensure that correct results are returned by scanning newer parts that have overlap with the ranges returned by the skip index.
ack, will correct in the next upload.
if (add_index_stat_row_for_pk_expand)
{
    result.index_stats.emplace_back(ReadFromMergeTree::IndexStat{
        .type = ReadFromMergeTree::IndexType::PrimaryKeyExpand,
IMO, it will confuse more than help. Let's use the Skip type and elaborate only in the description.
MarkRange range;
size_t part_index;
EventType event;
bool selected;
It would be nice to have a comment here
Good point, will add in the next upload.
for (auto & all_part_range : all_part_ranges)
{
    const auto & index_granularity = all_part_range.data_part->index_granularity;
    all_part_range.ranges = MarkRanges{{MarkRange{0, index_granularity->getMarksCountWithoutFinal()}}};
I'm a little confused by this code. If we filtered smth by PK condition, will we discard these filtering results and fall back to reading the whole part?
If a part has been completely filtered out / rejected before skip index filtering, then we will not check ranges in that part for intersection. All parts which have 1 or more ranges shortlisted will undergo this new range intersection procedure. I will show an example in the next upload.
earliest_part_found = true;
}
if (!earliest_part_found)
I'm not getting the purpose of earliest_part_found and this condition. Could you elaborate, pls?
It was an optimization (incorrect) for which I was looking for feedback. I have removed it.
    }
}

LOG_TRACE(logger, "findPKRangesForFinalAfterSkipIndex : processed {} parts, selected ranges {}, rejected ranges {}", ranges_in_data_parts.size(), selected_ranges.size(), rejected_ranges.size());
For consistency:
LOG_TRACE(logger, "findPKRangesForFinalAfterSkipIndex : processed {} parts, selected {} ranges, rejected {} ranges", ranges_in_data_parts.size(), selected_ranges.size(), rejected_ranges.size());
And we should probably log this at the very end, when we know the actual numbers; otherwise it will be very misleading.
ack, will correct in the next upload.
std::vector<PartsRangesIterator>::iterator selected_ranges_iter = selected_ranges.begin();
std::vector<PartsRangesIterator>::iterator rejected_ranges_iter = rejected_ranges.begin();

while (selected_ranges_iter != selected_ranges.end() && rejected_ranges_iter != rejected_ranges.end())
We need to devise a way to test this logic more thoroughly.
Sure, let's discuss.
while (selected_ranges_iter != selected_ranges.end() && rejected_ranges_iter != rejected_ranges.end())
{
    auto start_value1 = selected_ranges_iter->value;
selected and rejected instead of 1 and 2 would really improve readability
Yes, perfect, the code looks better with variables named selected_range_start, rejected_range_start, etc.
Thank you @nickitat, taking a look.
I am closing this PR as I am unable to push commits into the PR branch from my new git account. I will shortly add the new PR number here, thanks!
Background in #34243 and #31411. The patch makes sure that correct results are returned by a FINAL query using a skip index, and should significantly improve the performance of FINAL queries. The solution takes the initial set of PK ranges returned by the skip index and then scans those PK ranges across all unmerged parts.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
New setting introduced: use_skip_indexes_if_final_exact_mode. If a query has the FINAL clause, skipping data based on indexes may produce incorrect results. This setting ensures that correct results are returned by scanning newer parts that overlap with the ranges returned by the skip index. Set to 0 to disable, 1 to enable.
CI Settings (Only check the boxes if you know what you are doing):