
Optimize ORDER BY...LIMIT N queries by skip index and dynamic threshold filter#89835

Merged
shankar-iyer merged 40 commits into ClickHouse:master from shankar-iyer:top_n_optimization_by_skip_index_and_prewhere
Dec 7, 2025

Conversation

@shankar-iyer
Member

@shankar-iyer shankar-iyer commented Nov 11, 2025

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Optimize ORDER BY...LIMIT N queries by using skip index and dynamic threshold filter to significantly reduce rows processed.

Details

Implements optimizations for top-N query processing. All share a common theme:

  1. If a query has no predicates and a minmax index exists on the ORDER BY column, use the minmax index to shortlist the top qualifying granules, e.g. only ~478K rows processed out of 100M below:
SELECT URL, EventTime
FROM hits_indexed
ORDER BY EventTime
LIMIT 10 
SETTINGS use_skip_indexes_for_top_n=1;

10 rows in set. Elapsed: 0.011 sec. Processed 477.66 thousand rows, 5.73 MB 
  2. [Open/TODO] Extension of the above: if a query has predicates and the primary index / skip indexes can identify granules that will definitely contain at least one row matching the predicate, proceed with the above optimization, e.g. a range scan on the primary key, or a text index lookup (e.g. SELECT * FROM logs WHERE hasToken(message, 'OOM') ORDER BY timestamp LIMIT 10). This item needs some work and discussion.
  3. For general queries (not covered by 1 and 2), implements dynamic top-N threshold filtering using the minmax skip index. As the query executes and the top-N values evolve, the current Nth value (the threshold) can be used to look up the minmax index and skip granules that cannot improve the top-N result set, e.g.
SELECT URL, EventTime
FROM hits_indexed
WHERE URL LIKE '%google%' 
ORDER BY EventTime
LIMIT 10;

10 rows in set. Elapsed: 0.280 sec. Processed 100.00 million rows, 9.91 GB

SELECT URL, EventTime
FROM hits_indexed
WHERE URL LIKE '%google%' 
ORDER BY EventTime LIMIT 10 
SETTINGS use_skip_indexes_for_top_n=1, use_top_n_dynamic_filtering=1, use_skip_indexes_on_data_read=1;
10 rows in set. Elapsed: 0.040 sec. Processed 10.08 million rows, 674.31 MB
  4. Next step implemented in this PR: if a minmax skip index is not present, still use the top-N threshold to execute a PREWHERE on the sorting column. This optimization is useful when there are other costly predicates in the query, so evaluating the dynamic PREWHERE and rejecting rows early can turn out cheaper. This was initially proposed in WIP some perf optimizations #81944.
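As an illustration of the minmax-based skipping in (1) and (3), here is a minimal Python sketch (not the actual C++ implementation; the granule layout and names are invented for illustration): each granule carries a [min, max] for the sort key, and once the running top-N is full, its Nth value becomes a threshold that whole granules can be tested against.

```python
import heapq

def top_n_scan(granules, limit):
    """Ascending top-N over granules of (min_key, max_key, rows).
    Skips any granule whose minimum key already exceeds the current
    Nth smallest value (the threshold), mimicking a minmax skip index."""
    heap = []  # max-heap via negation: the `limit` smallest keys so far
    rows_read = 0
    for g_min, g_max, rows in granules:
        threshold = -heap[0] if len(heap) == limit else None
        if threshold is not None and g_min > threshold:
            continue  # no row in this granule can enter the top-N
        rows_read += len(rows)
        for key in rows:
            if len(heap) < limit:
                heapq.heappush(heap, -key)
            elif key < -heap[0]:
                heapq.heapreplace(heap, -key)
    return sorted(-k for k in heap), rows_read
```

With three 10-row granules covering [0, 9], [10, 19] and [20, 29] and LIMIT 5, only the first granule is read; the threshold lets the other two be skipped entirely.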

I still need to fix some issues; high-level and low-level feedback welcome!

@shankar-iyer shankar-iyer marked this pull request as draft November 11, 2025 08:50
@clickhouse-gh
Contributor

clickhouse-gh bot commented Nov 11, 2025

Workflow [PR], commit [5f997d2]

Summary:

  • Stateless tests (amd_tsan, s3 storage, sequential, 2/2): failure - 03708_flush_async_insert_queue_for_table (FAIL, cidb)
  • AST fuzzer (amd_tsan): failure - Logical error: Cannot fold actions for projection. Node A requires input B which does not belong to projection (FAIL, cidb)
  • BuzzHouse (amd_debug): failure - Logical error: Inconsistent AST formatting: the query: (FAIL, cidb)
  • BuzzHouse (amd_ubsan): failure - UndefinedBehaviorSanitizer: undefined behavior (STID: 3075-3638) (FAIL, cidb)
  • Stateless tests (arm_asan, targeted): error

@shankar-iyer
Member Author

@alexey-milovidov

@clickhouse-gh clickhouse-gh bot added the pr-performance Pull request with some performance improvements label Nov 11, 2025
@EmeraldShift
Contributor

EmeraldShift commented Nov 11, 2025

This sounds awesome, thank you! I wonder if there's a slightly slower, but more general, version of (2) that works with predicates, by roughly ranking the ranges to read by the minmax index or statistics for the ORDER BY column, then reading in that order?

For example, most tables are partitioned by time, and within a partition, recent parts are more likely to contain recent data, so a query with ORDER BY timestamp DESC might be executed a lot more efficiently by scanning partitions in reverse order, then parts in reverse block-number order, then granules in some possibly shuffled order.

Maybe it will cause more random I/O, but I guess you're overall saving I/O by skipping low-sorted granules, and with a LIMIT, you're able to force the dynamic TopN threshold to fill with the top values first, so you can skip more granules overall.

Would this be possible? How does it sound?
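The ranking idea above can be sketched like this (illustrative Python, not ClickHouse code): order read ranges by their minmax upper bound for a descending sort, and stop once the current Nth value proves the remaining ranges irrelevant.

```python
import heapq

def ranked_scan_desc(granules, limit):
    """Descending top-N: visit granules of (min_key, max_key, rows) in
    decreasing order of max_key and stop early once the Nth-largest
    value seen is >= the next granule's max."""
    ordered = sorted(granules, key=lambda g: g[1], reverse=True)
    heap = []  # min-heap of the `limit` largest keys seen so far
    rows_read = 0
    for g_min, g_max, rows in ordered:
        if len(heap) == limit and heap[0] >= g_max:
            break  # all remaining granules have an even smaller max
        rows_read += len(rows)
        for key in rows:
            if len(heap) < limit:
                heapq.heappush(heap, key)
            elif key > heap[0]:
                heapq.heapreplace(heap, key)
    return sorted(heap, reverse=True), rows_read
```

Because the best-ranked range fills the top-N first, the early-exit condition fires as soon as possible, trading some random I/O for skipping low-sorted granules entirely.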

@EmeraldShift
Contributor

An even easier alternative to my proposal might be to do some internal "request chunking", like was recently introduced in ClickStack, where the query is broken up into ranges over the ORDER BY column, and the top ranges' queries are executed first, all before the sorting and limiting, so the lower ranges are never queried once the limit is hit.

@ahmadov ahmadov self-assigned this Nov 12, 2025
@shankar-iyer
Member Author

@EmeraldShift Thank you for the inputs and sorry for the late reply!

you're able to force the dynamic TopN threshold to fill with the top values first, so you can skip more granules overall.
Would this be possible? How does it sound?

This sounds cool, and if we do find a good threshold, say within reading/sorting just 5%-10% of the data, that would really let the query skip reading a vast number of granules. Let us try to firm up some of this logic in this PR, like you mentioned in your next comment. Maybe we add some specific logic for enqueueing read ranges if sorting is by Date / DateTime columns.

An even easier alternative to my proposal might be to do some internal "request chunking", like was recently introduced in ClickStack, where the query is broken up into ranges over the ORDER BY column, and the top ranges' queries are executed first, all before the sorting and limiting, so the lower ranges are never queried once the limit is hit.

Let me check this.

@shankar-iyer
Member Author

shankar-iyer commented Nov 14, 2025

Some interesting runs with the optimizations -

Total rows in hits_indexed = 100 million
URLs like '%google%' - only 15,911
URLs like '%yandex%' - 14,914,657 (≈14.9 million)

Select top 10 with URL like '%google%' - good impact

select * from hits_indexed where URL like '%google%' order by EventTime limit 10 format null settings use_skip_indexes_for_top_n=0,use_top_n_dynamic_filtering=0,use_query_condition_cache=0;

Base
0 rows in set. Elapsed: 0.368 sec. Processed 100.00 million rows, 10.07 GB (271.56 million rows/s., 27.34 GB/s.)

With only skip index optimization (note the rows processed) - use_skip_indexes_for_top_n=1
0 rows in set. Elapsed: 0.102 sec. Processed 11.65 million rows, 845.89 MB (114.63 million rows/s., 8.33 GB/s.)

With only threshold prewhere optimization - use_top_n_dynamic_filtering=1
0 rows in set. Elapsed: 0.115 sec. Processed 100.00 million rows, 1.03 GB (873.17 million rows/s., 9.00 GB/s.)

With both optimizations ON
0 rows in set. Elapsed: 0.111 sec. Processed 11.44 million rows, 868.89 MB (103.06 million rows/s., 7.83 GB/s.)

Select top 1000 with URL like '%google%' - Not so good

Base
0 rows in set. Elapsed: 1.018 sec. Processed 100.00 million rows, 21.61 GB (98.26 million rows/s., 21.23 GB/s.)

With only skip index optimization - use_skip_indexes_for_top_n=1
0 rows in set. Elapsed: 1.033 sec. Processed 100.00 million rows, 21.61 GB (96.78 million rows/s., 20.91 GB/s.)

With only threshold prewhere optimization - use_top_n_dynamic_filtering=1
0 rows in set. Elapsed: 2.106 sec. Processed 100.00 million rows, 47.32 GB (47.48 million rows/s., 22.46 GB/s.)

With both optimizations ON
0 rows in set. Elapsed: 2.087 sec. Processed 78.22 million rows, 48.15 GB (37.49 million rows/s., 23.08 GB/s.)

With LIMIT 1000, the partial sorting threads do not accumulate a block of 1000 rows after the WHERE clause, so either no threshold is identified or it is identified only very late in query execution.
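The effect can be illustrated with a toy model (illustrative Python, not the PR's code): a sorting thread can only publish a threshold once it has accumulated LIMIT filtered rows, so a highly selective WHERE pushes that point far into the scan.

```python
def blocks_until_threshold(block_match_counts, limit):
    """Index of the first block after which the cumulative number of
    rows surviving the WHERE clause reaches `limit` (i.e. a threshold
    can first be published), or None if it never happens."""
    seen = 0
    for i, matches in enumerate(block_match_counts):
        seen += matches
        if seen >= limit:
            return i
    return None
```

With, say, 16 matching rows per block, LIMIT 10 yields a threshold after the very first block, while LIMIT 1000 needs 63 blocks, during which no granule can be skipped.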

Select top 1000 with URL like '%yandex%' - this is where the optimizations make an impact!

Base
0 rows in set. Elapsed: 3.068 sec. Processed 100.00 million rows, 61.17 GB (32.60 million rows/s., 19.94 GB/s.)

With only skip index optimization - use_skip_indexes_for_top_n=1
0 rows in set. Elapsed: 0.299 sec. Processed 5.46 million rows, 3.21 GB (18.26 million rows/s., 10.73 GB/s.)

With only threshold prewhere optimization - use_top_n_dynamic_filtering=1
0 rows in set. Elapsed: 0.261 sec. Processed 100.00 million rows, 3.29 GB (383.58 million rows/s., 12.61 GB/s.)

With both optimizations ON
0 rows in set. Elapsed: 0.259 sec. Processed 5.37 million rows, 2.95 GB (20.72 million rows/s., 11.40 GB/s.)

Select top 10000 with URL like '%yandex%' - good impact even with LIMIT 10000

Base
0 rows in set. Elapsed: 3.425 sec. Processed 100.00 million rows, 61.17 GB (29.20 million rows/s., 17.86 GB/s.)

With only skip index optimization - use_skip_indexes_for_top_n=1
0 rows in set. Elapsed: 0.485 sec. Processed 7.10 million rows, 4.13 GB (14.66 million rows/s., 8.53 GB/s.)

With only threshold prewhere optimization - use_top_n_dynamic_filtering=1
0 rows in set. Elapsed: 0.451 sec. Processed 100.00 million rows, 4.29 GB (221.87 million rows/s., 9.51 GB/s.)

With both optimizations ON
0 rows in set. Elapsed: 0.381 sec. Processed 7.02 million rows, 4.02 GB (18.45 million rows/s., 10.56 GB/s.)

Just to complete - select top 10 with URL like '%yandex%':

Base: 0.515 sec
With use_skip_indexes_for_top_n = ON: 0.089 sec
With use_top_n_dynamic_filtering = ON: 0.093 sec
With both ON: 0.085 sec

I am experimenting a bit more.

@EmeraldShift
Contributor

Great results! 😁

Quick question: does this optimization support LIMIT BY? If it supports early return for a query like

SELECT trace_id
ORDER BY time DESC
LIMIT 1 BY trace_id
LIMIT 10

then this probably closes #75098!

@EmeraldShift
Contributor

(and #65990)

Contributor

Copilot AI left a comment


Pull Request Overview

This PR implements optimizations for ORDER BY ... LIMIT N queries (top-N queries) by leveraging skip indexes and dynamic threshold filtering to reduce the number of rows processed. The implementation introduces three main optimization strategies: (1) using minmax skip indexes to pre-select qualifying granules when no predicates exist, (2) dynamic threshold filtering during query execution using minmax indexes, and (3) PREWHERE-based filtering using dynamically computed thresholds.

Key changes include:

  • New TopNThresholdTracker component for thread-safe threshold value sharing across query execution components
  • Extension of minmax index functionality to support bulk granule operations and top-N filtering
  • Query plan optimization to detect and configure top-N scenarios
  • New internal function __topNFilter for PREWHERE-based filtering
  • Two new settings: use_skip_indexes_for_top_n and use_top_n_dynamic_filtering

Reviewed Changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 18 comments.

Show a summary per file
  • src/Processors/TopNThresholdTracker.h: New thread-safe component for tracking and sharing top-N threshold values across optimization components
  • src/Functions/FunctionTopNFilter.cpp: New internal function for dynamic PREWHERE filtering based on evolving top-N thresholds
  • src/Storages/MergeTree/MergeTreeIndexMinMax.{h,cpp}: Extended to support bulk granule reading and top-N mark selection
  • src/Storages/MergeTree/MergeTreeIndexReadResultPool.{h,cpp}: Modified to support minmax index for top-N filtering and threshold tracking
  • src/Storages/MergeTree/MergeTreeReaderIndex.cpp: Enhanced mark skipping logic to use dynamic threshold filtering
  • src/Storages/MergeTree/MergeTreeDataSelectExecutor.{h,cpp}: Added top-N filtering using minmax indexes during part selection
  • src/Processors/Transforms/PartialSortingTransform.{h,cpp}: Integrated threshold tracker to publish top-N values during sorting
  • src/Processors/Transforms/MergeSortingTransform.{h,cpp}: Added threshold tracking and adaptive remerge logic for top-N optimization
  • src/Processors/QueryPlan/SortingStep.{h,cpp}: Plumbed threshold tracker through sorting pipeline
  • src/Processors/QueryPlan/ReadFromMergeTree.{h,cpp}: Added top-N filter info and skip index availability checking
  • src/Processors/QueryPlan/Optimizations/optimizeTopN.cpp: New optimization pass to detect and configure top-N query patterns
  • src/Processors/QueryPlan/Optimizations/QueryPlanOptimizationSettings.{h,cpp}: Added settings for controlling top-N optimizations
  • src/Processors/QueryPlan/Optimizations/Optimizations.h: Registered new top-N optimization in the optimization list
  • src/Processors/QueryPlan/Optimizations/optimizeTree.cpp: Propagated new settings through optimization passes
  • src/Core/Settings.cpp: Added documentation for new top-N optimization settings
  • src/Core/SettingsChangesHistory.cpp: Recorded new settings in version history
  • src/Functions/CMakeLists.txt: Added FunctionTopNFilter to build
  • tests/queries/0_stateless/03711_top_n_by_skip_index.{sql,reference}: Test cases for top-N optimization with skip indexes
  • tests/performance/top_n_optimization_hits.xml: Performance benchmark for top-N optimization

   && remerge_is_useful
   && max_bytes_before_remerge
-  && sum_bytes_in_blocks > max_bytes_before_remerge)
+  && sum_bytes_in_blocks > max_bytes_before_remerge) || (threshold_tracker && (sum_rows_in_blocks > limit * 1.5)))

Copilot AI Nov 16, 2025


Magic number: The hardcoded value 1.5 should be extracted as a named constant (e.g., threshold_tracker_remerge_factor) to improve code readability and maintainability.

@shankar-iyer
Member Author

Great results! 😁

Quick question: does this optimization support LIMIT BY? If it supports early return for a query like

SELECT trace_id
ORDER BY time DESC
LIMIT 1 BY trace_id
LIMIT 10

then this probably closes #75098!

LIMIT BY and LIMIT ... WITH TIES look infeasible.

I need to check the implementation of LIMIT n OFFSET m, and I am hoping to support multi-column sort, e.g. ORDER BY date, status (just use the first column as the threshold and tweak the threshold comparison from < to <=).
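The multi-column tweak mentioned above can be made concrete with a small sketch (illustrative Python, not the PR's code): for an ascending sort, a row equal to the threshold on the leading column may still win on later columns, so the keep-condition relaxes from < to <=, and the granule-skip test tightens accordingly.

```python
def row_may_improve(first_key, threshold, multi_column_sort):
    """Ascending sort. Single-column: only rows strictly below the current
    Nth value can displace a result. Multi-column (e.g. ORDER BY date, status):
    ties on the leading column may still win on later columns, so use <=."""
    return first_key <= threshold if multi_column_sort else first_key < threshold

def can_skip_granule(granule_min, threshold, multi_column_sort):
    # Skip only if even the smallest leading key in the granule cannot improve.
    return not row_may_improve(granule_min, threshold, multi_column_sort)
```

Note the asymmetry: with a single-column sort a granule whose minimum equals the threshold is safely skippable, but with a multi-column sort it must still be read.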

@EmeraldShift
Contributor

EmeraldShift commented Nov 17, 2025

I'm curious about what makes LIMIT BY (even when combined with an ordinary LIMIT) infeasible. I would imagine the LIMIT BY just acts as a filter to avoid collecting rows when there are already K results with the same limit key, unless the ORDER BY value is better. Once we have filtered and collected N results for LIMIT K BY ... LIMIT N, isn't the Nth result's ORDER BY key still a valid threshold to filter the remaining parts by? Or maybe LIMIT BY is actually implemented differently in ClickHouse, or I'm missing something subtle about how it works?
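The argument can be pictured with a toy model (illustrative Python that ignores how ClickHouse actually implements LIMIT BY; ascending order for simplicity): keep at most K best values per key, and once N survivors exist, the Nth survivor is a candidate threshold, since any later row that sorts worse can neither enter its key's top-K nor the global top-N.

```python
import heapq
from collections import defaultdict

def limit_by_threshold(rows, k, n):
    """rows: (key, order_value) pairs for LIMIT k BY key ... LIMIT n,
    ascending order. Returns the n-th best surviving order_value, or
    None if fewer than n rows survive the per-key cap so far."""
    per_key = defaultdict(list)  # key -> max-heap (negated) of its k best values
    for key, value in rows:
        h = per_key[key]
        if len(h) < k:
            heapq.heappush(h, -value)
        elif value < -h[0]:
            heapq.heapreplace(h, -value)
    survivors = sorted(-v for h in per_key.values() for v in h)
    return survivors[n - 1] if len(survivors) >= n else None
```

This is only the commenter's proposed semantics written out, not a claim about what the PR supports.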

@EmeraldShift
Contributor

Another random thought for the google test with LIMIT 1000: as you point out, no single partial sorting thread reaches a block of 1000, but maybe it's still possible to derive a looser threshold, if every thread reports its threshold, and the number of rows it has collected?

For example, three threads solving ORDER BY N DESC may report their Nth threshold value and counts like this:

thread 1: Nth=123, count=300
thread 2: Nth=120, count=250
thread 3: Nth=60, count=1250

Given this data, we know there are at least 300 rows with N >= 123, and 250 + 300 = 550 rows with N >= 120, so if the query had e.g. LIMIT 500 we could start using N > 120 as the PREWHERE and minmax filter.

In general, I suppose we can scan the threads in threshold order (in this case, following N DESC), and proceed until the sum of all counts seen so far is >= the LIMIT. If one thread already has enough rows, the algorithm will just pick its threshold. Otherwise, we'll find some slightly worse but still non-trivial threshold if all the threads combined can exceed the limit.
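This merging rule is straightforward to write down (illustrative Python, not part of the PR): sort the per-thread (threshold, count) reports in threshold order and take the first threshold at which the cumulative count covers the LIMIT.

```python
def combined_threshold(reports, limit):
    """reports: per-thread (nth_value, row_count) for ORDER BY N DESC.
    Walking from the highest threshold down, the first point where the
    cumulative count reaches `limit` yields a safe, if looser, global
    threshold (all counted rows are >= that value)."""
    total = 0
    for nth_value, count in sorted(reports, reverse=True):
        total += count
        if total >= limit:
            return nth_value
    return None  # even all threads combined have fewer than `limit` rows
```

For the example reports above, LIMIT 500 yields threshold 120, and LIMIT 300 would yield 123.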

@EmeraldShift
Contributor

Some more results from further testing:

  • I was able to get around this not working for Distributed tables by setting distributed_group_by_no_merge=1 to force distributed_push_down_limit to work (which I feel is a bug in distributed_push_down_limit since there is no GROUP BY, but anyway..)
  • I received a segmentation fault on one of the tables:
[host] 2025.12.01 08:56:58.608451 [ 854034 ] <Fatal> BaseDaemon: ########################################
[host] 2025.12.01 08:56:58.610126 [ 854034 ] <Fatal> BaseDaemon: (version 25.12.1.198, build id: 28D24C4F0FAB325F967BA4179CF4E169F0EFD7DA, git hash: c0e760d86866da5604808fc264ad6c85cf8ef318) (from thread 854868) (query_id: a4189ef9-1e18-4c65-8804-e6c93bd10ba9) (query: select * from repl.spans where system = 'AAAA'  order by start_time desc limit 100 settings use_query_condition_cache=0, use_skip_indexes_on_data_read, use_skip_indexes_for_top_n=1) Received signal Segmentation fault (11)
[host] 2025.12.01 08:56:58.610136 [ 854034 ] <Fatal> BaseDaemon: Address: 0x2b. Access: read. Address not mapped to object.
[host] 2025.12.01 08:56:58.610145 [ 854034 ] <Fatal> BaseDaemon: Stack trace: 0x00000000155adefb 0x00000000155ae318 0x000000001b35ff7c 0x000000001b30617d 0x000000001b33a3b5 0x000000001b338879 0x000000001c01dd4c 0x000000001b347338 0x000000001b34925b 0x000000001c05c7d2 0x000000001ba53cb6 0x000000001ba70b57 0x000000001ba6311b 0x000000001ba67143 0x00000000151a428b 0x00000000151ab1a6 0x00000000151a141f 0x00000000151a8ada 0x00007ffff70081ea 0x00007ffff6c399d3
[host] 2025.12.01 08:56:58.612274 [ 854034 ] <Fatal> BaseDaemon: 2. void std::__hash_table<std::__hash_value_type<unsigned long, unsigned long>, std::__unordered_map_hasher<unsigned long, std::pair<unsigned long const, unsigned long>, std::hash<unsigned long>, std::equal_to<unsigned long>, true>, std::__unordered_map_equal<unsigned long, std::pair<unsigned long const, unsigned long>, std::equal_to<unsigned long>, std::hash<unsigned long>, true>, std::allocator<std::pair<unsigned long const, unsigned long>>>::__do_rehash<true>(unsigned long) @ 0x00000000155adefb
[host] 2025.12.01 08:56:58.612939 [ 854034 ] <Fatal> BaseDaemon: 3. std::pair<std::__hash_iterator<std::__hash_node<std::__hash_value_type<unsigned long, unsigned long>, void*>*>, bool> std::__hash_table<std::__hash_value_type<unsigned long, unsigned long>, std::__unordered_map_hasher<unsigned long, std::pair<unsigned long const, unsigned long>, std::hash<unsigned long>, std::equal_to<unsigned long>, true>, std::__unordered_map_equal<unsigned long, std::pair<unsigned long const, unsigned long>, std::equal_to<unsigned long>, std::hash<unsigned long>, true>, std::allocator<std::pair<unsigned long const, unsigned long>>>::__emplace_unique_key_args<unsigned long, std::piecewise_construct_t const&, std::tuple<unsigned long const&>, std::tuple<>>(unsigned long const&, std::piecewise_construct_t const&, std::tuple<unsigned long const&>&&, std::tuple<>&&) @ 0x00000000155ae318
[host] 2025.12.01 08:56:58.612955 [ 854034 ] <Fatal> BaseDaemon: 4. DB::MergeTreeReaderIndex::canSkipMark(unsigned long, unsigned long) @ 0x000000001b35ff7c
[host] 2025.12.01 08:56:58.612971 [ 854034 ] <Fatal> BaseDaemon: 5. DB::MergeTreeRangeReader::startReadingChain(unsigned long, DB::MarkRanges&) @ 0x000000001b30617d
[host] 2025.12.01 08:56:58.613586 [ 854034 ] <Fatal> BaseDaemon: 6. DB::MergeTreeReadersChain::read(unsigned long, DB::MarkRanges&, std::vector<DB::MarkRanges, std::allocator<DB::MarkRanges>>&) @ 0x000000001b33a3b5
[host] 2025.12.01 08:56:58.614260 [ 854034 ] <Fatal> BaseDaemon: 7. DB::MergeTreeReadTask::read() @ 0x000000001b338879
[host] 2025.12.01 08:56:58.614276 [ 854034 ] <Fatal> BaseDaemon: 8. DB::MergeTreeThreadSelectAlgorithm::readFromTask(DB::MergeTreeReadTask&) @ 0x000000001c01dd4c
[host] 2025.12.01 08:56:58.614286 [ 854034 ] <Fatal> BaseDaemon: 9. DB::MergeTreeSelectProcessor::readCurrentTask(DB::MergeTreeReadTask&, DB::IMergeTreeSelectAlgorithm&) const @ 0x000000001b347338
[host] 2025.12.01 08:56:58.614294 [ 854034 ] <Fatal> BaseDaemon: 10. DB::MergeTreeSelectProcessor::read() @ 0x000000001b34925b
[host] 2025.12.01 08:56:58.614320 [ 854034 ] <Fatal> BaseDaemon: 11. DB::MergeTreeSource::tryGenerate() @ 0x000000001c05c7d2
[host] 2025.12.01 08:56:58.614332 [ 854034 ] <Fatal> BaseDaemon: 12. DB::ISource::work() @ 0x000000001ba53cb6
[host] 2025.12.01 08:56:58.614344 [ 854034 ] <Fatal> BaseDaemon: 13. DB::ExecutionThreadContext::executeTask() @ 0x000000001ba70b57
[host] 2025.12.01 08:56:58.614968 [ 854034 ] <Fatal> BaseDaemon: 14. DB::PipelineExecutor::executeStepImpl(unsigned long, DB::IAcquiredSlot*, std::atomic<bool>*) @ 0x000000001ba6311b
[host] 2025.12.01 08:56:58.615582 [ 854034 ] <Fatal> BaseDaemon: 15. void std::__function::__policy_func<void ()>::__call_func[abi:ne210105]<DB::PipelineExecutor::spawnThreads(std::shared_ptr<DB::IAcquiredSlot>)::$_0>(std::__function::__policy_storage const*) @ 0x000000001ba67143
[host] 2025.12.01 08:56:58.615596 [ 854034 ] <Fatal> BaseDaemon: 16. ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool::worker() @ 0x00000000151a428b
[host] 2025.12.01 08:56:58.615614 [ 854034 ] <Fatal> BaseDaemon: 17. void std::__function::__policy_func<void ()>::__call_func[abi:ne210105]<ThreadFromGlobalPoolImpl<false, true>::ThreadFromGlobalPoolImpl<void (ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool::*)(), ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool*>(void (ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool::*&&)(), ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool*&&)::'lambda'()>(std::__function::__policy_storage const*) @ 0x00000000151ab1a6
[host] 2025.12.01 08:56:58.615623 [ 854034 ] <Fatal> BaseDaemon: 18. ThreadPoolImpl<std::thread>::ThreadFromThreadPool::worker() @ 0x00000000151a141f
[host] 2025.12.01 08:56:58.616246 [ 854034 ] <Fatal> BaseDaemon: 19. void* std::__thread_proxy[abi:ne210105]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void (ThreadPoolImpl<std::thread>::ThreadFromThreadPool::*)(), ThreadPoolImpl<std::thread>::ThreadFromThreadPool*>>(void*) @ 0x00000000151a8ada
[host] 2025.12.01 08:56:58.616869 [ 854034 ] <Fatal> BaseDaemon: 20. Hdfs::Internal::GetSystemErrorInfo(int)::message (.llvm.18351245492790364349) @ 0x00000000000081ea
[host] 2025.12.01 08:56:58.617006 [ 854034 ] <Fatal> BaseDaemon: 21. clone @ 0x00000000000399d3
[host] 2025.12.01 08:56:58.617015 [ 854034 ] <Fatal> BaseDaemon: Integrity check of the executable skipped because the reference checksum could not be read.
[host] 2025.12.01 08:57:08.629311 [ 854034 ] <Fatal> BaseDaemon: Changed settings: use_skip_indexes_on_data_read = true, use_skip_indexes_for_top_n = true, use_query_condition_cache = false

@shankar-iyer
Member Author

shankar-iyer commented Dec 2, 2025

Some more results from further testing:

  • I was able to get around this not working for Distributed tables by setting distributed_group_by_no_merge=1 to force distributed_push_down_limit to work (which I feel is a bug in distributed_push_down_limit since there is no GROUP BY, but anyway..)
  • I received a segmentation fault on one of the tables: (full stack trace quoted above)
@EmeraldShift By any chance, were you testing without commit 9cd57bf?

Is the skip index defined on the ORDER BY column but not yet materialized (on some parts)?

@EmeraldShift
Contributor

The testing is on commit c0e760d86866, which is a merge of 2cf191a784b6 and master, which does contain 9cd57bfe439d.

Here's the table schema:

CREATE TABLE repl.spans
(
    `environment` LowCardinality(String) CODEC(ZSTD(1)),
    `system` LowCardinality(String) CODEC(ZSTD(1)),
    `service` LowCardinality(String) CODEC(ZSTD(1)),
    `task` LowCardinality(String) CODEC(ZSTD(1)),
    `task_process` LowCardinality(String) CODEC(ZSTD(1)),
    `host` LowCardinality(String) CODEC(ZSTD(1)),
    `host_location` LowCardinality(String) CODEC(ZSTD(1)),
    `host_region` LowCardinality(String) CODEC(ZSTD(1)),
    `user` LowCardinality(String) CODEC(ZSTD(1)),
    `name` LowCardinality(String) CODEC(ZSTD(1)),
    `span_id` FixedString(8) CODEC(NONE),
    `trace_id` FixedString(16) CODEC(ZSTD(1)),
    `parent_id` FixedString(8) CODEC(ZSTD(1)),
    `start_time` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
    `end_time` DateTime64(9) CODEC(Delta(8), ZSTD(1)),
    `duration` UInt64 CODEC(ZSTD(1)),
    `logs.timestamps` Array(DateTime64(9)) CODEC(ZSTD(1)),
    `logs.labels.names` Array(Array(LowCardinality(String))) CODEC(ZSTD(1)),
    `logs.labels.values` Array(Array(String)) CODEC(ZSTD(1)),
    `references.trace_ids` Array(FixedString(16)) CODEC(ZSTD(1)),
    `references.span_ids` Array(FixedString(8)) CODEC(ZSTD(1)),
    `references.ref_types` Array(Enum8('CHILD_OF' = 0, 'FOLLOWS_FROM' = 1)) CODEC(ZSTD(1)),
    `labels.names` Array(LowCardinality(String)) CODEC(ZSTD(1)),
    `labels.values` Array(String) CODEC(ZSTD(1)),
    `warnings` Array(String) CODEC(ZSTD(1)),
    `status` Enum8('UNSET' = 0, 'OK' = 1, 'ERROR' = 2) CODEC(ZSTD(1)),
    `label_names` LowCardinality(String) MATERIALIZED arrayStringConcat(arraySort(labels.names), '|'),
    `label_names2` LowCardinality(String) MATERIALIZED concat('|', arrayStringConcat(arraySort(labels.names), '|'), '|'),
    `label_values` String MATERIALIZED concat('|', arrayStringConcat(arraySort(labels.values), '|'), '|'),
    INDEX ix_bf_trace_id trace_id TYPE bloom_filter(0.001) GRANULARITY 1,
    INDEX ix_sbf_labels_values labels.values TYPE sparse_grams(3, 100, 8192, 2, 0) GRANULARITY 1,
    INDEX ix_mm_start_time start_time TYPE minmax GRANULARITY 1,
    INDEX ix_mm_start_time_64 start_time TYPE minmax GRANULARITY 64,
    PROJECTION by_trace_id
    (
        SELECT _part_offset
        ORDER BY trace_id
    ),
    PROJECTION by_duration
    (
        SELECT _part_offset
        ORDER BY duration
    )
)
ENGINE = ReplicatedMergeTree
PARTITION BY toStartOfHour(start_time)
ORDER BY (environment, system, service, task, name, start_time)
SETTINGS index_granularity = 8192, ttl_only_drop_parts = 1
  • As far as I can tell (my MATERIALIZE INDEX mutation is done), the index should be in all parts. Is there a way to check for sure whether the index is fully materialized?
  • I do have start_time in my ORDER BY, and also I have two skip indexes over it with different granularity. Do I understand correctly that this feature will try to use one of them? Is there a log that can tell me which one?

@EmeraldShift
Contributor

I'll re-test my query on the granularity = 1 commit later today

@EmeraldShift
Contributor

I can no longer reproduce the segmentation fault on the latest commit 😁

@shankar-iyer
Member Author

I can no longer reproduce the segmentation fault on the latest commit 😁

Thanks!

@EmeraldShift
Contributor

Would it be possible to also leverage the primary ORDER BY key if the data is [partially] sorted by the query's ORDER BY? Some random thoughts:

  1. If the primary key contains marks for the query's ORDER BY column, can we efficiently read those marks from memory instead of reading a skip index from disk?
  2. If there are long sorted runs of marks, can we perform the filter against larger groups of marks at a time, similar to generic exclusion search? For example, table ORDER BY (a, b) could have marks (1, 1), (1, 3), (1, 5), (1, 7) and a threshold of b < 2 can skip all of the middle marks. Maybe it is faster to find whole ranges to read?
  3. If the table is sorted by (a, b, c) and the query uses ORDER BY c DESC, and the topK optimization is on, should we read InReverseOrder? Then the first granules read will fill the threshold early with good values, and we can skip more data. Currently it reads either per-thread (unordered) or InOrder, which is sub-optimal, especially when max_threads=1.

This feature looks great as-is, and is already a huge improvement! I'm mostly asking if this sounds like reasonable follow-on work to continue optimizing this shape of query, which is extremely common in my logging use case! 😁
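The generic-exclusion idea in point 2 above could be sketched roughly like this (a hypothetical illustration, not ClickHouse code; `mark_ranges_to_read` and the tuple layout are invented for the example). Within a run of marks where the leading key column `a` is constant, `b` is sorted, so a granule can be excluded once its `b` range is known to lie above the current top-N threshold:

```python
def mark_ranges_to_read(marks, threshold):
    """marks: list of (a, b) tuples at granule boundaries; granule i covers
    rows between marks[i] and marks[i + 1]. Keep granule i if it could still
    contain a row with b < threshold (ascending ORDER BY b LIMIT N)."""
    keep = []
    for i in range(len(marks) - 1):
        lo = marks[i][1]
        # Within a run of equal `a`, rows in granule i have b in [lo, hi];
        # across a run boundary only the lower bound is known, so be conservative.
        same_run = marks[i][0] == marks[i + 1][0]
        if lo < threshold or not same_run:
            keep.append(i)
    # Coalesce adjacent granules into contiguous ranges, similar to MarkRanges.
    ranges = []
    for i in keep:
        if ranges and ranges[-1][1] == i:
            ranges[-1] = (ranges[-1][0], i + 1)
        else:
            ranges.append((i, i + 1))
    return ranges

# The example from point 2: marks (1,1), (1,3), (1,5), (1,7) and threshold b < 2
# skip the middle granules; the last granule is kept because `a` changes there.
marks = [(1, 1), (1, 3), (1, 5), (1, 7), (2, 0)]
print(mark_ranges_to_read(marks, threshold=2))  # → [(0, 1), (3, 4)]
```

Whether scanning marks linearly like this beats a binary search for run boundaries would depend on how long the sorted runs are in practice.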

@shankar-iyer shankar-iyer added this pull request to the merge queue Dec 7, 2025
Merged via the queue into ClickHouse:master with commit 01e3689 Dec 7, 2025
124 of 130 checks passed
@shankar-iyer shankar-iyer deleted the top_n_optimization_by_skip_index_and_prewhere branch December 7, 2025 16:50
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Dec 7, 2025
@alexey-milovidov alexey-milovidov added the 🎅 🎁 gift🎄 To make people wonder label Dec 7, 2025
UInt64 limit_,
bool skip_partial_sort = false);
bool skip_partial_sort = false,
TopKThresholdTrackerPtr threshold_tracker = nullptr);
Member


@shankar-iyer Passing TopKThresholdTrackerPtr breaks the ability to serialize/deserialize SortingStep and makes this feature incompatible with distributed queries. A plan step must be a "description" of what is done that can easily be passed over the network to a different node; only the initializePipeline/updatePipeline methods should convert this description into the runtime structures that implement the actual processing logic.
Maybe you can use an approach similar to what is done for join runtime filters: they are "registered" in the query context under unique names and looked up by that name at runtime: https://github.com/ClickHouse/ClickHouse/blob/master/src/Interpreters/Context.h#L283
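The register-by-name pattern suggested here can be illustrated with a small sketch (all names are invented for the example; the real ClickHouse types are C++ and live in Context/SortingStep). The plan step carries only a serializable string key; the runtime object stays in the query context and is resolved when the pipeline is built:

```python
import threading
import uuid

class QueryContext:
    """Holds runtime objects keyed by name, so plan steps stay serializable."""
    def __init__(self):
        self._lock = threading.Lock()
        self._threshold_trackers = {}

    def register_threshold_tracker(self, tracker):
        key = f"top_n_threshold_{uuid.uuid4().hex}"
        with self._lock:
            self._threshold_trackers[key] = tracker
        return key  # only this string goes into the serialized plan step

    def find_threshold_tracker(self, key):
        with self._lock:
            return self._threshold_trackers.get(key)

class SortingStep:
    """Stores the tracker's name, not the tracker itself."""
    def __init__(self, tracker_name):
        self.tracker_name = tracker_name  # plain string: trivially serializable

    def initialize_pipeline(self, context):
        # The runtime structure is looked up here, never shipped over the network.
        return context.find_threshold_tracker(self.tracker_name)

ctx = QueryContext()
tracker = object()  # stand-in for a TopKThresholdTracker
step = SortingStep(ctx.register_threshold_tracker(tracker))
assert step.initialize_pipeline(ctx) is tracker
```

On a remote node the lookup would simply miss (return nothing), so the step degrades to plain sorting there, which is the behavior you would want for a purely local runtime filter.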

Member Author


@davenger Thanks for the review and spotting this! Let me check and I will update accordingly.


Labels

🎅 🎁 gift🎄 To make people wonder pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo
