
Support rewrite to optimize order by limit#82478

Closed
acking-you wants to merge 11 commits into ClickHouse:master from acking-you:support_order_by_limit_rewrite

Conversation

@acking-you
Contributor

@acking-you acking-you commented Jun 24, 2025

Summary

Relevant issue: #79645
This PR optimizes `ORDER BY ... LIMIT` by rewriting the query into a `WHERE (...) IN (subquery)` form, described as follows:

For SQL statements with a large LIMIT value, such as order by limit 100000, significant performance improvements can be achieved. Below are the test results on the hits dataset, showing a performance increase of nearly 100 times compared to ColumnLazy.

The test version is 25.7.1.1, and all results below are hot queries (each query was executed three times and the best result taken):

-- original query cannot proceed normally (memory usage exceeds the limit)
SELECT * FROM hits ORDER BY EventTime LIMIT 100000;

-- enable ColumnLazy
set query_plan_max_limit_for_lazy_materialization=200000;
SELECT * FROM hits ORDER BY EventTime LIMIT 100000;

100000 rows in set. Elapsed: 71.017 sec. Processed 78.00 million rows, 935.97 MB (1.10 million rows/s., 13.18 MB/s.)
Peak memory usage: 855.88 MiB.


-- rewrite order by limit
SELECT * FROM hits WHERE (_part_starting_offset + _part_offset) IN (SELECT _part_starting_offset + _part_offset FROM hits ORDER BY EventTime LIMIT 100000) ORDER BY EventTime;

100000 rows in set. Elapsed: 0.784 sec. Processed 81.78 million rows, 3.23 GB (104.32 million rows/s., 4.12 GB/s.)
Peak memory usage: 776.61 MiB.

The performance comparison is as follows:

| Method | Description | Execution Time | Speedup Factor |
| --- | --- | --- | --- |
| Original query (fails) | `SELECT * FROM hits ORDER BY EventTime LIMIT 100000` | ❌ Fails (memory limit exceeded) | - |
| Lazy materialization enabled | `SET query_plan_max_limit...` + original query | 71.017 seconds | 1.0x (baseline) |
| Optimized rewritten query | `SELECT * FROM hits WHERE ... IN (subquery)` | 0.784 seconds | 90.6x faster (71.017 ÷ 0.784) |

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

  • Added RewriteOrderByLimitPass in Analyzer for pattern recognition and rewriting
  • Added configuration parameters:
    • query_plan_rewrite_order_by_limit
    • query_plan_max_limit_for_rewrite_order_by_limit
    • query_plan_min_columns_to_use_rewrite_order_by_limit
  • Disabled by default (incompatible with distributed tables)
  • Default limit: 1 million rows (matching DuckDB's implementation)
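A hedged sketch of how the new settings listed above might be enabled in a session (the setting names come from this changelog entry; the values and exact usage are illustrative, not verified documentation):

```sql
-- Enable the ORDER BY ... LIMIT rewrite (disabled by default):
SET query_plan_rewrite_order_by_limit = 1;
-- Only apply the rewrite when LIMIT does not exceed this threshold
-- (default per the changelog: 1 million rows):
SET query_plan_max_limit_for_rewrite_order_by_limit = 1000000;

SELECT * FROM hits ORDER BY EventTime LIMIT 100000;
```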

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Add the capability to rewrite "order by limit" into a subquery based on rowid.

Known Issues

This optimization is only applicable to non-distributed tables for the following reasons:

  • Using (_part_starting_offset + _part_offset) or (_part, _part_offset) to identify a row is only valid for non-distributed tables.
  • A stable global snapshot is required (possibly already resolved by PR).

Possible solutions are:

  1. When querying distributed tables, manually rewrite the local plan during createLocalPlan.
  2. Add some virtual columns (e.g., ip, shard_name) to identify a row in the case of distributed tables.
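As an illustration of the single-node row identity that the first point relies on (a sketch; the `hits` table and the check itself are assumptions for illustration, not from the PR):

```sql
-- On a single MergeTree table, _part_starting_offset + _part_offset
-- forms a unique global row id, so this check returns 1:
SELECT count() = uniqExact(_part_starting_offset + _part_offset)
FROM hits;

-- On a Distributed table the same expression can collide: two different
-- rows on two shards may share the same local offset sum, which is why
-- the rewrite is restricted to non-distributed tables.
```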

@acking-you acking-you force-pushed the support_order_by_limit_rewrite branch from 7c4a818 to db759d5 Compare June 24, 2025 08:09
@novikd novikd self-assigned this Jun 24, 2025
@novikd
Member

novikd commented Jun 24, 2025

Could you please provide a test for it?

@acking-you
Contributor Author

Could you please provide a test for it?

Okay, I will do it later.

@UnamedRus
Contributor

Another alternative is to use _part_granule_offset #82341.
In any case you need to read most of the rows from a specific granule, but this would reduce the size of the IN statement for big LIMIT values and reduce the cost of the index lookup.

Also, it makes sense to wrap it in the indexHint function, so it is only used for the index lookup.

@acking-you
Contributor Author

Another alternative is to use _part_granule_offset #82341. In any case you need to read most of the rows from a specific granule, but this would reduce the size of the IN statement for big LIMIT values and reduce the cost of the index lookup.

Also, it makes sense to wrap it in the indexHint function, so it is only used for the index lookup.

Thank you very much for your suggestion!

In fact, this optimization is not only suitable for large data volumes but also performs well with small data volumes (e.g., limit 10).

  1. I tried wrapping the WHERE condition with indexHint, which resulted in a two- to threefold performance drop (tested with LIMIT 10/1000/100000).
  2. The introduction of either _part_granule_offset or indexHint causes the main query to scan more data, which might be the root cause of the performance degradation.

@acking-you
Contributor Author

acking-you commented Jun 24, 2025

Could you please provide a test for it?

done @novikd

@amosbird
Collaborator

Another alternative is to use _part_granule_offset

@UnamedRus I don't think it's feasible. We still need row offsets to retrieve the top-K rows, and index analysis for _part_granule_offset hasn't been implemented, mainly because I believe there are no valid use cases that depend on it, though I might be wrong about this assumption...

@novikd novikd added the "can be tested" (Allows running workflows for external contributors) label Jun 24, 2025
@clickhouse-gh
Contributor

clickhouse-gh bot commented Jun 24, 2025

Workflow [PR], commit [eb995ef]

Summary:

  • Stateless tests (amd_binary, ParallelReplicas, s3 storage): failure
    • 02233_HTTP_ranged: FAIL
  • Integration tests (asan, old analyzer, 4/6): failure
    • Job Timeout Expired: FAIL
  • Stress test (amd_ubsan): failure
    • Server died: FAIL
    • Hung check failed, possible deadlock found (see hung_check.log): FAIL
    • Killed by signal (in clickhouse-server.log): FAIL
    • Fatal message in clickhouse-server.log (see fatal_messages.txt): FAIL
    • Killed by signal (output files): FAIL
    • Found signal in gdb.log: FAIL

@EmeraldShift
Contributor

index analysis for _part_granule_offset hasn't been implemented — mainly because I believe there are no valid use cases that depend on it

Could it be useful in an inverted projection index to efficiently store granules to check? Maybe there are cases where there would be too much data if you store the individual offsets, and you're willing to spend more time re-checking the indexed predicate upon reading the data rather than getting it from the index? Something like

PROJECTION prj (SELECT groupBitmap(_part_granule_offset) GROUP BY field ORDER BY field)

where field correlates strongly with granules?

@UnamedRus
Contributor

UnamedRus commented Jun 24, 2025

We still need row offsets to retrieve the Top K rows,

It's more like: if we are going to read most of a granule anyway, wouldn't it be simpler to read it as a whole and just sort more rows?
Plus, data is usually time-sorted, so all the interesting timestamps and rows will probably reside in the same granules, so the actual overhead of sorting (all the data from those granules) wouldn't be that big.

It's a bit of a stretch, but here is an example where this happens (plus, the second subquery is more "expensive" to run):

SELECT *
FROM hits
WHERE (CounterID = 105857) AND ((_part, _part_offset) IN (
    SELECT
        _part,
        _part_offset
    FROM hits
    WHERE CounterID = 105857
    ORDER BY EventTime DESC
    LIMIT 20000
))
ORDER BY EventTime DESC
LIMIT 20000
FORMAT `Null`

Ok.

0 rows in set. Elapsed: 0.080 sec. Processed 5.82 million rows, 114.52 MB (72.61 million rows/s., 1.43 GB/s.)
Peak memory usage: 41.61 MiB.

SELECT *
FROM hits
WHERE (CounterID = 105857) AND indexHint((_part, _part_offset) IN (
    SELECT *
    FROM
    (
        SELECT
            _part,
            intDiv(_part_offset, 8192) * 8192 AS offset
        FROM hits
        WHERE CounterID = 105857
        ORDER BY EventTime DESC
        LIMIT 20000
    )
    LIMIT 1 BY
        _part,
        offset
))
ORDER BY EventTime DESC
LIMIT 20000
FORMAT `Null`

Ok.

0 rows in set. Elapsed: 0.072 sec. Processed 5.82 million rows, 114.26 MB (80.63 million rows/s., 1.58 GB/s.)
Peak memory usage: 52.32 MiB.

@acking-you acking-you force-pushed the support_order_by_limit_rewrite branch from 75d4f55 to d257ad7 Compare June 25, 2025 03:40
@acking-you
Contributor Author

Now, by manually invoking the rewrite in buildQueryTreeForShard and adding logic to the rewrite that identifies and optimizes only StorageMergeTree, we've successfully enabled this optimization for distributed tables!

@amosbird
Collaborator

It's more like: if we are going to read most of a granule anyway, wouldn't it be simpler to read it as a whole and just sort more rows?

But you'll still need to apply filters again. While there might be edge cases that benefit from this approach, ClickHouse already handles IN sets with around a million elements quite efficiently. I’m skeptical there's a real use case for doing top-N over billions :)

@clickhouse-gh clickhouse-gh bot added the "pr-performance" (Pull request with some performance improvements) label Jun 25, 2025
@amosbird
Collaborator

Could it be useful in an inverted projection index to efficiently store granules to check?

The main benefits of the projection index (row-level index) are twofold:

  • Predicate Filtering – It allows computing a bitmap (row-level filter) directly from the projection index and applying filters at the very beginning of the PREWHERE chain. This has been implemented in PR #81021.

  • Expression Materialization – It enables serving precomputed values for certain expressions directly from the projection index, which avoids both expression evaluation and reading the underlying source columns.

If we only track granules, we can only support the first benefit — and its actual impact can vary significantly depending on workload and data distribution in real-world scenarios.

@acking-you acking-you force-pushed the support_order_by_limit_rewrite branch from c0e9198 to eb995ef Compare June 26, 2025 02:45
@JiaQiTang98
Contributor

An additional question, does it support pushing optimize_read_in_order into projection? Or will it?

@acking-you
Contributor Author

acking-you commented Jul 1, 2025

An additional question, does it support pushing optimize_read_in_order into projection? Or will it?

Great point!

I think that whether to support the pushdown of optimize_read_in_order to projection is unrelated to this PR. I noticed that the current implementation of optimizeUseNormalProjections does not seem to account for this. It appears to only support utilizing simple WHERE conditions in projections.

Here is my testing process:

CREATE TABLE mydata
(
    `A` Int64,
    `B` Int64,
    `C` String,
    PROJECTION p (SELECT A, B, _part_offset ORDER BY B)
)
ENGINE = MergeTree
ORDER BY A
AS SELECT number AS A, 999999 - number AS B, if(number BETWEEN 1000 AND 2000, 'x', toString(number)) AS C
FROM numbers(1000000);

INSERT INTO mydata SELECT * FROM mydata WHERE B > 14 LIMIT 20;
INSERT INTO mydata VALUES (1000001, -100, 'a');

-- the WHERE can be optimized by the projection
SELECT * FROM mydata WHERE _part_offset + _part_starting_offset IN (SELECT _part_offset + _part_starting_offset FROM mydata WHERE B < 10 ORDER BY B LIMIT 10);
10 rows in set. Elapsed: 0.038 sec. Processed 8.77 thousand rows, 88.13 KB (231.96 thousand rows/s., 2.33 MB/s.)
Peak memory usage: 472.89 KiB.

-- optimize_read_in_order can be applied to non-projection reads
SELECT * FROM mydata WHERE _part_offset + _part_starting_offset IN (SELECT _part_offset + _part_starting_offset FROM mydata ORDER BY A LIMIT 10) SETTINGS optimize_read_in_order = 1;
10 rows in set. Elapsed: 0.027 sec. Processed 16.41 thousand rows, 430.40 KB (607.41 thousand rows/s., 15.94 MB/s.)
Peak memory usage: 448.11 KiB.
-- cannot be applied to the projection
SELECT * FROM mydata WHERE _part_offset + _part_starting_offset IN (SELECT _part_offset + _part_starting_offset FROM mydata ORDER BY B LIMIT 10) SETTINGS optimize_read_in_order = 1;

I understand that as long as the original read process in optimizeUseNormalProjections supports pushing down optimize_read_in_order to the projection, this optimization will also benefit.

The potential concern is: will the read process of optimize_read_in_order cause an incorrect representation of row_id (_part_starting_offset + _part_offset)?

@clickhouse-gh
Contributor

clickhouse-gh bot commented Aug 5, 2025

Dear @novikd, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

@EmeraldShift
Contributor

I manually tried this rewrite for a use case where the right-hand side of IN is a very large ~90 million row set. It spent >9 seconds in CreatingSets. Is this an expected performance bottleneck of this approach? I separately commented on it in more detail here, but I don't know if that PR is the right fix.

@acking-you
Contributor Author

It spent >9 seconds in CreatingSets. Is this an expected performance bottleneck of this approach?

Yes, this is expected. But I understand that doing so is similar to splitting a single SQL query into two executions. The second execution already knows exactly which data needs to be scanned, which is very effective in scenarios where a full table scan would otherwise be required, such as the "order by limit" issue resolved in this PR. If you need to avoid the CreatingSets operator, the previous ColumnLazy was a good solution. However, it also requires a secondary table-rewriting operation when the data is actually read, and it is difficult for it to make full use of the existing parallel reading pipeline. I haven't thought of a better solution for now.

@amosbird amosbird mentioned this pull request Sep 8, 2025
@EmeraldShift
Contributor

Does anyone know why this is so much faster than lazy materialization? I'm looking into a logging use case with search queries like this:

SELECT *
FROM distributed_logs_table
WHERE ...
ORDER BY time DESC
LIMIT 500

At a limit of 500, should I expect this feature to still be ~100x faster (!!!) than ColumnLazy? What are the trade-offs?

Also, in the meantime, without this PR merged yet, is there any way to get the correct behavior with a Distributed table by manually rewriting the query, or is it impossible?

@acking-you
Contributor Author

Does anyone know why this is so much faster than lazy materialization?

When I tested before, this gap was mainly because ColumnLazy was unable to read in parallel when performing the final materialized read. I don't know if the latest version of ClickHouse has solved this problem.

What are the trade-offs?

SELECT *
FROM distributed_logs_table
WHERE ...
ORDER BY time DESC
LIMIT 500

I think if the data itself is ordered by time, there should be no advantage to using lazy materialization, nor to rewriting the query. You could try turning off lazy materialization.

@EmeraldShift
Contributor

Now, by modifying the manual invocation of rewrite in buildQueryTreeForShard and implementing logic in rewrite to identify and optimize only for StorageMergeTree, we've successfully enabled this optimization in distributed tables!

I'm curious, can you say more about how this optimization works for Distributed tables? Does it perform a local LIMIT n on each shard, then merge? Or does it do some two-phase thing where it collects a global n-size set of offsets, then fetch the rows? If it's the first one, does it depend on distributed_push_down_limit? (for some reason this pushdown doesn't seem to work for me by default)

@acking-you
Contributor Author

I'm curious, can you say more about how this optimization works for Distributed tables? Does it perform a local LIMIT n on each shard, then merge? does it depend on distributed_push_down_limit? (for some reason this pushdown doesn't seem to work for me by default)

The current implementation of distributed tables is very simple—essentially just a rewrite over each local table. Your guess is correct: it performs a local LIMIT n on each shard, then merges the results. This is because providing global support across distributed tables involves many challenges at the moment. Whether or not distributed_push_down_limit is used depends on whether the rewritten QueryTree on each shard contains a LIMIT clause. I believe this setting is enabled by default.
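A sketch of what the per-shard rewrite described above might produce (illustrative only; the actual rewrite happens on the QueryTree, and `local_hits` is a stand-in name for the shard-local MergeTree table):

```sql
-- On each shard, the ORDER BY ... LIMIT is rewritten against the
-- shard-local table; the initiator then merges the sorted per-shard results:
SELECT *
FROM local_hits
WHERE (_part_starting_offset + _part_offset) IN
(
    SELECT _part_starting_offset + _part_offset
    FROM local_hits
    ORDER BY EventTime
    LIMIT 100000  -- local limit per shard, not a global limit
)
ORDER BY EventTime
LIMIT 100000;
```

Because each shard applies its own LIMIT before the merge, the result is correct for top-N queries but each shard may read up to N rows, which is what the two-phase approach below is meant to improve on.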

Or does it do some two-phase thing where it collects a global n-size set of offsets, then fetch the rows?

We're implementing a two-phase approach to achieve proper global ORDER BY with LIMIT:

Phase 1: Collect globally-sorted RowIDs

-- Rewritten internally
INSERT INTO _rowid_collector()
SELECT _rowid FROM distributed_table 
WHERE condition 
ORDER BY sort_col 
LIMIT 100  -- This is GLOBAL limit

Here, _rowid_collector() is a table function that creates a temporary in-memory storage table.

Phase 2: Fetch exact rows

-- Using collected RowIDs
SELECT cols FROM distributed_table 
WHERE _rowid IN (...100 exact IDs...)

Key difference from per-shard LIMIT: Phase 1 performs global sorting across all shards (not just local per shard), ensuring correct top-100 results.

Bonus: Once RowIDs are stored in memory, we can support asynchronous fetching for simple queries that can use these RowIDs directly (though not for aggregations that would require re-scanning data).

@KochetovNicolai
Member

@acking-you

The improved implementation of lazy materialization in 25.12 shows me about the same speedup

:) SELECT * from test.hits where (_part_starting_offset + _part_offset) IN (SELECT _part_starting_offset+_part_offset FROM test.hits ORDER BY EventTime LIMIT 100000) ORDER BY EventTime format Null settings use_query_condition_cache=0;

SELECT *
FROM test.hits
WHERE (_part_starting_offset + _part_offset) IN (
    SELECT _part_starting_offset + _part_offset
    FROM test.hits
    ORDER BY EventTime ASC
    LIMIT 100000
)
ORDER BY EventTime ASC
FORMAT `Null`
SETTINGS use_query_condition_cache = 0

Query id: a63f5e43-9fd5-4d7e-91f9-09eaedcf61f3

Ok.

0 rows in set. Elapsed: 0.332 sec. Processed 7.76 million rows, 769.19 MB (23.37 million rows/s., 2.32 GB/s.)
Peak memory usage: 446.44 MiB.

:) SELECT * from test.hits ORDER BY EventTime LIMIT 100000 settings query_plan_max_limit_for_lazy_materialization=200000, use_query_condition_cache = 0 format Null

SELECT *
FROM test.hits
ORDER BY EventTime ASC
LIMIT 100000
SETTINGS query_plan_max_limit_for_lazy_materialization = 200000, use_query_condition_cache = 0
FORMAT `Null`

Query id: 43f78e0e-5b70-46be-97ee-02784f4366c6

Ok.

0 rows in set. Elapsed: 0.277 sec. Processed 7.76 million rows, 744.69 MB (28.02 million rows/s., 2.69 GB/s.)
Peak memory usage: 267.48 MiB.

@acking-you
Contributor Author

The improved implementation of lazy materialization in 25.12 shows me about the same speedup

great job!❤️

@acking-you acking-you closed this Dec 18, 2025
Labels

  • can be tested (Allows running workflows for external contributors)
  • pr-performance (Pull request with some performance improvements)
