Implement fused TopNAggregation operator for GROUP BY ... ORDER BY aggregate LIMIT K#98607
Implement fused TopNAggregation operator for GROUP BY ... ORDER BY aggregate LIMIT K#98607murphy-4o wants to merge 49 commits intoClickHouse:masterfrom
Conversation
…gregate LIMIT K Add `TopNAggregatingStep` and `TopNAggregatingTransform` that fuse aggregation, sorting, and limiting into a single pass for queries like: SELECT key, max(ts) AS m FROM t GROUP BY key ORDER BY m DESC LIMIT 5 The optimization replaces the Aggregating → Sorting → Limit plan chain with a single `TopNAggregating` step. Two modes are supported: - Mode 1 (sorted input): when the table's ORDER BY key matches the aggregate argument, the operator requests reverse reading from `ReadFromMergeTree` and stops after K unique groups — O(K) memory, minimal rows read. - Mode 2 (unsorted input): processes all rows, aggregates per group, then partial-sorts to select the top K — eliminates the full sort. Supported aggregates: `min`, `max`, `any`, `argMin`, `argMax`. Each declares its compatibility via a new `getTopKAggregateInfo` virtual method on `IAggregateFunction`. The optimization is gated by the `optimize_topn_aggregation` setting (default off). Made-with: Cursor
1. Block optimization when OFFSET > 0 — the optimizer now checks `LimitStep::getOffset` and bails out, preventing incorrect results for `LIMIT K OFFSET N` queries. 2. Eliminate per-row arena growth in key serialization — replaced `serializeValueIntoArena` (which allocated into the arena on every row) with a pure `SipHash`-128 key. Arena is now only used for aggregate state allocation, bounded by number of groups. 3. Preserve full `SortColumnDescription` — the optimizer now copies the original `nulls_direction` and `collator` from the `SortingStep` instead of rebuilding with defaults. Mode 2's `compareAt` uses the correct `nulls_direction` for NULL ordering. 4. Document mode-2 as intentional v1 simplification — the header comment now states that threshold-based pruning is not yet implemented. 5. Add test coverage for OFFSET (negative case) and NULLS FIRST/LAST (correctness comparison between optimized and unoptimized paths). Made-with: Cursor
1. Collision-safe group keys: Replace hash-only `std::string` key with `UInt128` SipHash + exact column comparison via `unordered_multimap`. On hash match, compares actual key column values against stored groups in `result_columns`, eliminating the theoretical collision risk. 2. Collation-aware compare in Mode 2: `generateMode2` now checks for a `collator` in `SortColumnDescription` and uses `compareAtWithCollation` when present, ensuring correct ordering for string ORDER BY with explicit collation (e.g., COLLATE 'en'). 3. Mode 1 safety for argMin/argMax/any: Add `output_ordered_by_sort_key` flag to `TopKAggregateInfo`. Only min/max set it to true (their output is monotonically related to the sort key). The optimizer now only enables Mode 1 (early termination with sorted input) when this flag is true, preventing incorrect results for argMin/argMax where the output column differs from the sort key. 4. Extended test coverage: - Collation-sensitive ordering (COLLATE 'en') optimized vs reference - Mode 1 EXPLAIN PLAN + correctness with small LIMIT - argMin/argMax ASC/DESC correctness parity - Tie-heavy dataset (100 groups, deterministic values) - Multi-aggregate (max + argMax) correctness - EXPLAIN PLAN for argMin optimization Made-with: Cursor
The `.cpp` was out of sync with the `.h` header due to a file-write tool failure in the previous session. The header declared the new collision-safe API (`hashGroupKey`, `findGroupIndex`, `unordered_multimap<UInt128, size_t, UInt128Hash>`), but the `.cpp` still contained the old `serializeGroupKey` returning `std::string` with hash-only identity. This commit makes `.cpp` consistent with `.h`: - `hashGroupKey`: returns `UInt128` SipHash digest of key columns. - `findGroupIndex`: on hash match, performs exact column-by-column comparison against stored keys in `result_columns` via `compareAt`, eliminating the hash-collision correctness risk. - Both `consumeMode1` and `consumeMode2` now use the new `unordered_multimap::emplace` / `equal_range` API. Made-with: Cursor
|
Workflow [PR], commit [cdf1f2e] Summary: ❌
AI ReviewSummaryThis PR introduces a fused ClickHouse Rules
Final Verdict
|
Key changes: 1. Periodic threshold refresh: After each chunk, `refreshThresholdFromStates` calls `insertResultInto` for all groups' ORDER BY aggregate, uses `IColumn::getPermutation` with limit K to find the K-th best actual aggregate value, and sets a tight boundary. This raises the threshold from the ~1st percentile (first-seen values) to the ~99.99th percentile (actual group maxes), enabling 95-99% of subsequent rows to be skipped. Capped at `limit * 10000` groups to prevent O(N_groups) blowup. 2. Mode 2 gating: Only applies when a `__topKFilter` prewhere can be pushed into `ReadFromMergeTree` (numeric, non-nullable aggregate argument, no existing prewhere). Without prewhere, falls through to the standard Aggregator pipeline which is faster due to type-dispatched hashing. 3. Mode 1 gating fix: Correctly requires `output_ordered_by_sort_key` for Mode 1 (early termination), preventing incorrect results for `argMin`/ `argMax` whose output has different sort order than the sort key. Performance results on 10M rows, 100K groups: - Mode 2 unsorted (uniform): 7ms vs 13ms baseline (1.8x faster, was 3.7x slower) - Mode 2 skewed (50M rows): 9ms vs 26ms baseline (2.9x faster) - Mode 2 memory: 4 MiB vs 271 MiB (67x less) - High cardinality (5M+ groups): 4-5x slower (known limitation) - Non-MergeTree: zero regression (falls through to standard pipeline) Made-with: Cursor
…, comparator caching Major changes to the TopNAggregation unsorted-input path (Mode 2): - Replace old `consumeMode2` (find + insert, per-row threshold check from aggregate states, periodic `refreshThresholdFromStates`) with a simpler direct loop using `HashMap::emplace`, per-chunk threshold estimation from raw input data, and `TopKThresholdTracker::testAndSet`. - Accumulate keys in dedicated `mode2_accumulated_keys` columns instead of reusing `result_columns`, avoiding confusion between Mode 1 and Mode 2. - `generateMode2Partial` emits only local top-K groups as intermediate aggregate state columns (ColumnAggregateFunction), drastically reducing fan-in to the merge transform. - Rewrite `TopNAggregatingMergeTransform`: pre-compute `agg_state_offsets`, `key_column_indices`, `agg_column_indices`; use `ArenaPtr` instead of value `Arena`; inline hash/create/destroy instead of separate methods. - Simplify destructor to unconditionally sweep `group_states`. - Remove unused `refreshThresholdFromStates`, `isBelowThreshold`, `order_agg_arg_col_idx`, `threshold_active` members. Mode 1 improvements: - Use `MergingSortedTransform` to merge N sorted streams instead of `pipeline.resize(1)`, preserving parallelism up to the merge point. - Pass `order_arg_col_name` through `TopNAggregatingStep` for sort desc. Optimizer (`optimizeTopNAggregation`): - Add `group_by_matches_sorting_prefix` guard: skip Mode 2 when GROUP BY keys match a prefix of the table sorting key (standard Aggregator is faster in this case due to two-level hash table with sorted output). - Add `unqualifyColumnName` helper for robust key matching. `FunctionTopKFilter`: - Cache the built comparator function + converted threshold, keyed by a monotonic version counter on `TopKThresholdTracker`. Rebuilds only when the threshold actually changes, avoiding redundant `build` + `convertFieldToType`. - Fix comparator to use `lessOrEquals`/`greaterOrEquals` (was `less`/`greater`), so rows exactly equal to the threshold are kept (required for correctness when the boundary group's argument equals the threshold). `TopKThresholdTracker`: - Add `version` atomic counter, incremented on every `testAndSet` that actually updates the threshold. Enables lock-free staleness detection. Made-with: Cursor
Made-with: Cursor
1. Collision-safe group identity: replace `HashMap<UInt128, size_t>` (SipHash-only key) with `HashMapWithSavedHash<std::string_view, size_t>` backed by `IColumn::serializeValueIntoArena`. This gives exact key comparison with automatic arena rollback for non-new keys via `SerializedKeyHolder`, eliminating the theoretical hash-collision correctness risk. 2. Safe threshold pruning: remove the per-chunk row-level threshold update from `consumeMode2`. Row-level K-th values can be stricter than the true group-level K-th aggregate, which would incorrectly filter rows needed by groups in the final top K. The threshold is now updated only from group-level aggregates in `generateMode2Partial`, which is provably safe (a partial's K-th group aggregate is always <= the global K-th). 3. Fix Mode 2 gating: remove the `group_by_matches_sorting_prefix` check that blocked Mode 2 even when Mode 1 was inapplicable (e.g. GROUP BY matches the sorting key but the ORDER BY aggregate argument is a different column). The `!sorted_input` guard already suffices. 4. Add documentation comments explaining why Mode 2 is intentionally not applied as a generic unsorted fallback, and noting that Mode 2 memory is O(unique groups) without in-stream eviction. 5. Add stress test for large composite keys, skewed ties, and parallel Mode 2 threshold pruning. Made-with: Cursor
The test uses `COLLATE 'en'` which requires ICU. The fast test build is compiled without ICU, causing `SUPPORT_IS_DISABLED` error. Made-with: Cursor
Add a new `topn_aggregation_pruning_level` (UInt64, 0-2) setting to control Mode 2 optimizations independently: - Level 0: direct compute only (no threshold, no filter) - Level 1: + in-transform threshold pruning - Level 2: + dynamic `__topKFilter` prewhere pushdown The dynamic filter (level 2) now respects `use_top_k_dynamic_filtering`, which defaults to false, so it is no longer injected unconditionally. Re-add in-transform threshold pruning (`refreshThresholdFromStates`, `isBelowThreshold`) to `TopNAggregatingTransform` for level 1+ support. Test changes: - Add correctness tests for all three pruning levels on MergeTree table - Add EXPLAIN tests verifying plan differences across levels - Existing EXPLAIN prewhere test now explicitly sets `use_top_k_dynamic_filtering = 1` Made-with: Cursor
The static analyzer cannot prove that `enable_threshold_pruning` being true guarantees `order_arg_col` is non-null. Check the pointer directly instead of the boolean flag to satisfy clang-analyzer-core.NonNullParamChecker. Made-with: Cursor
- Setting description for `topn_aggregation_pruning_level` now explicitly states that level 2 requires `use_top_k_dynamic_filtering` to inject the prewhere, and that level 0 is not recommended (slower than baseline). - Mode 2 now requires `pruning_level >= 1` to activate. Level 0 (direct compute without threshold) is known to regress vs the standard Aggregator pipeline, so it should not silently activate. - EXPLAIN reference for level 0 updated: now shows the standard Aggregating + Sorting + Limit pipeline (no TopNAggregating). Made-with: Cursor
Pass 0 (no limit) to `MergingSortedTransform` instead of the user's LIMIT K. K refers to the number of distinct groups, not total rows. With the previous code, if K consecutive rows belonged to the same group, MergingSorted would stop after K rows and the downstream transform would see fewer than K distinct groups. The backpressure from `TopNAggregatingTransform::consumeMode1` calling `finishConsume` is sufficient to stop reading once K groups are found. Made-with: Cursor
Header: regroup members into logical sections (configuration, column indices, state layout, per-group state, Mode 1/2 columns, threshold). Implementation: extract shared helpers to eliminate duplication: - `computeStateLayout`: state offset/alignment (was duplicated in both constructors) - `findOrderByAggIndex`: ORDER BY aggregate lookup (was duplicated in both constructors and `initColumnIndices`) - `sortAndLimitColumns`: sort-permute-limit pattern (was triplicated in `generateMode2`, `generateMode2Partial`, merge `generate`) - `prepareArgColumnPtrs`: arg column pointer setup (was duplicated in `consumeMode1` and `consumeMode2`) Also make `consumeMode1` use the shared aggregate state helpers (`createAggregateStates`/`addRowToAggregateStates`/etc.) instead of inline state management, and deduplicate the merge loop in `TopNAggregatingMergeTransform::consume`. Net: -20 lines, fewer copy-paste surfaces for future changes. Made-with: Cursor
Avoid large-`LIMIT` regressions by adding a conservative `topn_aggregation_max_limit` gate, and add `EXPLAIN` coverage that verifies default fallback and explicit override behavior. Also reduce threshold refresh overhead in `TopNAggregatingTransform` with adaptive cadence and document `GroupState` intent for future per-group metadata. Made-with: Cursor
Add `buildThresholdKeepMask` to `TopNAggregatingTransform` which uses `IColumn::compareColumn` for vectorized threshold filtering in Mode 2 instead of per-row `isBelowThreshold` calls. This amortizes virtual call overhead and enables SIMD-friendly comparison paths. Add "Design tradeoffs" and "Future improvements" sections to `topngroupby.md` covering: serialized-key HashMap vs type-dispatched tables, threshold refresh cost and adaptive backoff, dynamic filter effectiveness by data distribution, partial top-K correctness, and planned improvements (type-dispatched tables, group eviction, mixed-aggregate two-pass, distributed support, etc.). Made-with: Cursor
Made-with: Cursor
The test uses EXPLAIN to verify query plan structure, which changes when ParallelReplicas is enabled (adds MergingAggregated/Union/ ReadFromRemoteParallelReplicas nodes). This caused the reference output mismatch in the ParallelReplicas stateless test CI job. Made-with: Cursor
|
@cursor review |
…lumns The companion aggregate compatibility loop only checked `determined_by_first_row_direction` but never verified that each companion's determining argument column matches the ORDER BY aggregate's argument column. This allowed, e.g., `argMax(payload, val)` alongside `max(ts)` (val != ts) to be incorrectly optimized: Mode 1 only processes the first row per group sorted by `ts`, missing the row with max `val`; Mode 2 threshold pruning on `ts` can skip rows carrying the true max `val`. Add a check that every non-`any` companion aggregate's determining argument (`argument_names.back()`) equals the ORDER BY aggregate's determining argument. `any()` (`INT_MAX`) remains exempt since it accepts any row by definition. Made-with: Cursor
…structor The constructor was changed to accept `SortColumnDescription` (for collation/nullable support) but the call site in `optimizeTopNAggregation` still passed a bare `int sort_direction`. Co-Authored-By: Claude <[email protected]>
At topn_aggregation_pruning_level = 2, it uses both. The
When those conditions aren't met, only the PREWHERE filtering is active. Ref: #99033 |
…sHistory` The settings were registered under 26.3 but the current development version is 26.4, causing `02995_new_settings_history` to fail. ClickHouse#98607 Co-Authored-By: Claude <[email protected]>
| /// Mode 1 requires output_ordered_by_sort_key because early termination depends | ||
| /// on the aggregate result being monotonically ordered with the sort key. For example, | ||
| /// argMin(payload, ts) returns payload which has different ordering than ts. | ||
| if (read_from_mt && order_info.output_ordered_by_sort_key) |
There was a problem hiding this comment.
Mode 1 is enabled based only on sort direction, but this path does not validate SQL ordering semantics beyond direction.
At this point we only check output_ordered_by_sort_key and call requestReadingInOrder(1, required_direction, limit_value), while the original SortingStep can also carry COLLATE and NULLS FIRST/LAST semantics (sort_desc[0].collator / nulls_direction).
If query ORDER BY uses collation (and potentially nullable semantics), physical MergeTree read order is not guaranteed to match that order, so early termination after K groups can return wrong results.
Please add a strict guard before enabling sorted_input: at minimum reject this optimization when sort_desc[0].collator is set; and for nullable argument columns, only enable when null ordering is proven equivalent to storage read order.
- Fix `03470_topn_aggregation_parallel_replicas` test failure in "old analyzer" CI config: move `allow_experimental_analyzer = 1` from subquery SETTINGS to session-level SET, because changing this setting inside a subquery is disallowed when the top-level value differs. - Fix potential overflow in `maybeRefreshThreshold`: replace `groups > limit * 10000` with `groups / 10000 > limit` to avoid wrapping when `limit` is large. ClickHouse#98607 Co-Authored-By: Claude <[email protected]>
…into murphy_issue_75098
|
Briefly looking at the code, it seemed to me that mode1 and mode2 does not share too much logic in common (maybe a small base class for common code?). I think it will be more readable if we have two different transforms (with good descriptive names) instead. Then, we can dispatch them via |
| @@ -0,0 +1,648 @@ | |||
| -- Tags: no-random-settings, no-fasttest, no-parallel-replicas | |||
There was a problem hiding this comment.
-- Tags: no-random-settings
I understand why we have this setting to make sure that EXPLAIN queries do not become flaky. However, random settings are quite useful in finding bugs.
That's why I propose that we can have two test file: this one and equivalent 03467_topn_aggregation_explain.sql. 03467_topn_aggregation.sql can have all the correctness tests. Then, for each test 03467_topn_aggregation_explain.sql will verify that TopNAggregatingTransform is actually used. We can keep no-random-settings only for the explain test.
no-fasttest
I think we can drop it. I see no reason to keep it.
The monolithic `TopNAggregatingTransform` handled both sorted-input early-termination (Mode 1) and hash-based direct aggregation (Mode 2) with runtime dispatch. These modes share little logic beyond column index mapping and aggregate state lifecycle. Split into three classes: - `TopNAggregatingTransformBase` -- base class with shared infrastructure (column indices, key serialization, aggregate state helpers) - `TopNSortedAggregatingTransform` -- Mode 1: sorted input, stops after K distinct groups - `TopNDirectAggregatingTransform` -- Mode 2: hash aggregation with optional threshold pruning, supports partial output for parallel workers `TopNAggregatingStep` now dispatches to the appropriate transform class at construction time, which is natural since it already knows the mode. `TopNAggregatingMergeTransform` is unchanged. Made-with: Cursor
- Fix NaN direction hint: use `nulls_direction` from `sort_description` consistently in `getSortPermutation`, `isBelowThreshold`, and `buildThresholdKeepMask` instead of `sort_direction`, matching the final output sorting semantics. - Add Mode 1 collation guard: reject sorted-input early termination when `sort_desc[0].collator` is set, because physical MergeTree order may not match collation order. - Clarify `topn_aggregation_pruning_level` docs: level 0 disables Mode 2 entirely (not "direct compute only"), only Mode 1 can apply. - Split `03467_topn_aggregation` test into correctness file (allows random settings) and EXPLAIN file (keeps `no-random-settings`); drop `no-fasttest` tag. Made-with: Cursor
- Add `chassert` to verify column type consistency between input and boundary columns in `isBelowThreshold` and `buildThresholdKeepMask`. - Guard `refreshThresholdFromStates` against `limit==0` to prevent unsigned underflow in `std::min(limit, n) - 1`. - Add missing "max + any" reference queries in all three test files to verify companion aggregate correctness. - Add negative EXPLAIN cases for `min + DESC` (wrong direction) and multi-column `ORDER BY` to improve optimizer coverage. - Reset `allow_suspicious_low_cardinality_types` after table creation to avoid session pollution in test files. - Fix misleading "ties" comment (aggregate values are actually distinct). Made-with: Cursor
- Fix `TopKThresholdTracker::compareFields` to treat NaN the same as column-level code: NaN is ordered via `nulls_direction` instead of being treated as equal to all values by `Field::operator<`. This prevents the shared threshold from disagreeing with per-row pruning in `TopNDirectAggregatingTransform` and `FunctionTopKFilter`. - Reject Mode 1 (sorted-input early termination) when the determining aggregate argument is nullable. MergeTree physical NULL ordering may differ from the query's `NULLS FIRST`/`NULLS LAST`, so early termination after K groups could return wrong results. - Expand `optimize_topn_aggregation` setting description to document Mode 1 vs Mode 2 behavior, eligibility requirements, and current limitations (nullable, collation, float/NaN caveats). - Add regression tests for Float64 NaN with explicit `NULLS FIRST` and `NULLS LAST` under Mode 2, and for nullable-primary-key tables under Mode 1. Made-with: Cursor
Route floating-point types through the `executeGeneral` path in `FunctionTopKFilter` instead of the vectorized `greaterOrEquals` / `lessOrEquals` fast path. The vectorized comparisons return `false` for NaN, which can incorrectly drop rows that the transform-level pruning (using `compareColumn` with `nulls_direction`) would keep. Add regression tests for Float64 NaN at `topn_aggregation_pruning_level = 2` with `use_top_k_dynamic_filtering = 1`. Made-with: Cursor
Document the TopN aggregation optimization in the GROUP BY reference page: Mode 1 vs Mode 2 behavior, eligibility requirements, current limitations, example query, and related settings. Made-with: Cursor
The correctness test uses `COLLATE 'en'` which requires ICU, not available in the fast test environment. Made-with: Cursor
… columns The flaky check with random settings (e.g. `max_block_size=1`) can produce `ColumnConst` or other wrapped column types that aggregate functions cannot handle via `assert_cast`. Use `convertToFullIfNeeded` (which handles Const, Replicated, Sparse, and LowCardinality wrappers) instead of only converting LowCardinality, matching what the standard `Aggregator` does. Made-with: Cursor
- Unwrap key columns with `convertToFullIfNeeded` (like aggregate arg columns) so that `insertFrom` receives concrete column types instead of `ColumnConst`/`ColumnSparse` wrappers that cause `assert_cast` failures. Applied to Mode 1, Mode 2, and merge transform. - Add try/catch rollback in `createAggregateStates` matching the standard `Aggregator` pattern: if `create` throws for aggregate j, destroy already-created states 0..j-1. - Add try/catch in Mode 1 `consume` around temporary state usage so `destroyAggregateStates` is called even if `addRowToAggregateStates` or `insertResultsFromStates` throws. - Move `++num_groups` before `addRowToAggregateStates` in Mode 2 so `accumulated_keys.size()` and `num_groups` stay consistent if `add` throws after `group_states.push_back`. Made-with: Cursor
The nullable key test had two groups (b and e) both with NULL max(val), making their relative order in NULLS FIRST non-deterministic. Changed e's value from NULL to a date so all aggregate values are distinct. Made-with: Cursor
| addSettingsChanges(settings_changes_history, "26.4", | ||
| { | ||
| {"optimize_topn_aggregation", false, false, "New setting to enable fused TopN aggregation optimization for GROUP BY ... ORDER BY aggregate LIMIT K queries."}, | ||
| {"topn_aggregation_pruning_level", 2, 2, "Controls Mode 2 pruning optimizations: 0=direct compute only, 1=+in-transform threshold, 2=+dynamic filter pushdown."}, |
There was a problem hiding this comment.
The description here says topn_aggregation_pruning_level uses 0=direct compute only, but both the setting docs (Settings.cpp) and optimizer logic (optimizeTopNAggregation) treat level 0 as "Mode 2 disabled entirely". Could you align this reason string to the real behavior to avoid confusing users reading compatibility change history?
The NaN test had two groups (b and d) both with nan, causing non-deterministic ordering under random settings. Changed d from nan to 150 so all aggregate values are distinct. Made-with: Cursor
LLVM Coverage Report
Changed lines: 84.91% (1035/1219) | lost baseline coverage: 4 line(s) · Uncovered code |
|
|
||
| - [GROUP BY optimization](/sql-reference/statements/select/group-by#group-by-optimization-depending-on-table-sorting-key) | ||
| )", 0) \ | ||
| DECLARE(Bool, optimize_topn_aggregation, false, R"( |
There was a problem hiding this comment.
Temporarily let's enable it by default to see how it does with the existing performance tests.
| DECLARE(Bool, optimize_topn_aggregation, false, R"( | |
| DECLARE(Bool, optimize_topn_aggregation, true, R"( |
Summary
This PR introduces a fused
TopNAggregatingquery plan operator for queries like:It replaces the
AggregatingStep -> SortingStep -> LimitStepchain with a singleTopNAggregatingStepand supports two execution modes:Kunique groups.__topKFiltertoPREWHEREfor storage-level skipping.Closes #75098
Newly introduced settings
optimize_topn_aggregation(default0): enablesTopNAggregationrewrite.topn_aggregation_pruning_level(default2): controls Mode 2 pruning stack:0: direct compute only1: + in-transform threshold pruning2: + dynamic__topKFilterpushdown (whenuse_top_k_dynamic_filtering=1)topn_aggregation_max_limit(default1000): conservative Mode 2 gate; for largerLIMITvalues optimizer falls back to standard aggregation/sort pipeline to avoid known large-Kregressions.Performance (short)
topn_aggregation_max_limit=1000defaults to conservative fallback and avoids worst cases.Changelog category:
Changelog entry:
Added
optimize_topn_aggregationfor fusedGROUP BY ... ORDER BY aggregate LIMIT Kexecution, withtopn_aggregation_pruning_levelcontrols for Mode 2 pruning andtopn_aggregation_max_limitfor conservative large-Kfallback.Documentation entry for user-facing changes
Made with Cursor
Note
Medium Risk
Adds a new query-plan rewrite and execution operator (plus dynamic PREWHERE injection) that can change performance/correctness for eligible queries, though it is guarded by a new setting defaulting to off. Risk is mainly in plan-pattern matching and the new threshold-pruning path across nullable/collation edge cases (partly gated/disabled).
Overview
Adds an optional TopN aggregation optimization (
optimize_topn_aggregation) that rewritesLimit -> Sorting -> Aggregatinginto a singleTopNAggregatingStepforGROUP BY ... ORDER BY <aggregate> LIMIT Kwhen all aggregates are compatible.Extends
IAggregateFunctionwithgetTopKAggregateInfo()and implements it formin/max,any, andargMin/argMaxto drive the rewrite and mode selection (sorted-input early termination vs unsorted threshold pruning with optional__topKFilterPREWHERE pushdown), controlled by new settingstopn_aggregation_pruning_levelandtopn_aggregation_max_limit.Updates
__topKFilterto use inclusive comparisons, adds versioned/cached comparator building backed by a newTopKThresholdTrackerversion counter, exposesLimitStep::getOffset()for rewrite gating, and adds stateless correctness/stress tests covering positive and negative cases.Written by Cursor Bugbot for commit b2e47a2. This will update automatically on new commits. Configure here.