Fix incorrect ORDER BY elision when read-in-order-through-join meets shard join#101456
Fix incorrect ORDER BY elision when read-in-order-through-join meets shard join#101456tuanpach wants to merge 3 commits intoClickHouse:masterfrom
Conversation
…shard join When both `query_plan_read_in_order_through_join = 1` and `query_plan_join_shard_by_pk_ranges = 1` are active, two optimizations interact incorrectly: 1. `optimizeReadInOrder` runs first and converts the outer `ORDER BY` from `FullSortingStep` to `FinishSortingStep`, trusting that the MergeTree read will produce data in storage-key order through the join. 2. `optimizeJoinByShards` runs afterward and splits the MergeTree read into PK-range layers via `splitIntersectingPartsRangesIntoLayers`. This splitter intentionally includes the border granule in both adjacent layers so that overlapping values can be merge-resolved within each shard. However, no cross-shard merge is performed for the outer `ORDER BY`, so each shard emits rows in DESC storage order independently and the outputs are concatenated, producing wrong results (e.g. `4,3,2,1,0,9,8,7,6,5` instead of `0..9` for a table with `ORDER BY (c0 DESC)`). The root issue is that `optimizeReadInOrder` cannot know that a later pass will introduce overlapping shard boundaries that break its ordering assumption. Fix: in `buildInputOrderInfo`, when the optimization traverses through a JOIN (`joins_to_keep_in_order` is non-empty) and `query_plan_join_shard_by_pk_ranges` is also enabled, skip the sort elimination. This preserves the outer `ORDER BY` as a `FullSortingStep` so the data is correctly sorted after the sharded join. Adds regression test `04075_shard_join_read_in_order_reverse_key` that pins all settings required to trigger the bug deterministically, including `enable_join_runtime_filters = 0` (prevents a runtime filter step from accidentally blocking `findReadingStep` in `optimizeJoinByShards`) and `merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability = 0` (disables CI chaos injection that would otherwise randomize range splits). Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
|
Workflow [PR], commit [115b824] Summary: ❌
AI ReviewSummaryThis PR fixes a real correctness issue where PR Metadata
Suggested replacement:
ClickHouse Rules
Final VerdictStatus: Minimum required actions:
|
…overloads The previous fix only guarded buildInputOrderInfo(SortingStep &...). The same read-in-order-through-join traversal is also performed by the AggregatingStep and DistinctStep overloads, which would make an ordering assumption that optimizeJoinByShards then invalidates by splitting the MergeTree read into overlapping PK-range layers via splitIntersectingPartsRangesIntoLayers. Add the same early-return guard in the ReadFromMergeTree branch of both overloads. Note that MergingAggregatedStep and the final DistinctStep already produce correct results even with overlapping shard boundaries (two-phase execution handles the border granule duplication), so these guards are defensive: they prevent the optimizer from building a plan on a per-shard ordering assumption that no longer holds after sharding. Extend the regression test with queries covering both paths. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…anges is set The previous guard returned `nullptr` before `requestReadingInOrder` was called, which prevented both the `ReadType: InOrder` annotation on the `MergeTree` scan and ORDER BY elision. That was too aggressive: `ReadType: InOrder` is always safe because it merely requests sorted reading within each part; only ORDER BY elision is unsafe when `optimizeJoinByShards` may later split the scan into overlapping PK-range layers. Move the `query_plan_join_shard_by_pk_ranges` guard to after `requestReadingInOrder` and `keepLeftPipelineInOrder` are called, so the scan gets `ReadType: InOrder` for performance while the sort step is preserved for correctness. The `AggregatingStep` and `DistinctStep` overloads keep their early return because applying aggregation-in-order or distinct-in-order on unsorted data (from overlapping shard layers) can produce wrong results, not just a performance regression. Fixes the stateless test `04024_pr_read_in_order_through_join` which was failing when the randomised setting `query_plan_join_shard_by_pk_ranges=True` was injected by the test runner. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
| for (auto * join_step : find_reading_ctx.joins_to_keep_in_order) | ||
| join_step->keepLeftPipelineInOrder(/* disable_squashing */ true); | ||
|
|
||
| /// When query_plan_join_shard_by_pk_ranges is enabled, optimizeJoinByShards runs |
There was a problem hiding this comment.
Changelog category looks mismatched for this change. This PR fixes a user-visible query correctness issue (wrong result ordering with query_plan_read_in_order_through_join + query_plan_join_shard_by_pk_ranges), so it should be Bug Fix rather than CI Fix or Improvement.
LLVM Coverage Report
Changed lines: 100.00% (20/20) · Uncovered code |
When both
query_plan_read_in_order_through_join = 1andquery_plan_join_shard_by_pk_ranges = 1are active, two optimizations interact incorrectly:optimizeReadInOrderruns first and converts the outerORDER BYfromFullSortingSteptoFinishSortingStep, trusting that the MergeTree read will produce data in storage-key order through the join.optimizeJoinByShardsruns afterward and splits the MergeTree read into PK-range layers viasplitIntersectingPartsRangesIntoLayers. This splitter intentionally includes the border granule in both adjacent layers so that overlapping values can be merge-resolved within each shard. However, no cross-shard merge is performed for the outerORDER BY, so each shard emits rows in DESC storage order independently and the outputs are concatenated, producing wrong results (e.g.4,3,2,1,0,9,8,7,6,5instead of0..9for a table withORDER BY (c0 DESC)).The root issue is that
optimizeReadInOrdercannot know that a later pass will introduce overlapping shard boundaries that break its ordering assumption.Fix: in
buildInputOrderInfo, when the optimization traverses through a JOIN (joins_to_keep_in_orderis non-empty) andquery_plan_join_shard_by_pk_rangesis also enabled, skip the sort elimination. This preserves the outerORDER BYas aFullSortingStepso the data is correctly sorted after the sharded join.Adds regression test
04075_shard_join_read_in_order_reverse_keythat pins all settings required to trigger the bug deterministically, includingenable_join_runtime_filters = 0(prevents a runtime filter step from accidentally blockingfindReadingStepinoptimizeJoinByShards) andmerge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability = 0(disables CI chaos injection that would otherwise randomize range splits).Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix incorrect ORDER BY elision when read-in-order-through-join meets shard join
Closes #100870
Documentation entry for user-facing changes