Fix incorrect ORDER BY elision when read-in-order-through-join meets shard join by tuanpach · Pull Request #101456 · ClickHouse/ClickHouse

tuanpach · 2026-04-01T06:26:35Z

When both query_plan_read_in_order_through_join = 1 and query_plan_join_shard_by_pk_ranges = 1 are active, two optimizations interact incorrectly:

optimizeReadInOrder runs first and converts the outer ORDER BY from FullSortingStep to FinishSortingStep, trusting that the MergeTree read will produce data in storage-key order through the join.
optimizeJoinByShards runs afterward and splits the MergeTree read into PK-range layers via splitIntersectingPartsRangesIntoLayers. This splitter intentionally includes the border granule in both adjacent layers so that overlapping values can be merge-resolved within each shard. However, no cross-shard merge is performed for the outer ORDER BY, so each shard emits rows in DESC storage order independently and the outputs are concatenated, producing wrong results (e.g. 4,3,2,1,0,9,8,7,6,5 instead of 0..9 for a table with ORDER BY (c0 DESC)).

The root issue is that optimizeReadInOrder cannot know that a later pass will introduce overlapping shard boundaries that break its ordering assumption.

Fix: in buildInputOrderInfo, when the optimization traverses through a JOIN (joins_to_keep_in_order is non-empty) and query_plan_join_shard_by_pk_ranges is also enabled, skip the sort elimination. This preserves the outer ORDER BY as a FullSortingStep so the data is correctly sorted after the sharded join.

Adds regression test 04075_shard_join_read_in_order_reverse_key that pins all settings required to trigger the bug deterministically, including enable_join_runtime_filters = 0 (prevents a runtime filter step from accidentally blocking findReadingStep in optimizeJoinByShards) and merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability = 0 (disables CI chaos injection that would otherwise randomize range splits).

Changelog category (leave one):

CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix incorrect ORDER BY elision when read-in-order-through-join meets shard join

Closes #100870

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

…shard join When both `query_plan_read_in_order_through_join = 1` and `query_plan_join_shard_by_pk_ranges = 1` are active, two optimizations interact incorrectly: 1. `optimizeReadInOrder` runs first and converts the outer `ORDER BY` from `FullSortingStep` to `FinishSortingStep`, trusting that the MergeTree read will produce data in storage-key order through the join. 2. `optimizeJoinByShards` runs afterward and splits the MergeTree read into PK-range layers via `splitIntersectingPartsRangesIntoLayers`. This splitter intentionally includes the border granule in both adjacent layers so that overlapping values can be merge-resolved within each shard. However, no cross-shard merge is performed for the outer `ORDER BY`, so each shard emits rows in DESC storage order independently and the outputs are concatenated, producing wrong results (e.g. `4,3,2,1,0,9,8,7,6,5` instead of `0..9` for a table with `ORDER BY (c0 DESC)`). The root issue is that `optimizeReadInOrder` cannot know that a later pass will introduce overlapping shard boundaries that break its ordering assumption. Fix: in `buildInputOrderInfo`, when the optimization traverses through a JOIN (`joins_to_keep_in_order` is non-empty) and `query_plan_join_shard_by_pk_ranges` is also enabled, skip the sort elimination. This preserves the outer `ORDER BY` as a `FullSortingStep` so the data is correctly sorted after the sharded join. Adds regression test `04075_shard_join_read_in_order_reverse_key` that pins all settings required to trigger the bug deterministically, including `enable_join_runtime_filters = 0` (prevents a runtime filter step from accidentally blocking `findReadingStep` in `optimizeJoinByShards`) and `merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability = 0` (disables CI chaos injection that would otherwise randomize range splits). Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

clickhouse-gh · 2026-04-01T06:27:17Z

Workflow [PR], commit [115b824]

Summary: ❌

job_name	test_name	status	info
Stateless tests (amd_msan, WasmEdge, parallel, 1/2)		failure
	03711_read_in_order_through_join	FAIL	cidb
Integration tests (amd_binary, 2/5)		failure
	test_storage_kafka/test_batch_fast.py::test_kafka_commit_on_block_write[generate_old_create_table_query]	FAIL	cidb, issue
Integration tests (amd_llvm_coverage, 4/5)		failure
	test_overcommit_tracker/test.py::test_user_overcommit	FAIL	cidb
Stress test (arm_msan)		failure
	Server died	FAIL	cidb
	MemorySanitizer: use-of-uninitialized-value (STID: 1003-358c)	FAIL	cidb, issue

AI Review

Summary

This PR fixes a real correctness issue where optimizeReadInOrder could incorrectly elide outer sorting before optimizeJoinByShards introduces overlapping shard boundaries. The added guards in buildInputOrderInfo for SortingStep, AggregatingStep, and DistinctStep, plus the new stateless regression test, look consistent with the described failure mode and cover the interaction paths. I did not find additional correctness, safety, or performance blockers in the patch itself.

PR Metadata

Changelog category is incorrect for this code change. This PR fixes query-result correctness in optimizer logic, not CI behavior.

Suggested replacement:

Changelog category: Bug Fix
Changelog entry: Fix incorrect ORDER BY elimination when \ query_plan_read_in_order_through_join and query_plan_join_shard_by_pk_ranges are enabled together.

ClickHouse Rules

Item	Status	Notes
Deletion logging	➖
Serialization versioning	➖
Core-area scrutiny	✅
No test removal	✅
Experimental gate	➖
No magic constants	✅
Backward compatibility	✅
`SettingsChangesHistory.cpp`	➖
PR metadata quality	⚠️	`Changelog category` says `CI Fix or Improvement` but change is optimizer correctness fix
Safe rollout	✅
Compilation time	✅

Final Verdict

Status: ⚠️ Request changes

Minimum required actions:

Update PR template metadata: replace Changelog category with Bug Fix.

src/Processors/QueryPlan/Optimizations/optimizeReadInOrder.cpp

…overloads The previous fix only guarded buildInputOrderInfo(SortingStep &...). The same read-in-order-through-join traversal is also performed by the AggregatingStep and DistinctStep overloads, which would make an ordering assumption that optimizeJoinByShards then invalidates by splitting the MergeTree read into overlapping PK-range layers via splitIntersectingPartsRangesIntoLayers. Add the same early-return guard in the ReadFromMergeTree branch of both overloads. Note that MergingAggregatedStep and the final DistinctStep already produce correct results even with overlapping shard boundaries (two-phase execution handles the border granule duplication), so these guards are defensive: they prevent the optimizer from building a plan on a per-shard ordering assumption that no longer holds after sharding. Extend the regression test with queries covering both paths. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

…anges is set The previous guard returned `nullptr` before `requestReadingInOrder` was called, which prevented both the `ReadType: InOrder` annotation on the `MergeTree` scan and ORDER BY elision. That was too aggressive: `ReadType: InOrder` is always safe because it merely requests sorted reading within each part; only ORDER BY elision is unsafe when `optimizeJoinByShards` may later split the scan into overlapping PK-range layers. Move the `query_plan_join_shard_by_pk_ranges` guard to after `requestReadingInOrder` and `keepLeftPipelineInOrder` are called, so the scan gets `ReadType: InOrder` for performance while the sort step is preserved for correctness. The `AggregatingStep` and `DistinctStep` overloads keep their early return because applying aggregation-in-order or distinct-in-order on unsorted data (from overlapping shard layers) can produce wrong results, not just a performance regression. Fixes the stateless test `04024_pr_read_in_order_through_join` which was failing when the randomised setting `query_plan_join_shard_by_pk_ranges=True` was injected by the test runner. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

clickhouse-gh · 2026-04-01T08:26:55Z

src/Processors/QueryPlan/Optimizations/optimizeReadInOrder.cpp

            for (auto * join_step : find_reading_ctx.joins_to_keep_in_order)
                join_step->keepLeftPipelineInOrder(/* disable_squashing */ true);
+
+            /// When query_plan_join_shard_by_pk_ranges is enabled, optimizeJoinByShards runs


Changelog category looks mismatched for this change. This PR fixes a user-visible query correctness issue (wrong result ordering with query_plan_read_in_order_through_join + query_plan_join_shard_by_pk_ranges), so it should be Bug Fix rather than CI Fix or Improvement.

clickhouse-gh · 2026-04-01T10:58:52Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	84.00%	84.00%	+0.00%
Functions	90.90%	90.90%	+0.00%
Branches	76.50%	76.50%	+0.00%

Changed lines: 100.00% (20/20) · Uncovered code

Full report · Diff report

clickhouse-gh bot added the pr-ci label Apr 1, 2026

clickhouse-gh bot reviewed Apr 1, 2026

View reviewed changes

src/Processors/QueryPlan/Optimizations/optimizeReadInOrder.cpp Outdated Show resolved Hide resolved

tuanpach and others added 2 commits April 1, 2026 07:12

clickhouse-gh bot reviewed Apr 1, 2026

View reviewed changes

antaljanosbenjamin mentioned this pull request Apr 1, 2026

Auto spilling join #97813

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix incorrect ORDER BY elision when read-in-order-through-join meets shard join#101456

Fix incorrect ORDER BY elision when read-in-order-through-join meets shard join#101456
tuanpach wants to merge 3 commits intoClickHouse:masterfrom
tuanpach:fix-join-shard-read-in-order-reverse-key

tuanpach commented Apr 1, 2026 •

edited

Loading

Uh oh!

clickhouse-gh bot commented Apr 1, 2026 •

edited

Loading

Uh oh!

Uh oh!

clickhouse-gh bot Apr 1, 2026

Uh oh!

clickhouse-gh bot commented Apr 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tuanpach commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Documentation entry for user-facing changes

Uh oh!

clickhouse-gh bot commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

PR Metadata

ClickHouse Rules

Final Verdict

Uh oh!

Uh oh!

clickhouse-gh bot Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

clickhouse-gh bot commented Apr 1, 2026

LLVM Coverage Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tuanpach commented Apr 1, 2026 •

edited

Loading

clickhouse-gh bot commented Apr 1, 2026 •

edited

Loading