Skip to content

Fix incorrect ORDER BY elision when read-in-order-through-join meets shard join#101456

Open
tuanpach wants to merge 3 commits intoClickHouse:masterfrom
tuanpach:fix-join-shard-read-in-order-reverse-key
Open

Fix incorrect ORDER BY elision when read-in-order-through-join meets shard join#101456
tuanpach wants to merge 3 commits intoClickHouse:masterfrom
tuanpach:fix-join-shard-read-in-order-reverse-key

Conversation

@tuanpach
Copy link
Copy Markdown
Member

@tuanpach tuanpach commented Apr 1, 2026

When both query_plan_read_in_order_through_join = 1 and query_plan_join_shard_by_pk_ranges = 1 are active, two optimizations interact incorrectly:

  1. optimizeReadInOrder runs first and converts the outer ORDER BY from FullSortingStep to FinishSortingStep, trusting that the MergeTree read will produce data in storage-key order through the join.

  2. optimizeJoinByShards runs afterward and splits the MergeTree read into PK-range layers via splitIntersectingPartsRangesIntoLayers. This splitter intentionally includes the border granule in both adjacent layers so that overlapping values can be merge-resolved within each shard. However, no cross-shard merge is performed for the outer ORDER BY, so each shard emits rows in DESC storage order independently and the outputs are concatenated, producing wrong results (e.g. 4,3,2,1,0,9,8,7,6,5 instead of 0..9 for a table with ORDER BY (c0 DESC)).

The root issue is that optimizeReadInOrder cannot know that a later pass will introduce overlapping shard boundaries that break its ordering assumption.

Fix: in buildInputOrderInfo, when the optimization traverses through a JOIN (joins_to_keep_in_order is non-empty) and query_plan_join_shard_by_pk_ranges is also enabled, skip the sort elimination. This preserves the outer ORDER BY as a FullSortingStep so the data is correctly sorted after the sharded join.

Adds regression test 04075_shard_join_read_in_order_reverse_key that pins all settings required to trigger the bug deterministically, including enable_join_runtime_filters = 0 (prevents a runtime filter step from accidentally blocking findReadingStep in optimizeJoinByShards) and merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability = 0 (disables CI chaos injection that would otherwise randomize range splits).

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix incorrect ORDER BY elision when read-in-order-through-join meets shard join

Closes #100870

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

…shard join

When both `query_plan_read_in_order_through_join = 1` and
`query_plan_join_shard_by_pk_ranges = 1` are active, two optimizations interact
incorrectly:

1. `optimizeReadInOrder` runs first and converts the outer `ORDER BY` from
   `FullSortingStep` to `FinishSortingStep`, trusting that the MergeTree read
   will produce data in storage-key order through the join.

2. `optimizeJoinByShards` runs afterward and splits the MergeTree read into
   PK-range layers via `splitIntersectingPartsRangesIntoLayers`. This splitter
   intentionally includes the border granule in both adjacent layers so that
   overlapping values can be merge-resolved within each shard. However, no
   cross-shard merge is performed for the outer `ORDER BY`, so each shard
   emits rows in DESC storage order independently and the outputs are
   concatenated, producing wrong results (e.g. `4,3,2,1,0,9,8,7,6,5`
   instead of `0..9` for a table with `ORDER BY (c0 DESC)`).

The root issue is that `optimizeReadInOrder` cannot know that a later pass
will introduce overlapping shard boundaries that break its ordering assumption.

Fix: in `buildInputOrderInfo`, when the optimization traverses through a JOIN
(`joins_to_keep_in_order` is non-empty) and `query_plan_join_shard_by_pk_ranges`
is also enabled, skip the sort elimination. This preserves the outer `ORDER BY`
as a `FullSortingStep` so the data is correctly sorted after the sharded join.

Adds regression test `04075_shard_join_read_in_order_reverse_key` that pins all
settings required to trigger the bug deterministically, including
`enable_join_runtime_filters = 0` (prevents a runtime filter step from
accidentally blocking `findReadingStep` in `optimizeJoinByShards`) and
`merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability = 0`
(disables CI chaos injection that would otherwise randomize range splits).

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Apr 1, 2026

Workflow [PR], commit [115b824]

Summary:

job_name test_name status info comment
Stateless tests (amd_msan, WasmEdge, parallel, 1/2) failure
03711_read_in_order_through_join FAIL cidb
Integration tests (amd_binary, 2/5) failure
test_storage_kafka/test_batch_fast.py::test_kafka_commit_on_block_write[generate_old_create_table_query] FAIL cidb, issue
Integration tests (amd_llvm_coverage, 4/5) failure
test_overcommit_tracker/test.py::test_user_overcommit FAIL cidb
Stress test (arm_msan) failure
Server died FAIL cidb
MemorySanitizer: use-of-uninitialized-value (STID: 1003-358c) FAIL cidb, issue

AI Review

Summary

This PR fixes a real correctness issue where optimizeReadInOrder could incorrectly elide outer sorting before optimizeJoinByShards introduces overlapping shard boundaries. The added guards in buildInputOrderInfo for SortingStep, AggregatingStep, and DistinctStep, plus the new stateless regression test, look consistent with the described failure mode and cover the interaction paths. I did not find additional correctness, safety, or performance blockers in the patch itself.

PR Metadata

Changelog category is incorrect for this code change. This PR fixes query-result correctness in optimizer logic, not CI behavior.

Suggested replacement:

  • Changelog category: Bug Fix
  • Changelog entry: Fix incorrect ORDER BY elimination when \ query_plan_read_in_order_through_join and query_plan_join_shard_by_pk_ranges are enabled together.
ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality ⚠️ Changelog category says CI Fix or Improvement but change is optimizer correctness fix
Safe rollout
Compilation time
Final Verdict

Status: ⚠️ Request changes

Minimum required actions:

  • Update PR template metadata: replace Changelog category with Bug Fix.

@clickhouse-gh clickhouse-gh bot added the pr-ci label Apr 1, 2026
tuanpach and others added 2 commits April 1, 2026 07:12
…overloads

The previous fix only guarded buildInputOrderInfo(SortingStep &...). The same
read-in-order-through-join traversal is also performed by the AggregatingStep
and DistinctStep overloads, which would make an ordering assumption that
optimizeJoinByShards then invalidates by splitting the MergeTree read into
overlapping PK-range layers via splitIntersectingPartsRangesIntoLayers.

Add the same early-return guard in the ReadFromMergeTree branch of both
overloads. Note that MergingAggregatedStep and the final DistinctStep already
produce correct results even with overlapping shard boundaries (two-phase
execution handles the border granule duplication), so these guards are
defensive: they prevent the optimizer from building a plan on a per-shard
ordering assumption that no longer holds after sharding.

Extend the regression test with queries covering both paths.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…anges is set

The previous guard returned `nullptr` before `requestReadingInOrder` was called,
which prevented both the `ReadType: InOrder` annotation on the `MergeTree` scan
and ORDER BY elision.  That was too aggressive: `ReadType: InOrder` is always
safe because it merely requests sorted reading within each part; only ORDER BY
elision is unsafe when `optimizeJoinByShards` may later split the scan into
overlapping PK-range layers.

Move the `query_plan_join_shard_by_pk_ranges` guard to after
`requestReadingInOrder` and `keepLeftPipelineInOrder` are called, so the scan
gets `ReadType: InOrder` for performance while the sort step is preserved for
correctness.

The `AggregatingStep` and `DistinctStep` overloads keep their early return
because applying aggregation-in-order or distinct-in-order on unsorted data
(from overlapping shard layers) can produce wrong results, not just a
performance regression.

Fixes the stateless test `04024_pr_read_in_order_through_join` which was
failing when the randomised setting `query_plan_join_shard_by_pk_ranges=True`
was injected by the test runner.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
for (auto * join_step : find_reading_ctx.joins_to_keep_in_order)
join_step->keepLeftPipelineInOrder(/* disable_squashing */ true);

/// When query_plan_join_shard_by_pk_ranges is enabled, optimizeJoinByShards runs
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changelog category looks mismatched for this change. This PR fixes a user-visible query correctness issue (wrong result ordering with query_plan_read_in_order_through_join + query_plan_join_shard_by_pk_ranges), so it should be Bug Fix rather than CI Fix or Improvement.

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Apr 1, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.00% 84.00% +0.00%
Functions 90.90% 90.90% +0.00%
Branches 76.50% 76.50% +0.00%

Changed lines: 100.00% (20/20) · Uncovered code

Full report · Diff report

@antaljanosbenjamin antaljanosbenjamin mentioned this pull request Apr 1, 2026
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: 03668_shard_join_in_reverse_order

1 participant