fix: validate inter-file ordering in eq_properties()#20329
fix: validate inter-file ordering in eq_properties()#20329adriangb merged 6 commits intoapache:mainfrom
Conversation
|
On main: |
|
@zhuqi-lucas or @xudong963 could you take a look at this? |
…rect SortExec elimination eq_properties() blindly trusted output_ordering from Parquet metadata without verifying that files within a group are in the correct inter-file order. This caused EnforceSorting to incorrectly remove SortExec when filesystem order didn't match data order, producing wrong results. Add validated_output_ordering() that filters output orderings using MinMaxStatistics to verify inter-file sort order, and wire it into eq_properties(). Add regression tests for reversed, overlapping, DESC, multi-column, and correctly-ordered multi-file scenarios. Co-Authored-By: Claude Opus 4.6 <[email protected]>
9dd01be to
b625a0f
Compare
zhuqi-lucas
left a comment
There was a problem hiding this comment.
The bug is real — eq_properties() uses raw output_ordering while get_projected_output_ordering() (L1362-1415) already validates with MinMax stats but is only used in DisplayAs. So EXPLAIN shows correct ordering, but the optimizer uses unvalidated ordering. Nice catch.
However, validated_output_ordering() duplicates the validation logic with subtle differences — get_projected_output_ordering() projects first then validates with ordered_column_indices_from_projection(), while this passes None for projection. Could we refactor to share the validation logic to avoid divergence?
And CI is failing on parquet_sorted_statistics.slt:267 — a constant_col query where all values are identical across files. In this case MinMaxStatistics can't prove inter-file ordering (min=max for every file → is_sorted() may reject it even though the output is trivially sorted since all values are equal).
This looks like a real regression: the fix is correct for the general case, but too conservative for constant columns. Might need to handle the min == max across all files as a special case, or check if is_sorted() treats equal ranges as sorted.
…ation logic Use non-strict comparison (<=) in MinMaxStatistics::is_sorted() so that constant columns (where all files have min=max) are correctly recognized as sorted, avoiding unnecessary SortExec insertion. Extract shared is_ordering_valid_for_file_groups() helper to eliminate duplicated validation logic between validated_output_ordering() and get_projected_output_ordering(). Co-Authored-By: Claude Opus 4.6 <[email protected]>
489c678 to
24952bd
Compare
There was a problem hiding this comment.
LGTM now, thanks @adriangb !
Minor:
Maybe follow-up improvement for some refactor:
validated_output_ordering() duplicates the validation logic with subtle differences — get_projected_output_ordering() projects first then validates with ordered_column_indices_from_projection(), while this passes None for projection.
Discovered this bug while working on apache#19724. TLDR: just because the files themselves are sorted doesn't mean the partition streams are sorted. - **`eq_properties()` in `FileScanConfig` blindly trusted `output_ordering`** (set from Parquet `sorting_columns` metadata) without verifying that files within a group are in the correct inter-file order - `EnforceSorting` then removed `SortExec` based on this unvalidated ordering, producing **wrong results** when filesystem order didn't match data order - Added `validated_output_ordering()` that filters orderings using `MinMaxStatistics::new_from_files()` + `is_sorted()` to verify inter-file sort order before reporting them to the optimizer - Added `validated_output_ordering()` method on `FileScanConfig` that validates each output ordering against actual file group statistics - Changed `eq_properties()` to call `self.validated_output_ordering()` instead of `self.output_ordering.clone()` Added 8 new regression tests (Tests 4-11): | Test | Scenario | Key assertion | |------|----------|---------------| | **4** | Reversed filesystem order (inferred ordering) | SortExec retained — wrong inter-file order detected | | **5** | Overlapping file ranges (inferred ordering) | SortExec retained — overlapping ranges detected | | **6** | `WITH ORDER` + reversed filesystem order | SortExec retained despite explicit ordering | | **7** | Correctly ordered multi-file group (positive) | SortExec eliminated — validation passes | | **8** | DESC ordering with wrong inter-file DESC order | SortExec retained for DESC direction | | **9** | Multi-column sort key (overlapping vs non-overlapping) | Conservative rejection with overlapping stats; passes with clean boundaries | | **10** | Correctly ordered + `WITH ORDER` (positive) | SortExec eliminated — both ordering and stats agree | | **11** | Multiple partitions (one file per group) | `SortPreservingMergeExec` merges; no per-partition sort needed | - [x] `cargo test --test sqllogictests -- sort_pushdown` — all new + existing tests pass - [x] `cargo test -p datafusion-datasource` — 97 unit tests + 6 doc tests pass - [x] Existing Test 1 (single-file sort pushdown with `WITH ORDER`) still eliminates SortExec (no regression) 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <[email protected]>
Discovered this bug while working on apache#19724. TLDR: just because the files themselves are sorted doesn't mean the partition streams are sorted. - **`eq_properties()` in `FileScanConfig` blindly trusted `output_ordering`** (set from Parquet `sorting_columns` metadata) without verifying that files within a group are in the correct inter-file order - `EnforceSorting` then removed `SortExec` based on this unvalidated ordering, producing **wrong results** when filesystem order didn't match data order - Added `validated_output_ordering()` that filters orderings using `MinMaxStatistics::new_from_files()` + `is_sorted()` to verify inter-file sort order before reporting them to the optimizer - Added `validated_output_ordering()` method on `FileScanConfig` that validates each output ordering against actual file group statistics - Changed `eq_properties()` to call `self.validated_output_ordering()` instead of `self.output_ordering.clone()` Added 8 new regression tests (Tests 4-11): | Test | Scenario | Key assertion | |------|----------|---------------| | **4** | Reversed filesystem order (inferred ordering) | SortExec retained — wrong inter-file order detected | | **5** | Overlapping file ranges (inferred ordering) | SortExec retained — overlapping ranges detected | | **6** | `WITH ORDER` + reversed filesystem order | SortExec retained despite explicit ordering | | **7** | Correctly ordered multi-file group (positive) | SortExec eliminated — validation passes | | **8** | DESC ordering with wrong inter-file DESC order | SortExec retained for DESC direction | | **9** | Multi-column sort key (overlapping vs non-overlapping) | Conservative rejection with overlapping stats; passes with clean boundaries | | **10** | Correctly ordered + `WITH ORDER` (positive) | SortExec eliminated — both ordering and stats agree | | **11** | Multiple partitions (one file per group) | `SortPreservingMergeExec` merges; no per-partition sort needed | - [x] `cargo test --test sqllogictests -- sort_pushdown` — all new + existing tests pass - [x] `cargo test -p datafusion-datasource` — 97 unit tests + 6 doc tests pass - [x] Existing Test 1 (single-file sort pushdown with `WITH ORDER`) still eliminates SortExec (no regression) 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <[email protected]>
## Summary Discovered this bug while working on apache#19724. TLDR: just because the files themselves are sorted doesn't mean the partition streams are sorted. - **`eq_properties()` in `FileScanConfig` blindly trusted `output_ordering`** (set from Parquet `sorting_columns` metadata) without verifying that files within a group are in the correct inter-file order - `EnforceSorting` then removed `SortExec` based on this unvalidated ordering, producing **wrong results** when filesystem order didn't match data order - Added `validated_output_ordering()` that filters orderings using `MinMaxStatistics::new_from_files()` + `is_sorted()` to verify inter-file sort order before reporting them to the optimizer ## Changes ### `datafusion/datasource/src/file_scan_config.rs` - Added `validated_output_ordering()` method on `FileScanConfig` that validates each output ordering against actual file group statistics - Changed `eq_properties()` to call `self.validated_output_ordering()` instead of `self.output_ordering.clone()` ### `datafusion/sqllogictest/test_files/sort_pushdown.slt` Added 8 new regression tests (Tests 4-11): | Test | Scenario | Key assertion | |------|----------|---------------| | **4** | Reversed filesystem order (inferred ordering) | SortExec retained — wrong inter-file order detected | | **5** | Overlapping file ranges (inferred ordering) | SortExec retained — overlapping ranges detected | | **6** | `WITH ORDER` + reversed filesystem order | SortExec retained despite explicit ordering | | **7** | Correctly ordered multi-file group (positive) | SortExec eliminated — validation passes | | **8** | DESC ordering with wrong inter-file DESC order | SortExec retained for DESC direction | | **9** | Multi-column sort key (overlapping vs non-overlapping) | Conservative rejection with overlapping stats; passes with clean boundaries | | **10** | Correctly ordered + `WITH ORDER` (positive) | SortExec eliminated — both ordering and stats agree | | **11** | Multiple partitions (one file per group) | `SortPreservingMergeExec` merges; no per-partition sort needed | ## Test plan - [x] `cargo test --test sqllogictests -- sort_pushdown` — all new + existing tests pass - [x] `cargo test -p datafusion-datasource` — 97 unit tests + 6 doc tests pass - [x] Existing Test 1 (single-file sort pushdown with `WITH ORDER`) still eliminates SortExec (no regression) 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 <[email protected]>
) (#20509) - Part of #20287 - Closes #20508 on branch-52 This PR simply - Backports #20329 from @adriangb to the branch-52 line I simply cherry-picked 53b0ffb to the branch-52 branch ```shell andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion2$ git cherry-pick 53b0ffb Auto-merging datafusion/core/tests/physical_optimizer/partition_statistics.rs Auto-merging datafusion/datasource/src/file_scan_config.rs [alamb/backport_20329 9286563] fix: validate inter-file ordering in eq_properties() (#20329) Author: Adrian Garcia Badaracco <[email protected]> Date: Sat Feb 14 14:04:01 2026 -0500 5 files changed, 660 insertions(+), 47 deletions(-) ``` Co-authored-by: Adrian Garcia Badaracco <[email protected]> Co-authored-by: Claude Opus 4.6 <[email protected]>
Summary
Discovered this bug while working on #19724.
TLDR: just because the files themselves are sorted doesn't mean the partition streams are sorted.
eq_properties()inFileScanConfigblindly trustedoutput_ordering(set from Parquetsorting_columnsmetadata) without verifying that files within a group are in the correct inter-file orderEnforceSortingthen removedSortExecbased on this unvalidated ordering, producing wrong results when filesystem order didn't match data ordervalidated_output_ordering()that filters orderings usingMinMaxStatistics::new_from_files()+is_sorted()to verify inter-file sort order before reporting them to the optimizerChanges
datafusion/datasource/src/file_scan_config.rsvalidated_output_ordering()method onFileScanConfigthat validates each output ordering against actual file group statisticseq_properties()to callself.validated_output_ordering()instead ofself.output_ordering.clone()datafusion/sqllogictest/test_files/sort_pushdown.sltAdded 8 new regression tests (Tests 4-11):
WITH ORDER+ reversed filesystem orderWITH ORDER(positive)SortPreservingMergeExecmerges; no per-partition sort neededTest plan
cargo test --test sqllogictests -- sort_pushdown— all new + existing tests passcargo test -p datafusion-datasource— 97 unit tests + 6 doc tests passWITH ORDER) still eliminates SortExec (no regression)🤖 Generated with Claude Code