Skip to content

fix: validate inter-file ordering in eq_properties()#20329

Merged
adriangb merged 6 commits intoapache:mainfrom
pydantic:fix-validate-inter-file-ordering
Feb 14, 2026
Merged

fix: validate inter-file ordering in eq_properties()#20329
adriangb merged 6 commits intoapache:mainfrom
pydantic:fix-validate-inter-file-ordering

Conversation

@adriangb
Copy link
Contributor

@adriangb adriangb commented Feb 12, 2026

Summary

Discovered this bug while working on #19724.

TLDR: just because the files themselves are sorted doesn't mean the partition streams are sorted.

  • eq_properties() in FileScanConfig blindly trusted output_ordering (set from Parquet sorting_columns metadata) without verifying that files within a group are in the correct inter-file order
  • EnforceSorting then removed SortExec based on this unvalidated ordering, producing wrong results when filesystem order didn't match data order
  • Added validated_output_ordering() that filters orderings using MinMaxStatistics::new_from_files() + is_sorted() to verify inter-file sort order before reporting them to the optimizer

Changes

datafusion/datasource/src/file_scan_config.rs

  • Added validated_output_ordering() method on FileScanConfig that validates each output ordering against actual file group statistics
  • Changed eq_properties() to call self.validated_output_ordering() instead of self.output_ordering.clone()

datafusion/sqllogictest/test_files/sort_pushdown.slt

Added 8 new regression tests (Tests 4-11):

Test Scenario Key assertion
4 Reversed filesystem order (inferred ordering) SortExec retained — wrong inter-file order detected
5 Overlapping file ranges (inferred ordering) SortExec retained — overlapping ranges detected
6 WITH ORDER + reversed filesystem order SortExec retained despite explicit ordering
7 Correctly ordered multi-file group (positive) SortExec eliminated — validation passes
8 DESC ordering with wrong inter-file DESC order SortExec retained for DESC direction
9 Multi-column sort key (overlapping vs non-overlapping) Conservative rejection with overlapping stats; passes with clean boundaries
10 Correctly ordered + WITH ORDER (positive) SortExec eliminated — both ordering and stats agree
11 Multiple partitions (one file per group) SortPreservingMergeExec merges; no per-partition sort needed

Test plan

  • cargo test --test sqllogictests -- sort_pushdown — all new + existing tests pass
  • cargo test -p datafusion-datasource — 97 unit tests + 6 doc tests pass
  • Existing Test 1 (single-file sort pushdown with WITH ORDER) still eliminates SortExec (no regression)

🤖 Generated with Claude Code

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) datasource Changes to the datasource crate labels Feb 12, 2026
@adriangb
Copy link
Contributor Author

On main:

cargo test --test sqllogictests -- sort_pushdown 
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.18s
     Running bin/sqllogictests.rs (target/debug/deps/sqllogictests-cd2788d771c0e63f)
Completed 1 test files in 0 seconds                                                                                                External error: 9 errors in file /Users/adrian/GitHub/datafusion-clone/datafusion/sqllogictest/test_files/sort_pushdown.slt

1. query result mismatch:
[SQL] EXPLAIN SELECT * FROM reversed_parquet ORDER BY id ASC;
[Diff] (-expected|+actual)
    logical_plan
    01)Sort: reversed_parquet.id ASC NULLS LAST
    02)--TableScan: reversed_parquet projection=[id, value]
-   physical_plan
-   01)SortExec: expr=[id@0 ASC NULLS LAST], preserve_partitioning=[false]
-   02)--DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/reversed/a_high.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/reversed/b_mid.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/reversed/c_low.parquet]]}, projection=[id, value], file_type=parquet
+   physical_plan DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/reversed/a_high.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/reversed/b_mid.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/reversed/c_low.parquet]]}, projection=[id, value], file_type=parquet
at /Users/adrian/GitHub/datafusion-clone/datafusion/sqllogictest/test_files/sort_pushdown.slt:900


2. query result mismatch:
[SQL] SELECT * FROM reversed_parquet ORDER BY id ASC;
[Diff] (-expected|+actual)
-   1 100
-   2 200
-   3 300
+   7 700
+   8 800
+   9 900
    4 400
    5 500
    6 600
-   7 700
-   8 800
-   9 900
+   1 100
+   2 200
+   3 300
at /Users/adrian/GitHub/datafusion-clone/datafusion/sqllogictest/test_files/sort_pushdown.slt:911


3. query result mismatch:
[SQL] EXPLAIN SELECT * FROM overlap_parquet ORDER BY id ASC;
[Diff] (-expected|+actual)
    logical_plan
    01)Sort: overlap_parquet.id ASC NULLS LAST
    02)--TableScan: overlap_parquet projection=[id, value]
-   physical_plan
-   01)SortExec: expr=[id@0 ASC NULLS LAST], preserve_partitioning=[false]
-   02)--DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/overlap/file_x.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/overlap/file_y.parquet]]}, projection=[id, value], file_type=parquet
+   physical_plan DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/overlap/file_x.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/overlap/file_y.parquet]]}, projection=[id, value], file_type=parquet
at /Users/adrian/GitHub/datafusion-clone/datafusion/sqllogictest/test_files/sort_pushdown.slt:951


4. query result mismatch:
[SQL] SELECT * FROM overlap_parquet ORDER BY id ASC;
[Diff] (-expected|+actual)
    1 100
+   3 300
+   5 500
    2 200
-   3 300
    4 400
-   5 500
    6 600
at /Users/adrian/GitHub/datafusion-clone/datafusion/sqllogictest/test_files/sort_pushdown.slt:962


5. query result mismatch:
[SQL] EXPLAIN SELECT * FROM reversed_with_order_parquet ORDER BY id ASC;
[Diff] (-expected|+actual)
    logical_plan
    01)Sort: reversed_with_order_parquet.id ASC NULLS LAST
    02)--TableScan: reversed_with_order_parquet projection=[id, value]
-   physical_plan
-   01)SortExec: expr=[id@0 ASC NULLS LAST], preserve_partitioning=[false]
-   02)--DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/reversed/a_high.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/reversed/b_mid.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/reversed/c_low.parquet]]}, projection=[id, value], file_type=parquet
+   physical_plan DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/reversed/a_high.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/reversed/b_mid.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/reversed/c_low.parquet]]}, projection=[id, value], file_type=parquet
at /Users/adrian/GitHub/datafusion-clone/datafusion/sqllogictest/test_files/sort_pushdown.slt:984


6. query result mismatch:
[SQL] SELECT * FROM reversed_with_order_parquet ORDER BY id ASC;
[Diff] (-expected|+actual)
-   1 100
-   2 200
-   3 300
+   7 700
+   8 800
+   9 900
    4 400
    5 500
    6 600
-   7 700
-   8 800
-   9 900
+   1 100
+   2 200
+   3 300
at /Users/adrian/GitHub/datafusion-clone/datafusion/sqllogictest/test_files/sort_pushdown.slt:995


7. query result mismatch:
[SQL] EXPLAIN SELECT * FROM desc_reversed_parquet ORDER BY id DESC;
[Diff] (-expected|+actual)
    logical_plan
    01)Sort: desc_reversed_parquet.id DESC NULLS FIRST
    02)--TableScan: desc_reversed_parquet projection=[id, value]
-   physical_plan
-   01)SortExec: expr=[id@0 DESC], preserve_partitioning=[false]
-   02)--DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/desc_reversed/a_low.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/desc_reversed/b_high.parquet]]}, projection=[id, value], file_type=parquet
+   physical_plan DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/desc_reversed/a_low.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/desc_reversed/b_high.parquet]]}, projection=[id, value], file_type=parquet
at /Users/adrian/GitHub/datafusion-clone/datafusion/sqllogictest/test_files/sort_pushdown.slt:1123


8. query result mismatch:
[SQL] SELECT * FROM desc_reversed_parquet ORDER BY id DESC;
[Diff] (-expected|+actual)
+   3 300
+   2 200
+   1 100
    9 900
    8 800
-   7 700
-   3 300
-   2 200
-   1 100
+   7 700
at /Users/adrian/GitHub/datafusion-clone/datafusion/sqllogictest/test_files/sort_pushdown.slt:1134


9. query result mismatch:
[SQL] EXPLAIN SELECT * FROM multi_col_parquet ORDER BY category ASC, id ASC;
[Diff] (-expected|+actual)
    logical_plan
    01)Sort: multi_col_parquet.category ASC NULLS LAST, multi_col_parquet.id ASC NULLS LAST
    02)--TableScan: multi_col_parquet projection=[category, id, value]
-   physical_plan
-   01)SortExec: expr=[category@0 ASC NULLS LAST, id@1 ASC NULLS LAST], preserve_partitioning=[false]
-   02)--DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/multi_col/a_first.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/multi_col/b_second.parquet]]}, projection=[category, id, value], file_type=parquet
+   physical_plan DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/multi_col/a_first.parquet, WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/sort_pushdown/multi_col/b_second.parquet]]}, projection=[category, id, value], file_type=parquet
at /Users/adrian/GitHub/datafusion-clone/datafusion/sqllogictest/test_files/sort_pushdown.slt:1178

@adriangb adriangb marked this pull request as ready for review February 12, 2026 22:22
@github-actions github-actions bot added the core Core DataFusion crate label Feb 12, 2026
@adriangb adriangb requested a review from zhuqi-lucas February 12, 2026 22:22
@adriangb
Copy link
Contributor Author

@zhuqi-lucas or @xudong963 could you take a look at this?

adriangb and others added 3 commits February 12, 2026 17:24
…rect SortExec elimination

eq_properties() blindly trusted output_ordering from Parquet metadata without
verifying that files within a group are in the correct inter-file order. This
caused EnforceSorting to incorrectly remove SortExec when filesystem order
didn't match data order, producing wrong results.

Add validated_output_ordering() that filters output orderings using
MinMaxStatistics to verify inter-file sort order, and wire it into
eq_properties(). Add regression tests for reversed, overlapping, DESC,
multi-column, and correctly-ordered multi-file scenarios.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Copy link
Contributor

@zhuqi-lucas zhuqi-lucas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bug is real — eq_properties() uses raw output_ordering while get_projected_output_ordering() (L1362-1415) already validates with MinMax stats but is only used in DisplayAs. So EXPLAIN shows correct ordering, but the optimizer uses unvalidated ordering. Nice catch.
However, validated_output_ordering() duplicates the validation logic with subtle differences — get_projected_output_ordering() projects first then validates with ordered_column_indices_from_projection(), while this passes None for projection. Could we refactor to share the validation logic to avoid divergence?

And CI is failing on parquet_sorted_statistics.slt:267 — a constant_col query where all values are identical across files. In this case MinMaxStatistics can't prove inter-file ordering (min=max for every file → is_sorted() may reject it even though the output is trivially sorted since all values are equal).

This looks like a real regression: the fix is correct for the general case, but too conservative for constant columns. Might need to handle the min == max across all files as a special case, or check if is_sorted() treats equal ranges as sorted.

…ation logic

Use non-strict comparison (<=) in MinMaxStatistics::is_sorted() so that
constant columns (where all files have min=max) are correctly recognized
as sorted, avoiding unnecessary SortExec insertion.

Extract shared is_ordering_valid_for_file_groups() helper to eliminate
duplicated validation logic between validated_output_ordering() and
get_projected_output_ordering().

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@adriangb adriangb force-pushed the fix-validate-inter-file-ordering branch from 489c678 to 24952bd Compare February 13, 2026 17:46
@adriangb adriangb requested a review from zhuqi-lucas February 13, 2026 18:02
Copy link
Contributor

@zhuqi-lucas zhuqi-lucas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM now, thanks @adriangb !

Minor:
Maybe follow-up improvement for some refactor:

validated_output_ordering() duplicates the validation logic with subtle differences — get_projected_output_ordering() projects first then validates with ordered_column_indices_from_projection(), while this passes None for projection.

@adriangb adriangb added this pull request to the merge queue Feb 14, 2026
Merged via the queue into apache:main with commit 53b0ffb Feb 14, 2026
28 checks passed
@adriangb adriangb deleted the fix-validate-inter-file-ordering branch February 14, 2026 19:05
erratic-pattern pushed a commit to influxdata/arrow-datafusion that referenced this pull request Feb 19, 2026
Discovered this bug while working on apache#19724.

TLDR: just because the files themselves are sorted doesn't mean the
partition streams are sorted.

- **`eq_properties()` in `FileScanConfig` blindly trusted
`output_ordering`** (set from Parquet `sorting_columns` metadata)
without verifying that files within a group are in the correct
inter-file order
- `EnforceSorting` then removed `SortExec` based on this unvalidated
ordering, producing **wrong results** when filesystem order didn't match
data order
- Added `validated_output_ordering()` that filters orderings using
`MinMaxStatistics::new_from_files()` + `is_sorted()` to verify
inter-file sort order before reporting them to the optimizer

- Added `validated_output_ordering()` method on `FileScanConfig` that
validates each output ordering against actual file group statistics
- Changed `eq_properties()` to call `self.validated_output_ordering()`
instead of `self.output_ordering.clone()`

Added 8 new regression tests (Tests 4-11):

| Test | Scenario | Key assertion |
|------|----------|---------------|
| **4** | Reversed filesystem order (inferred ordering) | SortExec
retained — wrong inter-file order detected |
| **5** | Overlapping file ranges (inferred ordering) | SortExec
retained — overlapping ranges detected |
| **6** | `WITH ORDER` + reversed filesystem order | SortExec retained
despite explicit ordering |
| **7** | Correctly ordered multi-file group (positive) | SortExec
eliminated — validation passes |
| **8** | DESC ordering with wrong inter-file DESC order | SortExec
retained for DESC direction |
| **9** | Multi-column sort key (overlapping vs non-overlapping) |
Conservative rejection with overlapping stats; passes with clean
boundaries |
| **10** | Correctly ordered + `WITH ORDER` (positive) | SortExec
eliminated — both ordering and stats agree |
| **11** | Multiple partitions (one file per group) |
`SortPreservingMergeExec` merges; no per-partition sort needed |

- [x] `cargo test --test sqllogictests -- sort_pushdown` — all new +
existing tests pass
- [x] `cargo test -p datafusion-datasource` — 97 unit tests + 6 doc
tests pass
- [x] Existing Test 1 (single-file sort pushdown with `WITH ORDER`)
still eliminates SortExec (no regression)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
erratic-pattern pushed a commit to influxdata/arrow-datafusion that referenced this pull request Feb 19, 2026
Discovered this bug while working on apache#19724.

TLDR: just because the files themselves are sorted doesn't mean the
partition streams are sorted.

- **`eq_properties()` in `FileScanConfig` blindly trusted
`output_ordering`** (set from Parquet `sorting_columns` metadata)
without verifying that files within a group are in the correct
inter-file order
- `EnforceSorting` then removed `SortExec` based on this unvalidated
ordering, producing **wrong results** when filesystem order didn't match
data order
- Added `validated_output_ordering()` that filters orderings using
`MinMaxStatistics::new_from_files()` + `is_sorted()` to verify
inter-file sort order before reporting them to the optimizer

- Added `validated_output_ordering()` method on `FileScanConfig` that
validates each output ordering against actual file group statistics
- Changed `eq_properties()` to call `self.validated_output_ordering()`
instead of `self.output_ordering.clone()`

Added 8 new regression tests (Tests 4-11):

| Test | Scenario | Key assertion |
|------|----------|---------------|
| **4** | Reversed filesystem order (inferred ordering) | SortExec
retained — wrong inter-file order detected |
| **5** | Overlapping file ranges (inferred ordering) | SortExec
retained — overlapping ranges detected |
| **6** | `WITH ORDER` + reversed filesystem order | SortExec retained
despite explicit ordering |
| **7** | Correctly ordered multi-file group (positive) | SortExec
eliminated — validation passes |
| **8** | DESC ordering with wrong inter-file DESC order | SortExec
retained for DESC direction |
| **9** | Multi-column sort key (overlapping vs non-overlapping) |
Conservative rejection with overlapping stats; passes with clean
boundaries |
| **10** | Correctly ordered + `WITH ORDER` (positive) | SortExec
eliminated — both ordering and stats agree |
| **11** | Multiple partitions (one file per group) |
`SortPreservingMergeExec` merges; no per-partition sort needed |

- [x] `cargo test --test sqllogictests -- sort_pushdown` — all new +
existing tests pass
- [x] `cargo test -p datafusion-datasource` — 97 unit tests + 6 doc
tests pass
- [x] Existing Test 1 (single-file sort pushdown with `WITH ORDER`)
still eliminates SortExec (no regression)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
kosiew added a commit to kosiew/datafusion that referenced this pull request Feb 23, 2026
kosiew added a commit to kosiew/datafusion that referenced this pull request Feb 23, 2026
alamb pushed a commit to alamb/datafusion that referenced this pull request Feb 23, 2026
## Summary

Discovered this bug while working on apache#19724.

TLDR: just because the files themselves are sorted doesn't mean the
partition streams are sorted.

- **`eq_properties()` in `FileScanConfig` blindly trusted
`output_ordering`** (set from Parquet `sorting_columns` metadata)
without verifying that files within a group are in the correct
inter-file order
- `EnforceSorting` then removed `SortExec` based on this unvalidated
ordering, producing **wrong results** when filesystem order didn't match
data order
- Added `validated_output_ordering()` that filters orderings using
`MinMaxStatistics::new_from_files()` + `is_sorted()` to verify
inter-file sort order before reporting them to the optimizer

## Changes

### `datafusion/datasource/src/file_scan_config.rs`
- Added `validated_output_ordering()` method on `FileScanConfig` that
validates each output ordering against actual file group statistics
- Changed `eq_properties()` to call `self.validated_output_ordering()`
instead of `self.output_ordering.clone()`

### `datafusion/sqllogictest/test_files/sort_pushdown.slt`
Added 8 new regression tests (Tests 4-11):

| Test | Scenario | Key assertion |
|------|----------|---------------|
| **4** | Reversed filesystem order (inferred ordering) | SortExec
retained — wrong inter-file order detected |
| **5** | Overlapping file ranges (inferred ordering) | SortExec
retained — overlapping ranges detected |
| **6** | `WITH ORDER` + reversed filesystem order | SortExec retained
despite explicit ordering |
| **7** | Correctly ordered multi-file group (positive) | SortExec
eliminated — validation passes |
| **8** | DESC ordering with wrong inter-file DESC order | SortExec
retained for DESC direction |
| **9** | Multi-column sort key (overlapping vs non-overlapping) |
Conservative rejection with overlapping stats; passes with clean
boundaries |
| **10** | Correctly ordered + `WITH ORDER` (positive) | SortExec
eliminated — both ordering and stats agree |
| **11** | Multiple partitions (one file per group) |
`SortPreservingMergeExec` merges; no per-partition sort needed |

## Test plan

- [x] `cargo test --test sqllogictests -- sort_pushdown` — all new +
existing tests pass
- [x] `cargo test -p datafusion-datasource` — 97 unit tests + 6 doc
tests pass
- [x] Existing Test 1 (single-file sort pushdown with `WITH ORDER`)
still eliminates SortExec (no regression)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
alamb added a commit that referenced this pull request Feb 24, 2026
) (#20509)

- Part of #20287
- Closes #20508 on branch-52

This PR simply
- Backports #20329 from
@adriangb to the branch-52 line


I simply cherry-picked 53b0ffb to the branch-52 branch

```shell
andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion2$ git cherry-pick 53b0ffb
Auto-merging datafusion/core/tests/physical_optimizer/partition_statistics.rs
Auto-merging datafusion/datasource/src/file_scan_config.rs
[alamb/backport_20329 9286563] fix: validate inter-file ordering in eq_properties() (#20329)
 Author: Adrian Garcia Badaracco <[email protected]>
 Date: Sat Feb 14 14:04:01 2026 -0500
 5 files changed, 660 insertions(+), 47 deletions(-)
```

Co-authored-by: Adrian Garcia Badaracco <[email protected]>
Co-authored-by: Claude Opus 4.6 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate datasource Changes to the datasource crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Invalid inter-file ordering in eq_properties()

2 participants