Support page skipping / page_index pushdown for evolved schemas #5268
alamb merged 4 commits into apache:master from
Conversation
.into_iter()
.map(|batch| {
    let mut output = NamedTempFile::new().expect("creating temp file");
    // Each batch writes to their own file
This change is so I can actually test evolved schemas with page indexes (aka write multiple files with different schemas)
    }
});
set

pub(crate) fn required_columns(&self) -> &RequiredStatColumns {
This is the core change -- need_input_columns_ids returns indexes in terms of the overall table schema. If an individual parquet file does not have all the columns or has the columns in a different order, these indexes are not correct
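To make the index mismatch concrete, here is a minimal stdlib-only sketch of the remapping idea (the function and schema representations are illustrative, not DataFusion's actual types): a column index valid for the table schema must be resolved by *name* against each file's own schema.

```rust
// Sketch: a table-schema column index is only meaningful for a given
// file after remapping it into that file's schema by column name.
fn remap_to_file_index(
    table_schema: &[&str],
    file_schema: &[&str],
    table_idx: usize,
) -> Option<usize> {
    let name = table_schema.get(table_idx)?;
    file_schema.iter().position(|c| c == name)
}

fn main() {
    let table = ["c1", "c2", "c3"];
    // a file storing (c3, c1): table index 0 ("c1") is file index 1
    let file_b = ["c3", "c1"];
    assert_eq!(remap_to_file_index(&table, &file_b, 0), Some(1));
    // "c2" is absent from this file entirely, so there is no valid index
    assert_eq!(remap_to_file_index(&table, &file_b, 1), None);
    println!("remapping ok");
}
```

Using the table-schema index directly (here, 0) would have pointed at `c3` in that file, which is exactly the class of bug described above.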
Thanks for the explanation! 👍
If an individual parquet file does not have all the columns or has the columns in a different order
I have a question: if file_a has columns (c1, c2) and file_b has (c3, c1), does DataFusion support creating an external table t(c1) over both file_a and file_b? 🤔
And if file_a has (c1, c2) and file_b has only (c3), is creating an external table t(c1) supported?
Do both parquet files have to contain metadata for c1?
I see both situations supported in the test below 😆
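For intuition on why both cases can work, here is a small hedged sketch of the usual schema-evolution behavior (a simplified model, not DataFusion's actual schema adapter): when a requested column is missing from a particular file, that file's rows surface the column as all NULL.

```rust
// Simplified model: reading column `name` from a file that may or may
// not contain it. Missing columns become NULLs for that file's rows.
fn read_column(
    file_columns: &[(&str, Vec<i32>)],
    name: &str,
    num_rows: usize,
) -> Vec<Option<i32>> {
    match file_columns.iter().find(|(c, _)| *c == name) {
        Some((_, vals)) => vals.iter().map(|v| Some(*v)).collect(),
        None => vec![None; num_rows], // column absent in this file: all NULL
    }
}

fn main() {
    let file_a = [("c1", vec![1, 2]), ("c2", vec![10, 20])];
    let file_b = [("c3", vec![7, 8])];
    assert_eq!(read_column(&file_a, "c1", 2), vec![Some(1), Some(2)]);
    // file_b has no c1, so a table t(c1) sees NULLs for its rows
    assert_eq!(read_column(&file_b, "c1", 2), vec![None, None]);
}
```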
    continue;
};
// find column index by looking in the row group metadata.
let col_idx = find_column_index(predicate, &groups[0]);
this calls the new per-file column index logic. I considered some more major rearranging of this code (for example, having it do the column index lookup in the pruning stats) but I felt this way was easiest to review and likely the most performant as well
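A minimal sketch of what a lookup like this does (the `RowGroupMeta` struct here is a stand-in for the real parquet row group metadata, and the signature is illustrative): resolve the predicate's column against the columns actually present in the file's metadata, returning `None` when the column does not exist so pruning can be skipped.

```rust
// Simplified model of a per-file column index lookup against row
// group metadata. Not the actual DataFusion/parquet types.
struct RowGroupMeta {
    column_names: Vec<String>, // stand-in for per-column chunk metadata
}

fn find_column_index(predicate_column: &str, group: &RowGroupMeta) -> Option<usize> {
    group
        .column_names
        .iter()
        .position(|c| c == predicate_column)
}

fn main() {
    let group = RowGroupMeta {
        column_names: vec!["c3".to_string(), "c1".to_string()],
    };
    assert_eq!(find_column_index("c1", &group), Some(1));
    // column missing from this file: no index, so no pruning on it
    assert_eq!(find_column_index("c2", &group), None);
    println!("lookup ok");
}
```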
async fn parquet_page_index_exec_metrics() {
    let c1: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(2)]));
    let c2: ArrayRef = Arc::new(Int32Array::from(vec![Some(3), Some(4), Some(5)]));
    let c1: ArrayRef = Arc::new(Int32Array::from(vec![
This was the only test that used the "merge multiple batches together" behavior of store_parquet -- so I rewrote the test to inline the batch creation and ensure we get evenly sized two-row pages
#[rustfmt::skip]
let expected = vec![
    "+-----+", "| int |", "+-----+", "| 3 |", "| 4 |", "| 5 |", "+-----+",
"+-----+",
this is different because previously the page layout was as follows:

Page 1: 1, None, 2
Page 2: 3, 4, 5

Now the page layout is:

Page 1: 1, None
Page 2: 2, 3
Page 3: 4, 5
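The layout change above can be sketched as simple chunking (illustrative only — the real parquet writer decides page boundaries, this just shows how the same six values split under 3-row vs 2-row pages):

```rust
// Split rows into pages of at most `rows_per_page` values each,
// mimicking the before/after page layouts described above.
fn pages(rows: &[Option<i32>], rows_per_page: usize) -> Vec<Vec<Option<i32>>> {
    rows.chunks(rows_per_page).map(|c| c.to_vec()).collect()
}

fn main() {
    let rows = vec![Some(1), None, Some(2), Some(3), Some(4), Some(5)];
    // old layout: two 3-row pages -> [1, None, 2], [3, 4, 5]
    assert_eq!(pages(&rows, 3).len(), 2);
    // new layout: three 2-row pages -> [1, None], [2, 3], [4, 5]
    let new = pages(&rows, 2);
    assert_eq!(new.len(), 3);
    assert_eq!(new[1], vec![Some(2), Some(3)]);
}
```

Even page boundaries matter here because page-level pruning keeps or skips whole pages, so the expected query output depends on which values share a page.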
/// For example, give the predicate `y > 5`
///
/// And columns in the RowGroupMetadata like `['x', 'y', 'z']` will
/// return 1.

Suggested change:

/// For example, given the predicate `y > 5` and columns in the
/// RowGroupMetadata like `['x', 'y', 'z']`, this will return 1.
|
@Ted-Jiang and @thinkharderdev I think I have finally fixed the bug with page index pushdown. I think @Dandandan had asked about the status of this project as well
// batch3 (has c1, c2) -- both columns, should still prune
let batch3 = create_batch(vec![("c1", c1.clone()), ("c2", c2.clone())]);

// batch4 (has c2, c1) -- different column order, should still prune
Co-authored-by: Yang Jiang <[email protected]>
|
Can we un-ignore
|
@Dandandan, I apologize, but I don't understand this request
That test does not appear to be ignored on master (nor in this PR). I only found one instance of this test in the repo, and I couldn't find any other
|
@alamb
Yes! 🎉 I verified that all CI passes with page index pushdown enabled by default (when this PR change is included). Check out #5099. I should have mentioned that. Sorry!
|
Benchmark runs are scheduled for baseline = d05647c and contender = ec24724. ec24724 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
…he#5268)

* Make the page index tests clearer about what they are doing
* Support page skipping / page_index pushdown for evolved schemas
* upate test
* Update datafusion/core/src/datasource/file_format/parquet.rs

Co-authored-by: Yang Jiang <[email protected]>

---------

Co-authored-by: Yang Jiang <[email protected]>
Which issue does this PR close?
Closes #5104
Rationale for this change
See #5104
I want to turn on page index pushdown for all queries. I can't do that if it causes errors for queries across parquet files whose schemas are not the same.
What changes are included in this PR?
Fix a bug (I will describe inline)
Are these changes tested?
Yes
IN PROGRESS -- testing against #5099
Are there any user-facing changes?
Fewer bugs