Support page skipping / page_index pushdown for evolved schemas #5268
alamb merged 4 commits into apache:master from
Conversation
.into_iter()
.map(|batch| {
    let mut output = NamedTempFile::new().expect("creating temp file");
    // Each batch writes to their own file
This change is so I can actually test evolved schemas with page indexes (aka write multiple files with different schemas)
    }
});
set

pub(crate) fn required_columns(&self) -> &RequiredStatColumns {
This is the core change -- need_input_columns_ids returns indexes in terms of the overall table schema. If an individual parquet file does not have all the columns or has the columns in a different order, these indexes are not correct
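To make the index mismatch concrete, here is a minimal stdlib-only sketch of the remapping idea (the function and schema representations are illustrative, not DataFusion's actual types): a column index valid for the table schema must be resolved by *name* against each file's own schema.

```rust
// Sketch: a table-schema column index is only meaningful for a given
// file after remapping it into that file's schema by column name.
fn remap_to_file_index(
    table_schema: &[&str],
    file_schema: &[&str],
    table_idx: usize,
) -> Option<usize> {
    let name = table_schema.get(table_idx)?;
    file_schema.iter().position(|c| c == name)
}

fn main() {
    let table = ["c1", "c2", "c3"];
    // a file storing (c3, c1): table index 0 ("c1") is file index 1
    let file_b = ["c3", "c1"];
    assert_eq!(remap_to_file_index(&table, &file_b, 0), Some(1));
    // "c2" is absent from this file entirely, so there is no valid index
    assert_eq!(remap_to_file_index(&table, &file_b, 1), None);
    println!("remapping ok");
}
```

Using the table-schema index directly (here, 0) would have pointed at `c3` in that file, which is exactly the class of bug described above.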
Thanks for the explanation! 👍
If an individual parquet file does not have all the columns or has the columns in a different order
I have a question: if file_a has columns (c1, c2) and file_b has (c3, c1), does DataFusion support creating an external table t(c1) over both file_a and file_b? 🤔
And if file_a has (c1, c2) and file_b has only (c3), is creating an external table t(c1) supported?
Do both parquet files have to contain metadata for c1?
I see both situations supported in the test below 😆
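For intuition on why both cases can work, here is a small hedged sketch of the usual schema-evolution behavior (a simplified model, not DataFusion's actual schema adapter): when a requested column is missing from a particular file, that file's rows surface the column as all NULL.

```rust
// Simplified model: reading column `name` from a file that may or may
// not contain it. Missing columns become NULLs for that file's rows.
fn read_column(
    file_columns: &[(&str, Vec<i32>)],
    name: &str,
    num_rows: usize,
) -> Vec<Option<i32>> {
    match file_columns.iter().find(|(c, _)| *c == name) {
        Some((_, vals)) => vals.iter().map(|v| Some(*v)).collect(),
        None => vec![None; num_rows], // column absent in this file: all NULL
    }
}

fn main() {
    let file_a = [("c1", vec![1, 2]), ("c2", vec![10, 20])];
    let file_b = [("c3", vec![7, 8])];
    assert_eq!(read_column(&file_a, "c1", 2), vec![Some(1), Some(2)]);
    // file_b has no c1, so a table t(c1) sees NULLs for its rows
    assert_eq!(read_column(&file_b, "c1", 2), vec![None, None]);
}
```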
    continue;
};
// find column index by looking in the row group metadata.
let col_idx = find_column_index(predicate, &groups[0]);
this calls the new per-file column index logic. I considered some more major rearranging of this code (for example, having it do the column index lookup in the pruning stats) but I felt this way was easiest to review and likely the most performant as well
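A minimal sketch of what a lookup like this does (the `RowGroupMeta` struct here is a stand-in for the real parquet row group metadata, and the signature is illustrative): resolve the predicate's column against the columns actually present in the file's metadata, returning `None` when the column does not exist so pruning can be skipped.

```rust
// Simplified model of a per-file column index lookup against row
// group metadata. Not the actual DataFusion/parquet types.
struct RowGroupMeta {
    column_names: Vec<String>, // stand-in for per-column chunk metadata
}

fn find_column_index(predicate_column: &str, group: &RowGroupMeta) -> Option<usize> {
    group
        .column_names
        .iter()
        .position(|c| c == predicate_column)
}

fn main() {
    let group = RowGroupMeta {
        column_names: vec!["c3".to_string(), "c1".to_string()],
    };
    assert_eq!(find_column_index("c1", &group), Some(1));
    // column missing from this file: no index, so no pruning on it
    assert_eq!(find_column_index("c2", &group), None);
    println!("lookup ok");
}
```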
async fn parquet_page_index_exec_metrics() {
    let c1: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(2)]));
    let c2: ArrayRef = Arc::new(Int32Array::from(vec![Some(3), Some(4), Some(5)]));
    let c1: ArrayRef = Arc::new(Int32Array::from(vec![
This was the only test that used the "merge multiple batches together" behavior of store_parquet -- so I rewrote the test to inline the batch creation and ensure we get evenly sized two-row pages
#[rustfmt::skip]
let expected = vec![
    "+-----+", "| int |", "+-----+", "| 3 |", "| 4 |", "| 5 |", "+-----+",
"+-----+",
this is different because previously the page layout was as follows:

Page 1: 1, None, 2
Page 2: 3, 4, 5

Now the page layout is:

Page 1: 1, None
Page 2: 2, 3
Page 3: 4, 5
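The layout change above can be sketched as simple chunking (illustrative only — the real parquet writer decides page boundaries, this just shows how the same six values split under 3-row vs 2-row pages):

```rust
// Split rows into pages of at most `rows_per_page` values each,
// mimicking the before/after page layouts described above.
fn pages(rows: &[Option<i32>], rows_per_page: usize) -> Vec<Vec<Option<i32>>> {
    rows.chunks(rows_per_page).map(|c| c.to_vec()).collect()
}

fn main() {
    let rows = vec![Some(1), None, Some(2), Some(3), Some(4), Some(5)];
    // old layout: two 3-row pages -> [1, None, 2], [3, 4, 5]
    assert_eq!(pages(&rows, 3).len(), 2);
    // new layout: three 2-row pages -> [1, None], [2, 3], [4, 5]
    let new = pages(&rows, 2);
    assert_eq!(new.len(), 3);
    assert_eq!(new[1], vec![Some(2), Some(3)]);
}
```

Even page boundaries matter here because page-level pruning keeps or skips whole pages, so the expected query output depends on which values share a page.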
/// For example, give the predicate `y > 5`
///
/// And columns in the RowGroupMetadata like `['x', 'y', 'z']` will
/// return 1.

Suggested change:

/// For example, given the predicate `y > 5` and columns in the
/// RowGroupMetadata like `['x', 'y', 'z']`, this will return 1.
|
@Ted-Jiang and @thinkharderdev I think I have finally fixed the bug with page index pushdown. I think @Dandandan had asked about the status of this project as well
// batch3 (has c1, c2) -- both columns, should still prune
let batch3 = create_batch(vec![("c1", c1.clone()), ("c2", c2.clone())]);

// batch4 (has c2, c1) -- different column order, should still prune
Co-authored-by: Yang Jiang <[email protected]>
|
Can we un-ignore
|
@Dandandan, I apologize, but I don't understand this request
That test does not appear to be ignored on master (nor in this PR). I only found one instance of this test in the repo, and I couldn't find any other
|
@alamb
Yes! 🎉 I verified that all CI passes with page index pushdown enabled by default (when this PR change is included). Check out #5099. I should have mentioned that. Sorry!
|
Benchmark runs are scheduled for baseline = d05647c and contender = ec24724. ec24724 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
…he#5268)

* Make the page index tests clearer about what they are doing
* Support page skipping / page_index pushdown for evolved schemas
* upate test
* Update datafusion/core/src/datasource/file_format/parquet.rs

Co-authored-by: Yang Jiang <[email protected]>

---------

Co-authored-by: Yang Jiang <[email protected]>
Which issue does this PR close?
Closes #5104
Rationale for this change
See #5104
I want to turn on page index pushdown for all queries. I can't do that if it causes errors for queries across parquet files whose schemas are not the same.
What changes are included in this PR?
Fix a bug (I will describe inline)
Are these changes tested?
Yes
IN PROGRESS -- testing against #5099
Are there any user-facing changes?
Fewer bugs