Prune pages are all null in ParquetExec by row_counts and fix NOT NULL prune#10051
Prune pages are all null in ParquetExec by row_counts and fix NOT NULL prune#10051Ted-Jiang merged 10 commits intoapache:mainfrom
Conversation
and fix NOT NULL prune
| None | ||
| // see https://github.com/apache/arrow-rs/blob/91f0b1771308609ca27db0fb1d2d49571b3980d8/parquet/src/file/metadata.rs#L979-L982 | ||
| let mut first_row_index = self | ||
| .col_offset_indexes |
There was a problem hiding this comment.
I think you can do this same calculation without allocating intermediate vec s-- something like this:
let row_count_per_page = self
.col_offset_indexes
.windows(2)
.map(|location| Some(location[1].first_row_index - location[0].first_row_index));
Some(Arc::new(Int64Array::from_iter(row_count_per_page)))🤔 the name col_offset_indexes is somewhat confusing to me as they are PageLocations --
col_offset_indexes: &'a Vec<PageLocation>,
maybe we could rename that field to page_locations 🤔
alamb
left a comment
There was a problem hiding this comment.
Thanks @Ted-Jiang -- this is looking good. I think it could be merged as is, though I also have several suggestions that I think will make the code and comments better. We can do so as a follow on PR though.
| @@ -824,9 +863,17 @@ fn create_data_batch(scenario: Scenario) -> Vec<RecordBatch> { | |||
| } | |||
| Scenario::WithNullValues => { | |||
There was a problem hiding this comment.
Could you please also add the "some null/some not null" null values case to this scenario as well?
make_int_batches_with_null(5, 1, 6),
| } | ||
| Scenario::WithNullValuesPageLevel => { | ||
| vec![ | ||
| make_int_batches_with_null(5, 1, 6), |
There was a problem hiding this comment.
I think this scenario should have at least one batch of all nulls (e.g. make_int_batches_with_nulls(5, 0, 0))
There was a problem hiding this comment.
i think this will be prune by row group -> column level
| no_null_values_start: usize, | ||
| no_null_values_end: usize, |
There was a problem hiding this comment.
I think you might be able to simplify this method by sending in 2 parameters. Right now it looks like it interleaves nulls arbitrarily but really the nulls are always at the start and non nulls are at the end
num_nulls: usize,
non_nulls: usize| // row_group4: page1(1~5), page2(All Null) | ||
| // total 40 rows | ||
| async fn test_pages_with_null_values() { | ||
| test_prune( |
There was a problem hiding this comment.
It took me a while to convince myself that this was actually setting up the scenario as described. I eventually found it here:
I wonder if it would be possible to add some better comments to test_prune mentioning that the created parquet files have 5 rows per page
There was a problem hiding this comment.
i will make another pr to improve this tests
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
|
Thanks again @Ted-Jiang -- I took another look through this PR and it is very nice 👌 |
…L prune (apache#10051) * Prune pages are all null in ParquetExec by row_counts and fix NOT NULL prune * fix clippy * Update datafusion/core/src/physical_optimizer/pruning.rs Co-authored-by: Andrew Lamb <[email protected]> * Update datafusion/core/tests/parquet/page_pruning.rs Co-authored-by: Andrew Lamb <[email protected]> * Update datafusion/core/tests/parquet/page_pruning.rs Co-authored-by: Andrew Lamb <[email protected]> * Update datafusion/core/tests/parquet/page_pruning.rs Co-authored-by: Andrew Lamb <[email protected]> * Update datafusion/core/tests/parquet/page_pruning.rs Co-authored-by: Andrew Lamb <[email protected]> * remove allocate vec * better way avoid allocate vec * simply expr --------- Co-authored-by: Andrew Lamb <[email protected]>
and fix NOT NULL prune
Which issue does this PR close?
Closes #9961
Closes #10049.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?