Skip to content

Prune columns are all null in ParquetExec by row_counts , handle IS NOT NULL#9989

Merged
alamb merged 7 commits intoapache:mainfrom
Ted-Jiang:issue-9961
Apr 10, 2024
Merged

Prune columns are all null in ParquetExec by row_counts , handle IS NOT NULL#9989
alamb merged 7 commits intoapache:mainfrom
Ted-Jiang:issue-9961

Conversation

@Ted-Jiang
Copy link
Member

…atistics

Which issue does this PR close?

Part #9961 .

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Apr 8, 2024
column_statistics: Vec<ParquetStatistics>,
) -> RowGroupMetaData {
let mut columns = vec![];
let number_row = 1000;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before all unit test set each col with default 0 row, will trigger num_rows == num_nulls

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if this could cause problems in real files (for example, if the row counts were not included in the statistics that were written into the file).

However, I double checked the code and it seems like ColumnChunkMetaData::num_values is non nullable so I think we are good.

@Ted-Jiang
Copy link
Member Author

@alamb @waynexia PTAL

@Ted-Jiang Ted-Jiang requested review from alamb and waynexia April 8, 2024 03:50
Copy link
Member

@waynexia waynexia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me 👍 Thanks @Ted-Jiang

None
fn row_counts(&self, column: &Column) -> Option<ArrayRef> {
let (c, _) = self.column(&column.name)?;
let scalar = ScalarValue::UInt64(Some(c.num_values() as u64));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

@Ted-Jiang Ted-Jiang Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb
Copy link
Contributor

alamb commented Apr 8, 2024

Thanks @Ted-Jiang -- this looks awesome -- I look forward to reviewing it later today

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Ted-Jiang -- and @waynexia . I think the code looks good, though I think a few more tests would improve the coverage we could add them as a follow on PR.

Do you plan to add support in page_filter.rs as well (maybe that is why the PR is marked "Part #9961 ")?

column_statistics: Vec<ParquetStatistics>,
) -> RowGroupMetaData {
let mut columns = vec![];
let number_row = 1000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering if this could cause problems in real files (for example, if the row counts were not included in the statistics that were written into the file).

However, I double checked the code and it seems like ColumnChunkMetaData::num_values is non nullable so I think we are good.

// After pruning, only row group 1,3 should be selected
RowGroupPruningTest::new()
.with_scenario(Scenario::AllNullValues)
.with_query("SELECT * FROM t WHERE \"i8\" is Null")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add a tests:

  1. i16 IS NOT NULL (to cover the opposite)
  2. i32 > 7 (prune via nulls and some via counts)

Copy link
Member Author

@Ted-Jiang Ted-Jiang Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb thanks! Add test in 11567d9, and support the isNotNull

Do you plan to add support in page_filter.rs as well (maybe that is why the PR is marked "Part
#9961 ")?

As the page level prune i prefer in next pr to keep this pr short and clean.

None
fn row_counts(&self, column: &Column) -> Option<ArrayRef> {
let (c, _) = self.column(&column.name)?;
let scalar = ScalarValue::UInt64(Some(c.num_values() as u64));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb alamb changed the title Prune columns are all null in ParquetExec by row_counts in pruning st… Prune columns are all null in ParquetExec by row_counts , handle IS NOT NULL Apr 10, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Ted-Jiang -- the support for IS NOT NULL is also quite neat.

@alamb alamb merged commit ed37467 into apache:main Apr 10, 2024
@Ted-Jiang
Copy link
Member Author

Ted-Jiang commented Apr 12, 2024

Thanks @Ted-Jiang -- the support for IS NOT NULL is also quite neat.

@alamb sorry there is a bug here i will fix this it should

/// If set `with_not` to true: which means is not null
/// datafusion use false flag to prune unit (row group, page ..)
/// Given an expression reference to `expr`, if `expr` is a column expression,
/// returns a pruning expression in terms of IsNotNull that will evaluate to true
/// if the column may contain any nonnull values, and false if definitely does not contain
/// nonnull values which any null values.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants