PARQUET-161: Statistics should be written for column chunks that are all null by saucam · Pull Request #103 · apache/parquet-java

saucam · 2015-01-09T13:45:10Z

1. The Statistics object should be marked non-empty when null values are written.
2. Keep a boolean in the object to track whether any non-null value has been written.

Test cases are still to be added; verified with the spark-sql scenario mentioned in PARQUET-136.
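To make the two points above concrete, here is a minimal sketch of a statistics object that counts nulls and keeps a separate flag for non-null values. The class NullAwareLongStats and its method names are hypothetical stand-ins for illustration, not the actual parquet-java classes or the code in this PR.

```java
// Hypothetical stand-in for a column-chunk statistics object; not the parquet-java class.
class NullAwareLongStats {
    private long min;
    private long max;
    private long numNulls = 0;
    // True once at least one non-null value has been recorded.
    private boolean hasNonNullValue = false;

    void incrementNumNulls() {
        numNulls++;
    }

    void updateStats(long value) {
        if (!hasNonNullValue) {
            // First non-null value: initialize min and max.
            min = value;
            max = value;
            hasNonNullValue = true;
        } else {
            min = Math.min(min, value);
            max = Math.max(max, value);
        }
    }

    // Empty means nothing at all was written, so an all-null chunk is not empty.
    boolean isEmpty() {
        return !hasNonNullValue && numNulls == 0;
    }

    boolean hasNonNullValue() {
        return hasNonNullValue;
    }

    long getNumNulls() {
        return numNulls;
    }
}

class NullAwareLongStatsDemo {
    public static void main(String[] args) {
        NullAwareLongStats stats = new NullAwareLongStats();
        stats.incrementNumNulls();
        stats.incrementNumNulls();
        System.out.println(stats.isEmpty());         // false: nulls were written
        System.out.println(stats.hasNonNullValue()); // false: no min/max yet
    }
}
```

With this shape, a writer could still emit statistics (the null count) for an all-null chunk, while a reader can tell that min/max carry no information.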

isnotinvain · 2015-01-09T21:04:22Z

Thanks for the PR!

Can this be accomplished by only making a single change to the super class, something like this:
https://github.com/isnotinvain/incubator-parquet-mr/compare/alexlevenson/PARQUET-161

I don't see why this needs to be in each subclass, or why we need an extra variable to track it. If we can just make isEmpty() accurate, we should be fine everywhere else, right? Am I missing something?

Thanks again!
Alex

saucam · 2015-01-10T02:57:24Z

Hello Alex,

I tried precisely the change you mentioned in your pull request. The problem shows up in BinaryStatistics.java: writing a null value makes the isEmpty() check return false, so when the next valid value is written,

public void updateStats(Binary value) {

gets called, which calls

updateStats(value, value);

at which point min/max have not been initialized, and we get an NPE again.

That's why BinaryStatistics.java needs three-state information: an empty column, a column with no valid values (all nulls), and a column with at least one valid value.
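The failure mode described above can be reproduced with a small sketch. It uses String in place of Binary so it stays self-contained; NaiveStringStats and its members are hypothetical and only mimic the guard-on-isEmpty() pattern, they are not the real BinaryStatistics.

```java
// Hypothetical string statistics that mimic the failure mode; not the real BinaryStatistics.
class NaiveStringStats {
    private String min;
    private String max;
    private long numNulls = 0;
    private boolean sawValue = false;

    void incrementNumNulls() {
        numNulls++;
    }

    // "Accurate" isEmpty(): false as soon as anything, even a null, is written.
    boolean isEmpty() {
        return !sawValue && numNulls == 0;
    }

    void updateStats(String value) {
        if (isEmpty()) {
            // Nothing written so far: initialize min/max.
            min = value;
            max = value;
        } else {
            // If only nulls were written before, min and max are still null here.
            if (min.compareTo(value) > 0) min = value;
            if (max.compareTo(value) < 0) max = value;
        }
        sawValue = true;
    }
}

class NaiveStringStatsDemo {
    // Returns true if writing a value after a null throws NullPointerException.
    static boolean failsAfterNull() {
        NaiveStringStats stats = new NaiveStringStats();
        stats.incrementNumNulls();
        try {
            stats.updateStats("a");
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("NPE after null: " + failsAfterNull());
    }
}
```

The two-state isEmpty() check cannot distinguish "only nulls so far" from "min/max already initialized", which is the gap the third state fills.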

isnotinvain · 2015-01-11T03:52:33Z

@saucam But we can still track that in the super class, right?
E.g., in the PR I sent you, we could just change this line in BinaryStatistics (and the others):
if (this.isEmpty()) {
to
if (!this.hasNonNullValue()) {
right?
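A rough sketch of that suggestion, assuming the bookkeeping lives once in a base class and each subclass only changes its guard. BaseStats, StringStats, and markAsNotEmpty() are hypothetical names chosen for illustration, and String again stands in for Binary.

```java
// Hypothetical base class: the null/non-null bookkeeping lives here once.
abstract class BaseStats {
    private long numNulls = 0;
    private boolean hasNonNullValue = false;

    void incrementNumNulls() {
        numNulls++;
    }

    long getNumNulls() {
        return numNulls;
    }

    boolean hasNonNullValue() {
        return hasNonNullValue;
    }

    // Empty only when neither nulls nor values were written.
    boolean isEmpty() {
        return !hasNonNullValue && numNulls == 0;
    }

    protected void markAsNotEmpty() {
        hasNonNullValue = true;
    }
}

// Subclass keeps only min/max and guards on hasNonNullValue() instead of isEmpty().
class StringStats extends BaseStats {
    private String min;
    private String max;

    void updateStats(String value) {
        if (!hasNonNullValue()) {
            // First non-null value, even if nulls came before it.
            min = value;
            max = value;
            markAsNotEmpty();
        } else {
            if (min.compareTo(value) > 0) min = value;
            if (max.compareTo(value) < 0) max = value;
        }
    }

    String getMin() {
        return min;
    }

    String getMax() {
        return max;
    }
}
```

Under these assumptions the three states fall out of two fields in the base class: empty (no nulls, no values), all nulls (nulls only), and at least one valid value.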

julienledem · 2015-01-30T00:34:13Z

@saucam: Is this PR still applicable after #99?
If not please close it.

saucam · 2015-01-30T01:53:16Z

done !

PARQUET-161: 1. Statistics should be written for column chunks that a…

4b41d0e

…re not null 2. Statistics object should be marked non empty in case null values are written 3. Keep a boolean in the object to identify presence of non-null values

isnotinvain mentioned this pull request Jan 9, 2015

PARQUET-136: NPE thrown in StatisticsFilter when all values in a string/binary column trunk are null #99

Closed

saucam closed this Jan 30, 2015

saucam deleted the write_nulls branch January 30, 2015 01:53

saucam wants to merge 1 commit into apache:master from saucam:write_nulls