PARQUET-161: Statistics should be written for column chunks that are all null #103
saucam wants to merge 1 commit into apache:master
…re not null
2. The Statistics object should be marked non-empty when null values are written
3. Keep a boolean in the object to track whether any non-null value is present
Thanks for the PR! Can this be accomplished with a single change to the superclass, something like this: I don't see why it needs to be in each subclass, or why we need an extra variable to track this. If we can just make isEmpty() accurate, we should be fine everywhere else, right? Am I missing something? Thanks again!
Hello Alex, I tried precisely the change you mentioned in your pull request. The problem shows up in BinaryStatistics.java: writing a null value makes the isEmpty() check return false, so when the next valid value is written, public void updateStats(Binary value) is called, which in turn calls updateStats(value, value). At that point min/max have not been initialized, and we get an NPE again. That's why BinaryStatistics.java needs this three-state information: an empty column, a column with no valid values (all nulls), and a column with at least one valid value.
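The three states described above can be illustrated with a minimal sketch. This is not the parquet-mr code: the class name BinaryStatsSketch, the flag name hasNonNullValue, and the use of String in place of Binary are all assumptions made to keep the example self-contained.

```java
// Hypothetical sketch; not the actual parquet-mr BinaryStatistics class.
// String stands in for Binary so the example compiles on its own.
final class BinaryStatsSketch {
    private String min;               // stays null until the first non-null value arrives
    private String max;
    private long numNulls;
    private boolean hasNonNullValue;  // assumed name for the extra flag

    // Called when a null value is written.
    void incrementNumNulls() {
        numNulls++;
    }

    // Called when a non-null value is written.
    void updateStats(String value) {
        if (!hasNonNullValue) {
            // First real value: initialize min/max instead of comparing
            // against fields that are still null (the NPE described above).
            min = value;
            max = value;
            hasNonNullValue = true;
        } else {
            if (value.compareTo(min) < 0) min = value;
            if (value.compareTo(max) > 0) max = value;
        }
    }

    // State 1: nothing written at all.
    boolean isEmpty() {
        return !hasNonNullValue && numNulls == 0;
    }

    // Distinguishes state 2 (only nulls) from state 3 (at least one value).
    boolean hasNonNullValue() {
        return hasNonNullValue;
    }

    long getNumNulls() { return numNulls; }
    String getMin() { return min; }
    String getMax() { return max; }
}
```

With this shape, writing a null followed by a valid value takes the initialization branch rather than the comparison branch, because the branch depends on the flag and not on isEmpty().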
@saucam But we can still track that in the superclass, right?
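One way to read this suggestion is sketched below: the null count and the non-null flag live in a shared base class, and each subclass only decides when to set the flag. The names StatsBaseSketch, IntStatsSketch, and markAsNotEmpty are hypothetical and chosen for illustration; they are not taken from the patch.

```java
// Hypothetical sketch of tracking the state once, in the base class.
abstract class StatsBaseSketch {
    private long numNulls;
    private boolean hasNonNullValue;

    final void incrementNumNulls() { numNulls++; }
    final long getNumNulls() { return numNulls; }

    // True once any subclass has recorded a non-null value.
    final boolean hasNonNullValue() { return hasNonNullValue; }

    // Empty only if neither nulls nor values have been written.
    final boolean isEmpty() { return !hasNonNullValue && numNulls == 0; }

    // Subclasses call this after initializing their own min/max.
    protected final void markAsNotEmpty() { hasNonNullValue = true; }
}

// A subclass keeps only its typed min/max and reuses the shared state.
final class IntStatsSketch extends StatsBaseSketch {
    private int min;
    private int max;

    void updateStats(int value) {
        if (!hasNonNullValue()) {
            min = value;
            max = value;
            markAsNotEmpty();
        } else {
            min = Math.min(min, value);
            max = Math.max(max, value);
        }
    }

    int getMin() { return min; }
    int getMax() { return max; }
}
```

The design choice here is that no subclass needs its own boolean; the base class answers both "was anything written?" and "was any non-null value written?".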
Done!
Test cases are still to be added; verified with the spark-sql scenario mentioned in PARQUET-136.