PARQUET-1685: Truncate Min/Max for Statistics#696
Conversation
gszadovszky
left a comment
There was a problem hiding this comment.
I would not add the truncate mechanism to the Statistics objects (see details in the code comments). ParquetMetadataConverter might be a better place.
parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java
Outdated
Show resolved
Hide resolved
parquet-column/src/main/java/org/apache/parquet/column/statistics/BinaryStatistics.java
Outdated
Show resolved
Hide resolved
|
Yes. Moved it to ParquetMetadataConverter for the write path only. |
| return formatStats; | ||
| } | ||
|
|
||
| public static void setStatisticsTruncateLength(int statisticsTruncateLength) { |
There was a problem hiding this comment.
This is not correct this way. Writer has a constructor argument so you shall not set here statically. I would suggest creating a new toParquetStatistics method with this argument instead.
There was a problem hiding this comment.
Yes, I know setting it here is a little hacky. The reasons for that are: 1) ParquetMetadataConverter is created as static inside ParquetFileWriter(don't' understand why). ParquetFileWriter has no chance to pass the parameter that it gets from it's constructor. 2) 'toParquetStatistics()' is defined as public static, so the constructor parameter doesn't work for it. But you suggested rewriting my own toParquetStatistics(). The concern is code duplicate.
The #2 is less concern that I can refactor the code a bit to reduce the duplicate. Any suggestions for #1? Maybe change metadataConverter to non-static? Concerns?
There was a problem hiding this comment.
Just created another iteration by changing the metadataConverter to non-static and create my own 'toParquetStatistics'. Please have a look.
There was a problem hiding this comment.
Sorry for the late response. I didn't want to recommend refactoring the usage of MetadataConverter for your change. I've just ment to have the truncateLength as a parameter of toParquetStatistics. You may keep the original method header as well so you do not need to rewrite all the calls only the ones are required for this change.
| * @param pageWriteChecksumEnabled whether to write out page level checksums | ||
| * @throws IOException if the file can not be created | ||
| */ | ||
| public ParquetFileWriter(OutputFile file, MessageType schema, Mode mode, |
There was a problem hiding this comment.
Please, check where the previous constructor is invoked and update those code parts as well.
In addition, if this change is targeted to the next release (1.11.0) and the previous constructor was not released yet, you may simply add the new argument to it. The key is to not break API compatibility between releases.
There was a problem hiding this comment.
Good point. I found I forgot to change the new ParquetFileWriter() inside ParquetWriter to the new signature. I will add it soon.
Since the 1.11.0 tag has been released and the current version is 1.12.0-SNAPSHOT, I would assume this will be in 1.12.0.
There was a problem hiding this comment.
This is because or faulty releasing process. 1.11.0 is not release yet. 1.11.0 tag will be moved to the finally released RC.
There was a problem hiding this comment.
Cool. Let's just add one parameter instead of creating a new signature.
18f938c to
c33cd36
Compare
| @Test | ||
| public void testBinaryStatsWithTruncation() { | ||
| int defaultTruncLen = ParquetProperties.DEFAULT_STATISTICS_TRUNCATE_LENGTH; | ||
| int[] validLengths = {1, 2, 16, 64, defaultTruncLen - 1, defaultTruncLen - 1}; |
There was a problem hiding this comment.
Why do you test defaultTruncLen -1 twice? I guess, defaultTrundLen itself shall also be tested.
There was a problem hiding this comment.
Good catch. That is a typo, I meant to test defaultTruncLen.
There was a problem hiding this comment.
- 1 is still there. :)
There was a problem hiding this comment.
Don't know how it is slipped. Fix it now.
| testBinaryStatsWithTruncation(len); | ||
| } | ||
|
|
||
| int[] invalidLengths = {-1, 0, defaultTruncLen + 1}; |
There was a problem hiding this comment.
defaultTruncLen + 1 is invalid only because ParquetProperties.DEFAULT_STATISTICS_TRUNCATE_LENGTH is Integer.MAX_VALUE so it overflows. Meanwhile, semantically a larger value of the default one shall be supported.
There was a problem hiding this comment.
Make sense. I will make it as "-1, 0, Integer.MAX_VALUE + 1"
There was a problem hiding this comment.
Don't know how it is slipped. Fix it now.
| @@ -591,9 +609,17 @@ public static Statistics toParquetStatistics( | |||
| if (!stats.isEmpty() && stats.isSmallerThan(MAX_STATS_SIZE)) { | |||
There was a problem hiding this comment.
So, we are not truncating if the size is too large? It should be the other way around: check the size after the truncate.
| } | ||
| } | ||
|
|
||
| private void testBinaryStatsWithTruncation(int truncateLen) { |
There was a problem hiding this comment.
Please add test data where we would reach the size limit of statistics.
| @Test | ||
| public void testBinaryStatsWithTruncation() { | ||
| int defaultTruncLen = ParquetProperties.DEFAULT_STATISTICS_TRUNCATE_LENGTH; | ||
| int[] validLengths = {1, 2, 16, 64, defaultTruncLen - 1, defaultTruncLen - 1}; |
There was a problem hiding this comment.
- 1 is still there. :)
| testBinaryStatsWithTruncation(len); | ||
| } | ||
|
|
||
| int[] invalidLengths = {-1, 0, defaultTruncLen + 1}; |
...t-hadoop/src/test/java/org/apache/parquet/format/converter/TestParquetMetadataConverter.java
Outdated
Show resolved
Hide resolved
parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
Show resolved
Hide resolved
gszadovszky
left a comment
There was a problem hiding this comment.
One more finding.
The build/test fails because of PARQUET-1691. Please, wait with your last change for the PR #698 to get in so we can re-trigger the build and we will have all green lights.
| for (int len : invalidLengths) { | ||
| try { | ||
| testBinaryStatsWithTruncation(len, 80, 20); | ||
| } catch (IllegalArgumentException e) { |
There was a problem hiding this comment.
This test would not fail if no exception is thrown.
| } catch (IllegalArgumentException e) { | |
| Assert.fail("..."); | |
| } catch (IllegalArgumentException e) { |
Make sure you have checked all steps below.
Jira
PARQUET-1685: Truncate Min/Max for Statistics
Tests
Added unit test
Write a Spark application test and it passed
Commits
Documentation