PARQUET-139: Avoid reading footers when using task-side metadata #91
rdblue wants to merge 6 commits into apache:master from rdblue/PARQUET-139-input-format-task-side
Conversation
force-pushed from 33db70a to f6f081a
I've merged the ParquetFileInputFormat into the original ParquetInputFormat, as discussed in the sync-up. I think this is close to ready to merge and is now ready for review.
Is this change related to this JIRA? If not, might be best to commit separately.
It is related, but could be considered a separate bug against the Avro read support. I included it because this pull request changes the behavior so that each file schema is validated against the projection schema. That validation is done for all object models, but it was previously missing for Avro. I think it was omitted from Avro because the method's name isn't accurate for what it does.
Thanks @rdblue
force-pushed from d8c6dbe to 5e95d8b
@julienledem and @tomwhite I think I've addressed all of the review issues, though there were a few that I explained better rather than changing. I'd like to get this in shortly and push for a 1.6.0 release, so please take another look when you have time. Thanks!
My feedback has been addressed, so I'm +1 on this once the Travis run passes.
@julienledem I'd like to commit these changes this week, if you have a chance to take a second look. Thanks!
LGTM +1 |
This updates the InternalParquetRecordReader to initialize the ReadContext in each task rather than once for an entire job. There are two reasons for this change:
1. For correctness, the requested projection schema must be validated against each file schema, not once against the merged schema.
2. To avoid reading file footers on the client side, which is a performance bottleneck.
Because the read context is reinitialized in every task, it is no longer necessary to pass its contents to each task in ParquetInputSplit. The fields and accessors have been removed.
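The correctness point above can be sketched with a toy example. This is illustrative only: schemas are reduced to sets of column names standing in for Parquet's MessageType, and `projectionIsValid` is a hypothetical helper, not the Parquet API. It shows why validating a projection once against the merged schema can accept a request that an individual file cannot actually serve.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class PerTaskProjectionDemo {

    // A projection is valid for a file only if every requested column
    // exists in that particular file's schema.
    public static boolean projectionIsValid(Set<String> fileSchema,
                                            Set<String> projection) {
        return fileSchema.containsAll(projection);
    }

    public static void main(String[] args) {
        Set<String> fileA = new HashSet<>(Arrays.asList("id", "name"));
        Set<String> fileB = new HashSet<>(Arrays.asList("id", "email"));

        // The merged schema is the union of all file schemas.
        Set<String> merged = new HashSet<>(fileA);
        merged.addAll(fileB);

        Set<String> projection = new HashSet<>(Arrays.asList("id", "name"));

        // Validating once against the merged schema accepts the projection...
        System.out.println(projectionIsValid(merged, projection)); // true
        // ...but per-file validation catches that fileB cannot serve it.
        System.out.println(projectionIsValid(fileA, projection));  // true
        System.out.println(projectionIsValid(fileB, projection));  // false
    }
}
```

Validating per task, against each file's own schema, surfaces the mismatch for fileB instead of silently passing the merged-schema check.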
This refactors the ReadSupport class handling in ParquetInputFormat. It will now use the ReadSupport class if it is set on the InputFormat in the constructor. This allows InputFormat subclasses to use a specific ReadSupport class without setting it in the Configuration. This also removes the use of getReadSupport in the mapred InputFormat, where it was unnecessary.
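The lookup order described above can be sketched as follows. The configuration key `parquet.read.support.class` is Parquet's real property name; everything else here (String-valued class names, a Map standing in for Hadoop's Configuration) is a simplified stand-in for illustration, not the actual parquet-hadoop code.

```java
import java.util.HashMap;
import java.util.Map;

public class ReadSupportLookupDemo {

    // Prefer the ReadSupport fixed by the InputFormat constructor;
    // fall back to the class named in the job configuration.
    public static String resolveReadSupport(String constructorClass,
                                            Map<String, String> conf) {
        if (constructorClass != null) {
            return constructorClass;
        }
        return conf.get("parquet.read.support.class");
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("parquet.read.support.class", "ConfiguredReadSupport");

        // A subclass that passed a ReadSupport to the constructor wins...
        System.out.println(resolveReadSupport("AvroReadSupport", conf)); // AvroReadSupport
        // ...otherwise the Configuration decides.
        System.out.println(resolveReadSupport(null, conf)); // ConfiguredReadSupport
    }
}
```

The benefit of constructor-first resolution is that a subclass like an Avro-specific InputFormat can hard-wire its ReadSupport without requiring every job to set the configuration property.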
This adds a new InputFormat that uses FileSplit. This is to avoid incompatible behavior changes from modifying ParquetInputFormat. This uses the same ParquetRecordReader structure.
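With a plain FileSplit, the client ships only a byte range (path, start, length) and each task reads the footer itself, keeping the row groups that fall in its range. One common convention, sketched below, assigns a row group to the split containing its midpoint, so each row group is claimed by exactly one split. The RowGroup class is a simplified stand-in, not a Parquet type.

```java
public class TaskSideSplitDemo {

    public static class RowGroup {
        public final long start;
        public final long length;
        public RowGroup(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    // Assign a row group to the split that contains its midpoint, so
    // adjacent splits never both claim (or both skip) a row group.
    public static boolean inSplit(RowGroup rg, long splitStart, long splitLength) {
        long mid = rg.start + rg.length / 2;
        return splitStart <= mid && mid < splitStart + splitLength;
    }

    public static void main(String[] args) {
        RowGroup first = new RowGroup(0, 100);    // midpoint 50
        RowGroup second = new RowGroup(100, 100); // midpoint 150

        System.out.println(inSplit(first, 0, 128));    // true
        System.out.println(inSplit(second, 0, 128));   // false
        System.out.println(inSplit(second, 128, 128)); // true
    }
}
```

This is what makes the footer read on the client side unnecessary: the split boundaries come from the file length alone, and row-group assignment happens in the task.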
This merges the changes from the FileSplit-based InputFormat into ParquetInputFormat in place of the task-side split strategy. This is mostly a refactor of the previous commit after testing the InputFormat changes in isolation.
The ReadSupport class needs to be parameterized in all cases, and not all cases with mapred InputSplits were correctly converted to the mapreduce API.
force-pushed from cc6bada to cb30660
Thanks for the reviews! I've rebased and will merge this when tests pass.
It looks like Travis has a huge backlog, so I tested this locally instead of waiting for the job. |
This updates the InternalParquetRecordReader to initialize the ReadContext in each task rather than once for an entire job. There are two reasons for this change:
1. For correctness, the requested projection schema must be validated against each file schema, not once against the merged schema.
2. To avoid reading file footers on the client side, which is a performance bottleneck.
Because the read context is reinitialized in every task, it is no longer necessary to pass its contents to each task in ParquetInputSplit. The fields and accessors have been removed.

This also adds a new InputFormat, ParquetFileInputFormat, that uses FileSplits instead of ParquetSplits. It goes through the normal ParquetRecordReader and creates a ParquetSplit on the task side. This is to avoid accidental behavior changes in ParquetInputFormat.

Author: Ryan Blue <[email protected]>

Closes apache#91 from rdblue/PARQUET-139-input-format-task-side and squashes the following commits:

cb30660 [Ryan Blue] PARQUET-139: Fix deprecated reader bug from review fixes.
09cde8d [Ryan Blue] PARQUET-139: Implement changes from reviews.
3eec553 [Ryan Blue] PARQUET-139: Merge new InputFormat into ParquetInputFormat.
8971b80 [Ryan Blue] PARQUET-139: Add ParquetFileInputFormat that uses FileSplit.
87dfe86 [Ryan Blue] PARQUET-139: Expose read support helper methods.
057c7dc [Ryan Blue] PARQUET-139: Update reader to initialize read context in tasks.

Conflicts:
parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java
parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java
parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputSplit.java
parquet-hadoop/src/main/java/parquet/hadoop/ParquetRecordReader.java
parquet-hadoop/src/test/java/parquet/hadoop/TestInputFormat.java

Resolutions:
ParquetInputFormat conflict from unbackported strict type checking
ParquetInputSplit conflict from methods added back for compatibility
Other conflicts were minor