PARQUET-139: Avoid reading footers when using task-side metadata by rdblue · Pull Request #91 · apache/parquet-java

rdblue · 2014-12-04T01:17:16Z

This updates the InternalParquetRecordReader to initialize the ReadContext in each task rather than once for an entire job. There are two reasons for this change:

For correctness, the requested projection schema must be validated against each file schema, not once using the merged schema.
To avoid reading file footers on the client side, which is a performance bottleneck.

Because the read context is reinitialized in every task, it is no longer necessary to pass the its contents to each task in ParquetInputSplit. The fields and accessors have been removed.

This also adds a new InputFormat, ParquetFileInputFormat that uses FileSplits instead of ParquetSplits. It goes through the normal ParquetRecordReader and creates a ParquetSplit on the task side. This is to avoid accidental behavior changes in ParquetInputFormat.

rdblue · 2014-12-12T07:21:46Z

I've merged the ParquetFileInputFormat into the original ParquetInputFormat, as discussed in the sync-up. I think this is close to being ready to merge and is ready for review.

tomwhite · 2014-12-15T11:22:35Z

parquet-avro/src/main/java/parquet/avro/AvroReadSupport.java

Is this change related to this JIRA? If not, might be best to commit separately.

It is related, but could be considered a separate bug against the Avro read support. I included it because this pull request changes the behavior so that each file schema is validated with the projection schema. This is done for all object models except for Avro because this was previously missing. I think it was omitted from Avro because the name isn't accurate for what it does.

julienledem · 2015-01-08T23:29:08Z

thank @rdblue
I did a first pass of comments.
Overall this looks good to me.

rdblue · 2015-01-28T21:30:52Z

@julienledem and @tomwhite I think I've addressed all of the review issues, though there were a few that I explained better rather than changing. I'd like to get this in shortly and push for a 1.6.0 release, so please take another look when you have time. Thanks!

tomwhite · 2015-02-02T14:12:16Z

My feedback has been addressed, so I'm +1 on this once the travis run passes.

rdblue · 2015-02-04T05:24:58Z

@julienledem I'd like to commit these changes this week, if you have a chance to take a second look. Thanks!

julienledem · 2015-02-04T18:18:40Z

LGTM +1
Let's plan a follow-up PR to simplify the code a little by removing the SplitStrategy remaining class.

This updates the InternalParquetRecordReader to initialize the ReadContext in each task rather than once for an entire job. There are two reasons for this change: 1. For correctness, the requested projection schema must be validated against each file schema, not once using the merged schema. 2. To avoid reading file footers on the client side, which is a performance bottleneck. Because the read context is reinitialized in every task, it is no longer necessary to pass the its contents to each task in ParquetInputSplit. The fields and accessors have been removed.

This refactors the ReadSupport class handling in ParquetInputFormat. It will now use the ReadSupport class if it is set on the InputFormat in the constructor. This allows InputFormat subclasses to use a specific ReadSupport class without setting it in the Configuration. This also removes the use of getReadSupport in the mapred InputFormat, where it was unnecessary.

This adds a new InputFormat that uses FileSplit. This is to avoid incompatible behavior changes from modifying ParquetInputFormat. This uses the same ParquetRecordReader structure.

This merges the changes from the FileSplit-based InputFormat into ParquetInputFormat in place of the task-side split strategy. This is mostly a refactor of the previous commit after testing the InputFormat changes in isolation.

The ReadSupport class needs to be parameterized in all cases, and not all cases with mapred InputSplits were correctly converted to the mapreduce API.

rdblue · 2015-02-05T17:01:36Z

Thanks for the reviews! I've rebased and will merge this when tests pass.

rdblue · 2015-02-05T23:07:08Z

It looks like Travis has a huge backlog, so I tested this locally instead of waiting for the job.

This updates the InternalParquetRecordReader to initialize the ReadContext in each task rather than once for an entire job. There are two reasons for this change: 1. For correctness, the requested projection schema must be validated against each file schema, not once using the merged schema. 2. To avoid reading file footers on the client side, which is a performance bottleneck. Because the read context is reinitialized in every task, it is no longer necessary to pass the its contents to each task in ParquetInputSplit. The fields and accessors have been removed. This also adds a new InputFormat, ParquetFileInputFormat that uses FileSplits instead of ParquetSplits. It goes through the normal ParquetRecordReader and creates a ParquetSplit on the task side. This is to avoid accidental behavior changes in ParquetInputFormat. Author: Ryan Blue <[email protected]> Closes apache#91 from rdblue/PARQUET-139-input-format-task-side and squashes the following commits: cb30660 [Ryan Blue] PARQUET-139: Fix deprecated reader bug from review fixes. 09cde8d [Ryan Blue] PARQUET-139: Implement changes from reviews. 3eec553 [Ryan Blue] PARQUET-139: Merge new InputFormat into ParquetInputFormat. 8971b80 [Ryan Blue] PARQUET-139: Add ParquetFileInputFormat that uses FileSplit. 87dfe86 [Ryan Blue] PARQUET-139: Expose read support helper methods. 057c7dc [Ryan Blue] PARQUET-139: Update reader to initialize read context in tasks. Conflicts: parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputSplit.java parquet-hadoop/src/main/java/parquet/hadoop/ParquetRecordReader.java parquet-hadoop/src/test/java/parquet/hadoop/TestInputFormat.java Resolutions: ParquetInputFormat conflict from unbackported strict type checking ParquetInputSplit conflict from methods added back for compatibility Other conflicts were minor

rdblue force-pushed the PARQUET-139-input-format-task-side branch from 33db70a to f6f081a Compare December 4, 2014 01:38

tomwhite reviewed Dec 15, 2014
View reviewed changes

rdblue force-pushed the PARQUET-139-input-format-task-side branch 2 times, most recently from d8c6dbe to 5e95d8b Compare January 28, 2015 21:27

rdblue added 6 commits February 5, 2015 09:01

PARQUET-139: Add ParquetFileInputFormat that uses FileSplit.

8971b80

This adds a new InputFormat that uses FileSplit. This is to avoid incompatible behavior changes from modifying ParquetInputFormat. This uses the same ParquetRecordReader structure.

PARQUET-139: Implement changes from reviews.

09cde8d

PARQUET-139: Fix deprecated reader bug from review fixes.

cb30660

The ReadSupport class needs to be parameterized in all cases, and not all cases with mapred InputSplits were correctly converted to the mapreduce API.

rdblue force-pushed the PARQUET-139-input-format-task-side branch from cc6bada to cb30660 Compare February 5, 2015 17:01

asfgit closed this in ce65dfb Feb 5, 2015

liancheng mentioned this pull request Apr 4, 2015

[WIP][SQL][SPARK-6632]: Read schema from each input split in the ReadSupport hook, reconciling with the metastore schema at that time apache/spark#5298

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PARQUET-139: Avoid reading footers when using task-side metadata#91

PARQUET-139: Avoid reading footers when using task-side metadata#91
rdblue wants to merge 6 commits intoapache:masterfrom
rdblue:PARQUET-139-input-format-task-side

rdblue commented Dec 4, 2014

Uh oh!

rdblue commented Dec 12, 2014

Uh oh!

tomwhite Dec 15, 2014

Uh oh!

rdblue Dec 17, 2014

Uh oh!

julienledem commented Jan 8, 2015

Uh oh!

rdblue commented Jan 28, 2015

Uh oh!

tomwhite commented Feb 2, 2015

Uh oh!

rdblue commented Feb 4, 2015

Uh oh!

julienledem commented Feb 4, 2015

Uh oh!

rdblue commented Feb 5, 2015

Uh oh!

rdblue commented Feb 5, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

rdblue commented Dec 4, 2014

Uh oh!

rdblue commented Dec 12, 2014

Uh oh!

tomwhite Dec 15, 2014

Choose a reason for hiding this comment

Uh oh!

rdblue Dec 17, 2014

Choose a reason for hiding this comment

Uh oh!

julienledem commented Jan 8, 2015

Uh oh!

rdblue commented Jan 28, 2015

Uh oh!

tomwhite commented Feb 2, 2015

Uh oh!

rdblue commented Feb 4, 2015

Uh oh!

julienledem commented Feb 4, 2015

Uh oh!

rdblue commented Feb 5, 2015

Uh oh!

rdblue commented Feb 5, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants