Skip to content

PARQUET-139: Avoid reading footers when using task-side metadata#91

Closed
rdblue wants to merge 6 commits intoapache:masterfrom
rdblue:PARQUET-139-input-format-task-side
Closed

PARQUET-139: Avoid reading footers when using task-side metadata#91
rdblue wants to merge 6 commits intoapache:masterfrom
rdblue:PARQUET-139-input-format-task-side

Conversation

@rdblue
Copy link
Contributor

@rdblue rdblue commented Dec 4, 2014

This updates the InternalParquetRecordReader to initialize the ReadContext in each task rather than once for an entire job. There are two reasons for this change:

  1. For correctness, the requested projection schema must be validated against each file schema, not once using the merged schema.
  2. To avoid reading file footers on the client side, which is a performance bottleneck.

Because the read context is reinitialized in every task, it is no longer necessary to pass the its contents to each task in ParquetInputSplit. The fields and accessors have been removed.

This also adds a new InputFormat, ParquetFileInputFormat that uses FileSplits instead of ParquetSplits. It goes through the normal ParquetRecordReader and creates a ParquetSplit on the task side. This is to avoid accidental behavior changes in ParquetInputFormat.

@rdblue rdblue force-pushed the PARQUET-139-input-format-task-side branch from 33db70a to f6f081a Compare December 4, 2014 01:38
@rdblue
Copy link
Contributor Author

rdblue commented Dec 12, 2014

I've merged the ParquetFileInputFormat into the original ParquetInputFormat, as discussed in the sync-up. I think this is close to being ready to merge and is ready for review.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change related to this JIRA? If not, might be best to commit separately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is related, but could be considered a separate bug against the Avro read support. I included it because this pull request changes the behavior so that each file schema is validated with the projection schema. This is done for all object models except for Avro because this was previously missing. I think it was omitted from Avro because the name isn't accurate for what it does.

@julienledem
Copy link
Member

thank @rdblue
I did a first pass of comments.
Overall this looks good to me.

@rdblue rdblue force-pushed the PARQUET-139-input-format-task-side branch 2 times, most recently from d8c6dbe to 5e95d8b Compare January 28, 2015 21:27
@rdblue
Copy link
Contributor Author

rdblue commented Jan 28, 2015

@julienledem and @tomwhite I think I've addressed all of the review issues, though there were a few that I explained better rather than changing. I'd like to get this in shortly and push for a 1.6.0 release, so please take another look when you have time. Thanks!

@tomwhite
Copy link
Member

tomwhite commented Feb 2, 2015

My feedback has been addressed, so I'm +1 on this once the travis run passes.

@rdblue
Copy link
Contributor Author

rdblue commented Feb 4, 2015

@julienledem I'd like to commit these changes this week, if you have a chance to take a second look. Thanks!

@julienledem
Copy link
Member

LGTM +1
Let's plan a follow-up PR to simplify the code a little by removing the SplitStrategy remaining class.

This updates the InternalParquetRecordReader to initialize the
ReadContext in each task rather than once for an entire job. There are
two reasons for this change:

1. For correctness, the requested projection schema must be validated
   against each file schema, not once using the merged schema.
2. To avoid reading file footers on the client side, which is a
   performance bottleneck.

Because the read context is reinitialized in every task, it is no longer
necessary to pass the its contents to each task in ParquetInputSplit.
The fields and accessors have been removed.
This refactors the ReadSupport class handling in ParquetInputFormat. It
will now use the ReadSupport class if it is set on the InputFormat in
the constructor. This allows InputFormat subclasses to use a specific
ReadSupport class without setting it in the Configuration.

This also removes the use of getReadSupport in the mapred InputFormat,
where it was unnecessary.
This adds a new InputFormat that uses FileSplit. This is to avoid
incompatible behavior changes from modifying ParquetInputFormat. This
uses the same ParquetRecordReader structure.
This merges the changes from the FileSplit-based InputFormat into
ParquetInputFormat in place of the task-side split strategy. This is
mostly a refactor of the previous commit after testing the InputFormat
changes in isolation.
The ReadSupport class needs to be parameterized in all cases, and not
all cases with mapred InputSplits were correctly converted to the
mapreduce API.
@rdblue rdblue force-pushed the PARQUET-139-input-format-task-side branch from cc6bada to cb30660 Compare February 5, 2015 17:01
@rdblue
Copy link
Contributor Author

rdblue commented Feb 5, 2015

Thanks for the reviews! I've rebased and will merge this when tests pass.

@asfgit asfgit closed this in ce65dfb Feb 5, 2015
@rdblue
Copy link
Contributor Author

rdblue commented Feb 5, 2015

It looks like Travis has a huge backlog, so I tested this locally instead of waiting for the job.

rdblue added a commit to rdblue/parquet-mr that referenced this pull request Feb 6, 2015
This updates the InternalParquetRecordReader to initialize the ReadContext in each task rather than once for an entire job. There are two reasons for this change:

1. For correctness, the requested projection schema must be validated against each file schema, not once using the merged schema.
2. To avoid reading file footers on the client side, which is a performance bottleneck.

Because the read context is reinitialized in every task, it is no longer necessary to pass the its contents to each task in ParquetInputSplit. The fields and accessors have been removed.

This also adds a new InputFormat, ParquetFileInputFormat that uses FileSplits instead of ParquetSplits. It goes through the normal ParquetRecordReader and creates a ParquetSplit on the task side. This is to avoid accidental behavior changes in ParquetInputFormat.

Author: Ryan Blue <[email protected]>

Closes apache#91 from rdblue/PARQUET-139-input-format-task-side and squashes the following commits:

cb30660 [Ryan Blue] PARQUET-139: Fix deprecated reader bug from review fixes.
09cde8d [Ryan Blue] PARQUET-139: Implement changes from reviews.
3eec553 [Ryan Blue] PARQUET-139: Merge new InputFormat into ParquetInputFormat.
8971b80 [Ryan Blue] PARQUET-139: Add ParquetFileInputFormat that uses FileSplit.
87dfe86 [Ryan Blue] PARQUET-139: Expose read support helper methods.
057c7dc [Ryan Blue] PARQUET-139: Update reader to initialize read context in tasks.

Conflicts:
	parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java
	parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java
	parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputSplit.java
	parquet-hadoop/src/main/java/parquet/hadoop/ParquetRecordReader.java
	parquet-hadoop/src/test/java/parquet/hadoop/TestInputFormat.java
Resolutions:
    ParquetInputFormat conflict from unbackported strict type checking
    ParquetInputSplit conflict from methods added back for compatibility
    Other conflicts were minor
rdblue added a commit to rdblue/parquet-mr that referenced this pull request Mar 9, 2015
This updates the InternalParquetRecordReader to initialize the ReadContext in each task rather than once for an entire job. There are two reasons for this change:

1. For correctness, the requested projection schema must be validated against each file schema, not once using the merged schema.
2. To avoid reading file footers on the client side, which is a performance bottleneck.

Because the read context is reinitialized in every task, it is no longer necessary to pass the its contents to each task in ParquetInputSplit. The fields and accessors have been removed.

This also adds a new InputFormat, ParquetFileInputFormat that uses FileSplits instead of ParquetSplits. It goes through the normal ParquetRecordReader and creates a ParquetSplit on the task side. This is to avoid accidental behavior changes in ParquetInputFormat.

Author: Ryan Blue <[email protected]>

Closes apache#91 from rdblue/PARQUET-139-input-format-task-side and squashes the following commits:

cb30660 [Ryan Blue] PARQUET-139: Fix deprecated reader bug from review fixes.
09cde8d [Ryan Blue] PARQUET-139: Implement changes from reviews.
3eec553 [Ryan Blue] PARQUET-139: Merge new InputFormat into ParquetInputFormat.
8971b80 [Ryan Blue] PARQUET-139: Add ParquetFileInputFormat that uses FileSplit.
87dfe86 [Ryan Blue] PARQUET-139: Expose read support helper methods.
057c7dc [Ryan Blue] PARQUET-139: Update reader to initialize read context in tasks.

Conflicts:
	parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java
	parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java
	parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputSplit.java
	parquet-hadoop/src/main/java/parquet/hadoop/ParquetRecordReader.java
	parquet-hadoop/src/test/java/parquet/hadoop/TestInputFormat.java
Resolutions:
    ParquetInputFormat conflict from unbackported strict type checking
    ParquetInputSplit conflict from methods added back for compatibility
    Other conflicts were minor
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants