Skip to content

Conversation

@saurabhd336
Copy link
Contributor

@saurabhd336 saurabhd336 commented Aug 29, 2022

This PR allows parquet reader to automatically decide b/w ParquetAvroRecordReader and ParquetNativeRecordReader based on the parquet file's metadata.
The reader config can be used to enforce the reader, but the default behaviour is to infer the type based on file schema

Reader config flags
setUseParquetNativeRecordReader(true) -> Use ParquetNativeRecordReader
setUseParquetAvroRecordReader(true) -> Use ParquetAvroRecordReader
default -> Infer reader type based on file metadata

@xiangfu0
Copy link
Contributor

xiangfu0 commented Aug 30, 2022

Can you add a sample data file with a decimal field and a test to ensure the file is correctly parsed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might throw null pointer exception here if fileKeyValueMeta is null.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can put this check hasAvroSchemaInParquetFile() inside org.apache.pinot.plugin.inputformat.parquet.ParquetUtils and reuse the same method inside org.apache.pinot.plugin.inputformat.parquet.ParquetUtils.getParquetAvroSchema(...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack

@codecov-commenter
Copy link

codecov-commenter commented Aug 30, 2022

Codecov Report

Merging #9294 (ed4d8e4) into master (d1a71d8) will decrease coverage by 2.78%.
The diff coverage is 46.56%.

❗ Current head ed4d8e4 differs from pull request most recent head 711add8. Consider uploading reports for the commit 711add8 to get more accurate results

@@             Coverage Diff              @@
##             master    #9294      +/-   ##
============================================
- Coverage     69.82%   67.04%   -2.79%     
- Complexity     4696     4824     +128     
============================================
  Files          1873     1391     -482     
  Lines         99623    72184   -27439     
  Branches      15146    11583    -3563     
============================================
- Hits          69564    48396   -21168     
+ Misses        25118    20266    -4852     
+ Partials       4941     3522    -1419     
Flag Coverage Δ
integration1 ?
integration2 ?
unittests1 67.04% <46.56%> (-0.08%) ⬇️
unittests2 ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...aggregation/function/AggregationFunctionUtils.java 80.00% <ø> (+0.51%) ⬆️
...not/query/planner/logical/RelToStageConverter.java 73.91% <ø> (ø)
.../columnminmaxvalue/ColumnMinMaxValueGenerator.java 73.68% <0.00%> (ø)
...a/org/apache/pinot/segment/spi/ColumnMetadata.java 80.00% <0.00%> (-20.00%) ⬇️
...java/org/apache/pinot/segment/spi/V1Constants.java 12.50% <ø> (ø)
...ator/transform/function/BaseTransformFunction.java 46.58% <20.00%> (-4.92%) ⬇️
...java/org/apache/pinot/core/common/DataFetcher.java 77.81% <42.30%> (-11.93%) ⬇️
...e/pinot/core/transport/InstanceRequestHandler.java 60.15% <60.00%> (-4.69%) ⬇️
.../java/org/apache/pinot/query/QueryEnvironment.java 82.75% <80.00%> (-5.00%) ⬇️
...ator/transform/function/CastTransformFunction.java 84.00% <100.00%> (+13.35%) ⬆️
... and 735 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Contributor

@xiangfu0 xiangfu0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@Jackie-Jiang
Copy link
Contributor

The test failure might be related:

2022-08-31T07:17:28.7248119Z [ERROR] Failures: 
2022-08-31T07:17:28.7249092Z [ERROR]   ParquetRecordReaderTest.testComparison:105->testComparison:125 expected [false] but found [true]

@saurabhd336
Copy link
Contributor Author

The test failure might be related:

2022-08-31T07:17:28.7248119Z [ERROR] Failures: 
2022-08-31T07:17:28.7249092Z [ERROR]   ParquetRecordReaderTest.testComparison:105->testComparison:125 expected [false] but found [true]

@Jackie-Jiang ACK. Fixed the test and added a new test to validate file metadata based reader selection

@Jackie-Jiang Jackie-Jiang added enhancement Configuration Config changes (addition/deletion/change in behavior) labels Sep 1, 2022
@Jackie-Jiang Jackie-Jiang merged commit d1bad1d into apache:master Sep 1, 2022
@Jackie-Jiang
Copy link
Contributor

Can you please modify the PR description to include the new config key for the record reader config? Also update the Pinot doc where applicable

@Jackie-Jiang Jackie-Jiang changed the title Infer parquet reader type based on file metadata (wip) Infer parquet reader type based on file metadata Sep 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Configuration Config changes (addition/deletion/change in behavior) enhancement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants