-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Infer parquet reader type based on file metadata #9294
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Can you add a sample data file with a decimal field and a test to ensure the file is correctly parsed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might throw null pointer exception here if fileKeyValueMeta is null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can put this check hasAvroSchemaInParquetFile() inside org.apache.pinot.plugin.inputformat.parquet.ParquetUtils and reuse the same method inside org.apache.pinot.plugin.inputformat.parquet.ParquetUtils.getParquetAvroSchema(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
0191412 to
cae4e5e
Compare
Codecov Report
@@ Coverage Diff @@
## master #9294 +/- ##
============================================
- Coverage 69.82% 67.04% -2.79%
- Complexity 4696 4824 +128
============================================
Files 1873 1391 -482
Lines 99623 72184 -27439
Branches 15146 11583 -3563
============================================
- Hits 69564 48396 -21168
+ Misses 25118 20266 -4852
+ Partials 4941 3522 -1419
Flags with carried forward coverage won't be shown. Click here to find out more.
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
xiangfu0
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
|
The test failure might be related: |
@Jackie-Jiang ACK. Fixed the test and added a new test to validate file metadata based reader selection |
|
Can you please modify the PR description to include the new config key for the record reader config? Also update the Pinot doc where applicable |
This PR allows parquet reader to automatically decide b/w
ParquetAvroRecordReaderandParquetNativeRecordReaderbased on the parquet file's metadata.The reader config can be used to enforce the reader, but the default behaviour is to infer the type based on file schema
Reader config flags
setUseParquetNativeRecordReader(true)-> UseParquetNativeRecordReadersetUseParquetAvroRecordReader(true)-> UseParquetAvroRecordReaderdefault -> Infer reader type based on file metadata