-
Notifications
You must be signed in to change notification settings - Fork 2.2k
[Fix][Connector-File] Fix parquet support user config schema #9596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Fix][Connector-File] Fix parquet support user config schema #9596
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes the Parquet connector to support user-defined schema field types by modifying the ParquetReadStrategy class to handle type conversion between Parquet native types and user-configured types.
- Enhanced
ParquetReadStrategyto support user-defined schema field types with proper type conversion - Added comprehensive test coverage for various data types including primitives, collections, and complex types
- Extended type conversion utilities to support float/double to decimal conversions
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
ParquetReadStrategy.java |
Core implementation to support user-defined schema types with conversion logic |
ParquetReadStrategyTest.java |
Added comprehensive test case and expanded test data generation |
BaseHdfsFileSource.java |
Updated to use new schema handling method for Parquet files |
TypeUtil.java |
Extended type conversion support for float/double to decimal |
test_user_config_read_parquet.conf |
Test configuration file for user-defined schema validation |
| resolveObject(value, valueType))); | ||
| return dataMap; | ||
| case BOOLEAN: | ||
| return Boolean.parseBoolean(field.toString()); |
Copilot
AI
Jul 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Boolean.parseBoolean() only returns true for the string "true" (case-insensitive), but Parquet boolean fields are typically already Boolean objects. This could cause incorrect conversion of true boolean values to false if they're not string representations.
| return Boolean.parseBoolean(field.toString()); | |
| if (field instanceof Boolean) { | |
| return field; | |
| } else { | |
| String errorMsg = String.format("Expected a Boolean type but got [%s]", field.getClass().getName()); | |
| throw new FileConnectorException(CommonErrorCodeDeprecated.UNSUPPORTED_DATA_TYPE, errorMsg); | |
| } |
| return dataMap; | ||
| case BOOLEAN: | ||
| return Boolean.parseBoolean(field.toString()); | ||
| case INT: |
Copilot
AI
Jul 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Integer.parseInt() will throw NumberFormatException if the field is already an Integer object, as toString() on Integer objects returns valid integer strings, but this unnecessary conversion could fail for edge cases or if field is null.
| case INT: | |
| case INT: | |
| if (field instanceof Integer) { | |
| return field; | |
| } | |
| if (field == null) { | |
| throw new FileConnectorException( | |
| CommonErrorCodeDeprecated.UNSUPPORTED_DATA_TYPE, | |
| "Field is null and cannot be converted to Integer."); | |
| } |
| int fieldIndex = Arrays.asList(configRowType.getFieldNames()).indexOf(fieldName); | ||
|
|
||
| return fieldIndex == -1 ? null : configRowType.getFieldType(fieldIndex); |
Copilot
AI
Jul 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arrays.asList(configRowType.getFieldNames()).indexOf(fieldName) creates a new ArrayList and performs linear search for each field lookup. Consider using a HashMap to cache field name to index mappings for better performance when processing multiple fields.
| int fieldIndex = Arrays.asList(configRowType.getFieldNames()).indexOf(fieldName); | |
| return fieldIndex == -1 ? null : configRowType.getFieldType(fieldIndex); | |
| initializeFieldNameToIndexCache(configRowType); | |
| Integer fieldIndex = fieldNameToIndexCache.get(fieldName); | |
| return fieldIndex == null ? null : configRowType.getFieldType(fieldIndex); |
| record1.put("salary", 50000.0); | ||
| record1.put("age", 30); | ||
| record1.put("active", true); | ||
| record1.put("score", 98.5f); |
Copilot
AI
Jul 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test data puts a float value (98.5f) into the 'score' field, but the schema defines it as 'double' type. This type mismatch could cause issues in the Parquet writer.
| record1.put("score", 98.5f); | |
| record1.put("score", 98.5); |
| record2.put("salary", 60000.0); | ||
| record2.put("age", 35); | ||
| record2.put("active", false); | ||
| record2.put("score", 89.2f); |
Copilot
AI
Jul 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue as line 408 - putting a float value into a field defined as double type in the schema.
| record2.put("score", 89.2f); | |
| record2.put("score", 89.2); |
Hisoka-X
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @JeremyXin
Purpose of this pull request
Fix #9140
Modify
ParquetReadStrategyto support user-defined schema field typesDoes this PR introduce any user-facing change?
How was this patch tested?
Check list
New License Guide