Conversation

@JeremyXin (Contributor)

Purpose of this pull request

Fix #9140

Modify ParquetReadStrategy to support user-defined schema field types

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

Copilot AI left a comment

Pull Request Overview

This PR fixes the Parquet connector to support user-defined schema field types by modifying the ParquetReadStrategy class to handle type conversion between Parquet native types and user-configured types.

  • Enhanced ParquetReadStrategy to support user-defined schema field types with proper type conversion
  • Added comprehensive test coverage for various data types including primitives, collections, and complex types
  • Extended type conversion utilities to support float/double to decimal conversions
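The float/double-to-decimal conversion this PR adds to TypeUtil can be sketched roughly as follows. This is a standalone illustration, not the PR's actual code: `toDecimal` is a hypothetical helper, and `IllegalArgumentException` stands in for the connector's own exception type. Going through the value's string form keeps the decimal faithful to the printed value, whereas `new BigDecimal(double)` would preserve the binary floating-point representation (e.g. `new BigDecimal(0.1)` yields 0.1000000000000000055511151231257827...).

```java
import java.math.BigDecimal;

public class DecimalConversionSketch {
    // Convert a float/double field to BigDecimal via its string form so the
    // decimal matches the value's shortest decimal representation rather than
    // its raw binary expansion.
    static BigDecimal toDecimal(Object field) {
        if (field instanceof Float || field instanceof Double) {
            return new BigDecimal(field.toString());
        }
        throw new IllegalArgumentException(
                "Expected a float/double but got " + field.getClass().getName());
    }

    public static void main(String[] args) {
        System.out.println(toDecimal(98.5f));   // 98.5
        System.out.println(toDecimal(50000.0)); // 50000.0
    }
}
```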

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| ParquetReadStrategy.java | Core implementation to support user-defined schema types with conversion logic |
| ParquetReadStrategyTest.java | Added comprehensive test case and expanded test data generation |
| BaseHdfsFileSource.java | Updated to use the new schema handling method for Parquet files |
| TypeUtil.java | Extended type conversion support for float/double to decimal |
| test_user_config_read_parquet.conf | Test configuration file for user-defined schema validation |

```java
                resolveObject(value, valueType)));
        return dataMap;
    case BOOLEAN:
        return Boolean.parseBoolean(field.toString());
```
Copilot AI (Jul 21, 2025)

Boolean.parseBoolean() returns true only for the string "true" (case-insensitive). Parquet boolean fields are typically already Boolean objects, which happen to round-trip safely because Boolean.toString() emits "true"/"false", but any other representation of a true value (for example "1") would silently be converted to false. An explicit instanceof check is safer.

Suggested change:

```diff
-return Boolean.parseBoolean(field.toString());
+if (field instanceof Boolean) {
+    return field;
+} else {
+    String errorMsg = String.format("Expected a Boolean type but got [%s]", field.getClass().getName());
+    throw new FileConnectorException(CommonErrorCodeDeprecated.UNSUPPORTED_DATA_TYPE, errorMsg);
+}
```

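The pitfall flagged above is easy to reproduce in isolation: Boolean.parseBoolean accepts only the literal word "true" (in any case) and silently maps everything else to false. A minimal demonstration:

```java
public class ParseBooleanPitfall {
    public static void main(String[] args) {
        // Only the exact word "true" (any case) parses to true.
        System.out.println(Boolean.parseBoolean("true"));  // true
        System.out.println(Boolean.parseBoolean("TRUE"));  // true
        // Anything else, including "1" or "yes", silently becomes false.
        System.out.println(Boolean.parseBoolean("1"));     // false
        System.out.println(Boolean.parseBoolean("yes"));   // false
        // A Boolean object survives the round-trip only because
        // Boolean.toString() happens to emit "true"/"false".
        Object field = Boolean.TRUE;
        System.out.println(Boolean.parseBoolean(field.toString())); // true
    }
}
```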
```java
        return dataMap;
    case BOOLEAN:
        return Boolean.parseBoolean(field.toString());
    case INT:
```
Copilot AI (Jul 21, 2025)

Integer.parseInt(field.toString()) is an unnecessary round-trip when the field is already an Integer, since toString() on an Integer just produces the value back as a string. Worse, if field is null, field.toString() throws a NullPointerException. Returning Integer fields directly and rejecting null explicitly is safer.

Suggested change:

```diff
 case INT:
+    if (field instanceof Integer) {
+        return field;
+    }
+    if (field == null) {
+        throw new FileConnectorException(
+                CommonErrorCodeDeprecated.UNSUPPORTED_DATA_TYPE,
+                "Field is null and cannot be converted to Integer.");
+    }
```
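The safer shape the suggestion describes can be tried standalone. In this sketch, `toInt` is a hypothetical converter and `IllegalArgumentException` stands in for the connector's FileConnectorException, which is not on the classpath here:

```java
public class IntFieldSketch {
    // Return Integer fields directly, fail fast on null, and only fall back
    // to string parsing for other representations.
    static Integer toInt(Object field) {
        if (field instanceof Integer) {
            return (Integer) field;
        }
        if (field == null) {
            throw new IllegalArgumentException(
                    "Field is null and cannot be converted to Integer.");
        }
        return Integer.parseInt(field.toString());
    }

    public static void main(String[] args) {
        System.out.println(toInt(30));   // 30 (already an Integer)
        System.out.println(toInt("35")); // 35 (parsed from a string)
    }
}
```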
Comment on lines +498 to +500:

```java
int fieldIndex = Arrays.asList(configRowType.getFieldNames()).indexOf(fieldName);

return fieldIndex == -1 ? null : configRowType.getFieldType(fieldIndex);
```
Copilot AI (Jul 21, 2025)

Arrays.asList(configRowType.getFieldNames()).indexOf(fieldName) creates a new ArrayList and performs linear search for each field lookup. Consider using a HashMap to cache field name to index mappings for better performance when processing multiple fields.

Suggested change:

```diff
-int fieldIndex = Arrays.asList(configRowType.getFieldNames()).indexOf(fieldName);
-return fieldIndex == -1 ? null : configRowType.getFieldType(fieldIndex);
+initializeFieldNameToIndexCache(configRowType);
+Integer fieldIndex = fieldNameToIndexCache.get(fieldName);
+return fieldIndex == null ? null : configRowType.getFieldType(fieldIndex);
```

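The cached-lookup idea can be sketched in isolation. Here `FieldIndexCacheSketch` and its plain `String[]` field list are stand-ins for the real row-type API; the point is simply that the name-to-index map is built once, so each lookup is O(1) instead of the O(n) `Arrays.asList(...).indexOf(...)` scan per field:

```java
import java.util.HashMap;
import java.util.Map;

public class FieldIndexCacheSketch {
    private final String[] fieldNames;
    private Map<String, Integer> fieldNameToIndexCache;

    FieldIndexCacheSketch(String[] fieldNames) {
        this.fieldNames = fieldNames;
    }

    // Build the name -> index map lazily, exactly once.
    private Map<String, Integer> cache() {
        if (fieldNameToIndexCache == null) {
            fieldNameToIndexCache = new HashMap<>();
            for (int i = 0; i < fieldNames.length; i++) {
                fieldNameToIndexCache.put(fieldNames[i], i);
            }
        }
        return fieldNameToIndexCache;
    }

    // Returns null when the field is absent, mirroring the original
    // "indexOf == -1 ? null : ..." behavior.
    Integer indexOf(String fieldName) {
        return cache().get(fieldName);
    }

    public static void main(String[] args) {
        FieldIndexCacheSketch s =
                new FieldIndexCacheSketch(new String[] {"name", "salary", "age"});
        System.out.println(s.indexOf("salary"));  // 1
        System.out.println(s.indexOf("missing")); // null
    }
}
```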
```java
record1.put("salary", 50000.0);
record1.put("age", 30);
record1.put("active", true);
record1.put("score", 98.5f);
```
Copilot AI (Jul 21, 2025)

The test data puts a float value (98.5f) into the 'score' field, but the schema defines it as 'double' type. This type mismatch could cause issues in the Parquet writer.

Suggested change:

```diff
-record1.put("score", 98.5f);
+record1.put("score", 98.5);
```

```java
record2.put("salary", 60000.0);
record2.put("age", 35);
record2.put("active", false);
record2.put("score", 89.2f);
```
Copilot AI (Jul 21, 2025)

Same issue as line 408 - putting a float value into a field defined as double type in the schema.

Suggested change:

```diff
-record2.put("score", 89.2f);
+record2.put("score", 89.2);
```

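The float-versus-double mismatch behind both of these comments is visible in plain Java: widening a float to double is only lossless when the value happens to be exactly representable in binary, so a float written into a double-typed column can come back subtly different from the intended decimal value.

```java
public class FloatDoubleMismatch {
    public static void main(String[] args) {
        // 98.5 is exactly representable in binary (1100010.1),
        // so widening the float is lossless here.
        System.out.println((double) 98.5f == 98.5); // true

        // Most decimal fractions are not exactly representable:
        // the float 89.2f widens to a double that is close to,
        // but not equal to, the double literal 89.2.
        System.out.println((double) 89.2f == 89.2); // false
    }
}
```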
@Hisoka-X (Member) left a comment

LGTM. Thanks @JeremyXin

@corgy-w corgy-w merged commit 2bdaeb6 into apache:dev Jul 23, 2025
7 checks passed
dybyte pushed a commit to dybyte/seatunnel that referenced this pull request Jul 23, 2025

Development

Successfully merging this pull request may close these issues.

[BUG][connector-file-hadoop] When reading parquet file field value as BINARY type, write downstream data exception

3 participants