Conversation

@JeremyXin (Contributor)

Purpose of this pull request

Fix #9140

Modify ParquetReadStrategy to support user-defined schema field types

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

Copilot AI left a comment

Pull Request Overview

This PR fixes the Parquet connector to support user-defined schema field types by modifying the ParquetReadStrategy class to handle type conversion between Parquet native types and user-configured types.

  • Enhanced ParquetReadStrategy to support user-defined schema field types with proper type conversion
  • Added comprehensive test coverage for various data types including primitives, collections, and complex types
  • Extended type conversion utilities to support float/double to decimal conversions
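The float/double-to-decimal conversion this PR adds to TypeUtil can be sketched roughly as follows. This is a standalone illustration, not the PR's actual code: `toDecimal` is a hypothetical helper, and `IllegalArgumentException` stands in for the connector's own exception type. Going through the value's string form keeps the decimal faithful to the printed value, whereas `new BigDecimal(double)` would preserve the binary floating-point representation (e.g. `new BigDecimal(0.1)` yields 0.1000000000000000055511151231257827...).

```java
import java.math.BigDecimal;

public class DecimalConversionSketch {
    // Convert a float/double field to BigDecimal via its string form so the
    // decimal matches the value's shortest decimal representation rather than
    // its raw binary expansion.
    static BigDecimal toDecimal(Object field) {
        if (field instanceof Float || field instanceof Double) {
            return new BigDecimal(field.toString());
        }
        throw new IllegalArgumentException(
                "Expected a float/double but got " + field.getClass().getName());
    }

    public static void main(String[] args) {
        System.out.println(toDecimal(98.5f));   // 98.5
        System.out.println(toDecimal(50000.0)); // 50000.0
    }
}
```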

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| ParquetReadStrategy.java | Core implementation to support user-defined schema types with conversion logic |
| ParquetReadStrategyTest.java | Added comprehensive test case and expanded test data generation |
| BaseHdfsFileSource.java | Updated to use the new schema handling method for Parquet files |
| TypeUtil.java | Extended type conversion support for float/double to decimal |
| test_user_config_read_parquet.conf | Test configuration file for user-defined schema validation |

```java
                resolveObject(value, valueType)));
        return dataMap;
    case BOOLEAN:
        return Boolean.parseBoolean(field.toString());
```
Copilot AI (Jul 21, 2025)

Boolean.parseBoolean() returns true only for the string "true" (case-insensitive). Parquet boolean fields are typically already Boolean objects, which happen to round-trip safely because Boolean.toString() emits "true"/"false", but any other representation of a true value (for example "1") would silently be converted to false. An explicit instanceof check is safer.

Suggested change:

```diff
-return Boolean.parseBoolean(field.toString());
+if (field instanceof Boolean) {
+    return field;
+} else {
+    String errorMsg = String.format("Expected a Boolean type but got [%s]", field.getClass().getName());
+    throw new FileConnectorException(CommonErrorCodeDeprecated.UNSUPPORTED_DATA_TYPE, errorMsg);
+}
```

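The pitfall flagged above is easy to reproduce in isolation: Boolean.parseBoolean accepts only the literal word "true" (in any case) and silently maps everything else to false. A minimal demonstration:

```java
public class ParseBooleanPitfall {
    public static void main(String[] args) {
        // Only the exact word "true" (any case) parses to true.
        System.out.println(Boolean.parseBoolean("true"));  // true
        System.out.println(Boolean.parseBoolean("TRUE"));  // true
        // Anything else, including "1" or "yes", silently becomes false.
        System.out.println(Boolean.parseBoolean("1"));     // false
        System.out.println(Boolean.parseBoolean("yes"));   // false
        // A Boolean object survives the round-trip only because
        // Boolean.toString() happens to emit "true"/"false".
        Object field = Boolean.TRUE;
        System.out.println(Boolean.parseBoolean(field.toString())); // true
    }
}
```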
```java
        return dataMap;
    case BOOLEAN:
        return Boolean.parseBoolean(field.toString());
    case INT:
```
Copilot AI (Jul 21, 2025)

Integer.parseInt(field.toString()) is an unnecessary round-trip when the field is already an Integer, since toString() on an Integer just produces the value back as a string. Worse, if field is null, field.toString() throws a NullPointerException. Returning Integer fields directly and rejecting null explicitly is safer.

Suggested change:

```diff
 case INT:
+    if (field instanceof Integer) {
+        return field;
+    }
+    if (field == null) {
+        throw new FileConnectorException(
+                CommonErrorCodeDeprecated.UNSUPPORTED_DATA_TYPE,
+                "Field is null and cannot be converted to Integer.");
+    }
```
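The safer shape the suggestion describes can be tried standalone. In this sketch, `toInt` is a hypothetical converter and `IllegalArgumentException` stands in for the connector's FileConnectorException, which is not on the classpath here:

```java
public class IntFieldSketch {
    // Return Integer fields directly, fail fast on null, and only fall back
    // to string parsing for other representations.
    static Integer toInt(Object field) {
        if (field instanceof Integer) {
            return (Integer) field;
        }
        if (field == null) {
            throw new IllegalArgumentException(
                    "Field is null and cannot be converted to Integer.");
        }
        return Integer.parseInt(field.toString());
    }

    public static void main(String[] args) {
        System.out.println(toInt(30));   // 30 (already an Integer)
        System.out.println(toInt("35")); // 35 (parsed from a string)
    }
}
```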
Comment on lines +498 to +500:

```java
int fieldIndex = Arrays.asList(configRowType.getFieldNames()).indexOf(fieldName);

return fieldIndex == -1 ? null : configRowType.getFieldType(fieldIndex);
```
Copilot AI (Jul 21, 2025)

Arrays.asList(configRowType.getFieldNames()).indexOf(fieldName) creates a new ArrayList and performs linear search for each field lookup. Consider using a HashMap to cache field name to index mappings for better performance when processing multiple fields.

Suggested change:

```diff
-int fieldIndex = Arrays.asList(configRowType.getFieldNames()).indexOf(fieldName);
-return fieldIndex == -1 ? null : configRowType.getFieldType(fieldIndex);
+initializeFieldNameToIndexCache(configRowType);
+Integer fieldIndex = fieldNameToIndexCache.get(fieldName);
+return fieldIndex == null ? null : configRowType.getFieldType(fieldIndex);
```

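The cached-lookup idea can be sketched in isolation. Here `FieldIndexCacheSketch` and its plain `String[]` field list are stand-ins for the real row-type API; the point is simply that the name-to-index map is built once, so each lookup is O(1) instead of the O(n) `Arrays.asList(...).indexOf(...)` scan per field:

```java
import java.util.HashMap;
import java.util.Map;

public class FieldIndexCacheSketch {
    private final String[] fieldNames;
    private Map<String, Integer> fieldNameToIndexCache;

    FieldIndexCacheSketch(String[] fieldNames) {
        this.fieldNames = fieldNames;
    }

    // Build the name -> index map lazily, exactly once.
    private Map<String, Integer> cache() {
        if (fieldNameToIndexCache == null) {
            fieldNameToIndexCache = new HashMap<>();
            for (int i = 0; i < fieldNames.length; i++) {
                fieldNameToIndexCache.put(fieldNames[i], i);
            }
        }
        return fieldNameToIndexCache;
    }

    // Returns null when the field is absent, mirroring the original
    // "indexOf == -1 ? null : ..." behavior.
    Integer indexOf(String fieldName) {
        return cache().get(fieldName);
    }

    public static void main(String[] args) {
        FieldIndexCacheSketch s =
                new FieldIndexCacheSketch(new String[] {"name", "salary", "age"});
        System.out.println(s.indexOf("salary"));  // 1
        System.out.println(s.indexOf("missing")); // null
    }
}
```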
```java
record1.put("salary", 50000.0);
record1.put("age", 30);
record1.put("active", true);
record1.put("score", 98.5f);
```
Copilot AI (Jul 21, 2025)

The test data puts a float value (98.5f) into the 'score' field, but the schema defines it as 'double' type. This type mismatch could cause issues in the Parquet writer.

Suggested change:

```diff
-record1.put("score", 98.5f);
+record1.put("score", 98.5);
```

```java
record2.put("salary", 60000.0);
record2.put("age", 35);
record2.put("active", false);
record2.put("score", 89.2f);
```
Copilot AI (Jul 21, 2025)

Same issue as line 408 - putting a float value into a field defined as double type in the schema.

Suggested change:

```diff
-record2.put("score", 89.2f);
+record2.put("score", 89.2);
```

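The float-versus-double mismatch behind both of these comments is visible in plain Java: widening a float to double is only lossless when the value happens to be exactly representable in binary, so a float written into a double-typed column can come back subtly different from the intended decimal value.

```java
public class FloatDoubleMismatch {
    public static void main(String[] args) {
        // 98.5 is exactly representable in binary (1100010.1),
        // so widening the float is lossless here.
        System.out.println((double) 98.5f == 98.5); // true

        // Most decimal fractions are not exactly representable:
        // the float 89.2f widens to a double that is close to,
        // but not equal to, the double literal 89.2.
        System.out.println((double) 89.2f == 89.2); // false
    }
}
```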
@Hisoka-X (Member) left a comment

LGTM. Thanks @JeremyXin

@corgy-w corgy-w merged commit 2bdaeb6 into apache:dev Jul 23, 2025
7 checks passed
dybyte pushed a commit to dybyte/seatunnel that referenced this pull request Jul 23, 2025

Development

Successfully merging this pull request may close these issues.

[BUG][connector-file-hadoop] When reading parquet file field value as BINARY type, write downstream data exception

3 participants