Description
Is your feature request related to a problem or challenge?
While implementing support for Apache Spark ARRAY types in Apache DataFusion Comet, we found that the hardcoded name of the inner list field differs between arrow-rs and Apache Spark.
In arrow-rs, the inner field of a list type is hardcoded to `item`: https://github.com/apache/arrow-rs/blob/f4fde769ab6e1a9b75f890b7f8b47bc22800830b/arrow-schema/src/field.rs#L130
In Apache Spark, however, it is `element`:
scala> spark.sql("select array(1, 2, 3)").printSchema
root
|-- array(1, 2, 3): array (nullable = false)
| |-- element: integer (containsNull = false)
Because of this discrepancy, schema validation fails when the record batch is created:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 309.0 failed 1 times, most
recent failure: Lost task 0.0 in stage 309.0 (TID 797) (Mac-1741305812954.local executor driver):
org.apache.comet.CometNativeException: Invalid argument error: column types must match schema types,
expected List(Field { name: "element", data_type: Int8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata:
{} }) but found List(Field { name: "item", data_type: Int8, nullable: true, dict_id: 0, dict_is_ordered: false,
metadata: {} }) at column index 0
In DataFusion, the list constructor `Field::new_list_field`, which hardcodes the field name, is heavily used. The idea of this ticket is to find a way to parametrize that name. Possible options:
- Replace `Field::new_list_field` with `Field::new`, which allows a custom name to be provided. However, these methods are often called from contexts where no `SessionContext` exists, so there is no way to access a config variable through which the new name could be parametrized.
- Make the name parametrizable in arrow-rs. Unfortunately, arrow-rs has no external config. Environment variables could be used, but that is usually not a good approach.
- Change `RecordBatch::try_new` so that, for list types, it does not check the inner field name and only checks the inner data type and the other field attributes.
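To make the third option more concrete, here is a minimal sketch of a type comparison that ignores the inner list field name. It uses simplified stand-in types rather than the real arrow-rs `Field` and `DataType`, and the function name `types_match_ignoring_list_field_name` is hypothetical. It only illustrates the idea and is not how `RecordBatch::try_new` is actually implemented.

```rust
// Simplified stand-ins for the arrow-rs `DataType` and `Field` types.
// These are assumptions for illustration only, not the real arrow-rs definitions.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int8,
    Int32,
    List(Box<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

impl Field {
    fn new(name: &str, data_type: DataType, nullable: bool) -> Self {
        Field {
            name: name.to_string(),
            data_type,
            nullable,
        }
    }
}

/// Hypothetical helper: compares two types, but for lists it ignores the
/// name of the inner field and checks only its data type and nullability.
fn types_match_ignoring_list_field_name(a: &DataType, b: &DataType) -> bool {
    match (a, b) {
        (DataType::List(x), DataType::List(y)) => {
            x.nullable == y.nullable
                && types_match_ignoring_list_field_name(&x.data_type, &y.data_type)
        }
        // Non-list types (and list vs. non-list) fall back to strict equality.
        _ => a == b,
    }
}
```

With these stand-ins, a list whose inner field is named `element` and one whose inner field is named `item` are unequal under strict comparison but match under the relaxed one, provided the inner data type and nullability agree.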
Related apache/datafusion-comet#1456
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response