Parametrize ListArray inner field #15162

@comphead

Is your feature request related to a problem or challenge?

While implementing support for ARRAY types from Apache Spark in Apache DataFusion Comet, we found that the hardcoded name of the inner list field differs between arrow-rs and Apache Spark.

The inner ListType field is hardcoded to item in https://github.com/apache/arrow-rs/blob/f4fde769ab6e1a9b75f890b7f8b47bc22800830b/arrow-schema/src/field.rs#L130

However, it is element in Apache Spark:

scala> spark.sql("select array(1, 2, 3)").printSchema
root
 |-- array(1, 2, 3): array (nullable = false)
 |    |-- element: integer (containsNull = false)

Because of this discrepancy, schema validation fails when the record batch is created:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 309.0 failed 1 times, most
 recent failure: Lost task 0.0 in stage 309.0 (TID 797) (Mac-1741305812954.local executor driver): 
org.apache.comet.CometNativeException: Invalid argument error: column types must match schema types, 
expected List(Field { name: "element", data_type: Int8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: 
{} }) but found List(Field { name: "item", data_type: Int8, nullable: true, dict_id: 0, dict_is_ordered: false, 
metadata: {} }) at column index 0
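
To make the failure mode concrete, here is a minimal sketch of why the check rejects the batch. The `DataType` and `Field` types below are simplified stand-ins written for this illustration, not the real arrow-rs definitions, and `check_column_type` is a hypothetical helper that only mimics a strict equality check in which nested field names participate.

```rust
// Simplified stand-ins for illustration only; NOT the real arrow_schema types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int8,
    List(Box<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

impl Field {
    fn new(name: &str, data_type: DataType, nullable: bool) -> Self {
        Field { name: name.to_string(), data_type, nullable }
    }
}

/// List type as Spark describes it: the inner field is named "element".
fn spark_list() -> DataType {
    DataType::List(Box::new(Field::new("element", DataType::Int8, true)))
}

/// List type built with the hardcoded inner name "item".
fn arrow_list() -> DataType {
    DataType::List(Box::new(Field::new("item", DataType::Int8, true)))
}

/// Hypothetical strict check: the derived `PartialEq` also compares the
/// inner field name, so two lists of Int8 can still be "different" types.
fn check_column_type(expected: &DataType, found: &DataType) -> Result<(), String> {
    if expected == found {
        Ok(())
    } else {
        Err(format!(
            "column types must match schema types, expected {:?} but found {:?}",
            expected, found
        ))
    }
}
```

In this model, `check_column_type(&spark_list(), &arrow_list())` returns an error even though both sides are nullable lists of Int8, because the only difference, the inner field name, is part of the comparison.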

In DataFusion, the list creation method Field::new_list_field, which hardcodes the field name, is heavily used. The idea of this ticket is to find a way to parametrize that name. Possible options:

  • Replace Field::new_list_field with Field::new, which accepts a custom name. However, these methods are often called from contexts where no SessionContext exists, so there is no way to read a config variable that would carry the new name
  • Make the name parametrizable in arrow-rs. Unfortunately, arrow-rs has no external config. Environment variables could be used, but that is usually not a good approach
  • Change RecordBatch::try_new so that, for list types, it does not check the inner field name and only checks the inner data type and the other field attributes
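
The first and third options can be sketched as follows. This is an illustration under assumptions, not a proposed patch: `DataType` and `Field` are simplified stand-ins rather than the arrow-rs types, and `list_of` and `types_match_ignoring_list_names` are hypothetical helper names introduced only for this example.

```rust
// Simplified stand-ins for illustration only; NOT the real arrow_schema types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int8,
    Utf8,
    List(Box<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

/// Option 1 (hypothetical helper): build a list type whose inner field
/// name is chosen by the caller instead of being hardcoded.
fn list_of(inner_name: &str, element: DataType, nullable: bool) -> DataType {
    DataType::List(Box::new(Field {
        name: inner_name.to_string(),
        data_type: element,
        nullable,
    }))
}

/// Option 3 (hypothetical helper): compare two types while ignoring the
/// name of the inner list field. Nullability and the element type are
/// still compared, recursively for nested lists.
fn types_match_ignoring_list_names(a: &DataType, b: &DataType) -> bool {
    match (a, b) {
        (DataType::List(fa), DataType::List(fb)) => {
            fa.nullable == fb.nullable
                && types_match_ignoring_list_names(&fa.data_type, &fb.data_type)
        }
        // Non-list types (and list vs. non-list) fall back to strict equality.
        _ => a == b,
    }
}
```

With these helpers, `list_of("element", DataType::Int8, true)` and `list_of("item", DataType::Int8, true)` are unequal under strict equality but match under the relaxed comparison, while a different element type or nullability is still rejected. The trade-off of the third option is that a batch may then carry a different inner name than its schema declares, which downstream consumers would need to tolerate.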

Related apache/datafusion-comet#1456

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

Labels: enhancement
