[Experimental] Integrate Comet native reader with remote HDFS #1336

@comphead

What is the problem the feature request solves?

Currently, Apache DataFusion Comet reads data from the underlying sources using the built-in Comet reader, which lacks support for processing nested types.

There is an experimental feature

  val COMET_NATIVE_SCAN_IMPL: ConfigEntry[String] = conf("spark.comet.scan.impl")
    .doc(
      s"The implementation of Comet Native Scan to use. Available modes are '$SCAN_NATIVE_COMET', " +
        s"'$SCAN_NATIVE_DATAFUSION', and '$SCAN_NATIVE_ICEBERG_COMPAT'. " +
        s"'$SCAN_NATIVE_COMET' is for the original Comet native scan which uses a JVM-based " +
        "Parquet file reader and native column decoding. Supports simple types only. " +
        s"'$SCAN_NATIVE_DATAFUSION' is a fully native implementation of scan based on DataFusion. " +
        s"'$SCAN_NATIVE_ICEBERG_COMPAT' is a native implementation that exposes APIs to read " +
        "Parquet columns natively.")
    .internal()
    .stringConf
    .transform(_.toLowerCase(Locale.ROOT))
    .checkValues(Set(SCAN_NATIVE_COMET, SCAN_NATIVE_DATAFUSION, SCAN_NATIVE_ICEBERG_COMPAT))
    .createWithDefault(sys.env
      .getOrElse("COMET_PARQUET_SCAN_IMPL", SCAN_NATIVE_COMET)
      .toLowerCase(Locale.ROOT))

This feature makes it possible to scan the data using the DataFusion native reader, which supports Arrow nested types. However, that reader must also be able to read data from a remote HDFS filesystem.
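To make the resolution order in the config entry above easier to follow, here is a minimal, dependency-free sketch of the same logic: an explicitly set value wins, otherwise the COMET_PARQUET_SCAN_IMPL environment variable is used, otherwise the original Comet scan is the default. The value is lowercased and checked against the allowed set. The object name, the `resolve` helper, and the three string values are assumptions made for illustration; the issue text only shows the constant names, not their values, and this is not Comet's actual implementation.

```scala
import java.util.Locale

object ScanImplResolution {
  // Assumed string values for the constants referenced in the config entry;
  // the issue does not spell them out.
  val ScanNativeComet = "native_comet"
  val ScanNativeDataFusion = "native_datafusion"
  val ScanNativeIcebergCompat = "native_iceberg_compat"

  private val validValues =
    Set(ScanNativeComet, ScanNativeDataFusion, ScanNativeIcebergCompat)

  // Hypothetical helper: an explicit setting takes precedence, then the
  // COMET_PARQUET_SCAN_IMPL environment variable, then the default scan.
  def resolve(
      explicit: Option[String],
      env: Map[String, String] = sys.env): Either[String, String] = {
    val raw = explicit
      .orElse(env.get("COMET_PARQUET_SCAN_IMPL"))
      .getOrElse(ScanNativeComet)
    // Mirrors .transform(_.toLowerCase(Locale.ROOT)) followed by .checkValues(...)
    val normalized = raw.toLowerCase(Locale.ROOT)
    if (validValues.contains(normalized)) Right(normalized)
    else Left(s"Unsupported value '$raw'; expected one of ${validValues.mkString(", ")}")
  }
}
```

Under these assumptions, selecting the DataFusion reader would presumably amount to setting spark.comet.scan.impl to the DataFusion mode (or exporting COMET_PARQUET_SCAN_IMPL) before the session starts. Reading from remote HDFS is the part this issue still has to address.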

There are some object store implementations available to work with HDFS which are

Subtasks

Describe the potential solution

No response

Additional context

No response

Labels

enhancement (New feature or request)