Closed
Labels: enhancement (New feature or request)
Description
What is the problem the feature request solves?
Currently, Apache DataFusion Comet reads data from underlying sources using the built-in Comet reader, which lacks support for processing nested types.
There is an experimental feature:

```scala
val COMET_NATIVE_SCAN_IMPL: ConfigEntry[String] = conf("spark.comet.scan.impl")
  .doc(
    s"The implementation of Comet Native Scan to use. Available modes are '$SCAN_NATIVE_COMET', " +
      s"'$SCAN_NATIVE_DATAFUSION', and '$SCAN_NATIVE_ICEBERG_COMPAT'. " +
      s"'$SCAN_NATIVE_COMET' is the original Comet native scan, which uses a JVM-based " +
      "Parquet file reader and native column decoding. Supports simple types only. " +
      s"'$SCAN_NATIVE_DATAFUSION' is a fully native implementation of scan based on DataFusion. " +
      s"'$SCAN_NATIVE_ICEBERG_COMPAT' is a native implementation that exposes APIs to read " +
      "Parquet columns natively.")
  .internal()
  .stringConf
  .transform(_.toLowerCase(Locale.ROOT))
  .checkValues(Set(SCAN_NATIVE_COMET, SCAN_NATIVE_DATAFUSION, SCAN_NATIVE_ICEBERG_COMPAT))
  .createWithDefault(sys.env
    .getOrElse("COMET_PARQUET_SCAN_IMPL", SCAN_NATIVE_COMET)
    .toLowerCase(Locale.ROOT))
```
to scan the data using the DataFusion native reader, which supports Arrow nested types; however, the reader must also be able to read data from a remote HDFS filesystem.
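For context, the experimental scan can be selected through the config entry above. The sketch below is illustrative only: the config key and allowed values come from the quoted snippet, while the surrounding session setup (app name, plugin class) is an assumption about a typical Comet-enabled Spark session, not taken from this issue.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical setup for a Comet-enabled session that opts into the
// experimental DataFusion-based scan. "spark.comet.scan.impl" and the
// value "native_datafusion" follow the config entry quoted above; the
// plugin class name is an assumption.
val spark = SparkSession
  .builder()
  .appName("comet-native-datafusion-scan")
  .config("spark.plugins", "org.apache.spark.CometPlugin")
  .config("spark.comet.enabled", "true")
  .config("spark.comet.scan.impl", "native_datafusion")
  .getOrCreate()
```

Equivalently, per the `createWithDefault` clause above, the default implementation can be chosen via the `COMET_PARQUET_SCAN_IMPL` environment variable.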
There are some object store implementations available for working with HDFS:
- https://github.com/datafusion-contrib/datafusion-objectstore-hdfs (a native object store on top of libhdfs and the JVM; higher memory usage, but richer client settings such as retry and network options)
- https://github.com/datafusion-contrib/hdfs-native-object-store (fewer client options, but no JVM dependency)
Subtasks
- Create optional HDFS feature for Comet #1337
- Use HDFS file system based on some parameter (schema or spark.defaultFS) #1360
- Remote HDFS tests with Minikube #1367
- Documentation
- Add separate HDFS submodule to Comet #1368
- Enable hdfs test(s) in ci #1515
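Once the subtasks above land, the intended user-facing behavior would presumably look like the sketch below: the native scan dispatches to an HDFS-backed object store based on the path scheme (or on the configured default filesystem when the path is scheme-less). The namenode address, paths, and column names are hypothetical.

```scala
// Hypothetical usage after native HDFS support is wired in: a plain
// hdfs:// path read through the DataFusion-based scan, including a
// nested-column projection that the original native_comet scan
// (simple types only) could not handle.
val df = spark.read.parquet("hdfs://namenode:9000/warehouse/events")
df.select("id", "payload.items").show()
```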
Describe the potential solution
No response
Additional context
No response