feat: add experimental remote HDFS support for native DataFusion reader #1359
comphead merged 8 commits into apache:main
Conversation
native/core/Cargo.toml
```toml
object_store = { workspace = true }
url = { workspace = true }
chrono = { workspace = true }
datafusion-objectstore-hdfs = { git = "https://github.com/comphead/datafusion-objectstore-hdfs", branch = "master", optional = true }
```
@andygrove I'm keeping the updated HDFS object storage in a personal repo for now, let me know if there are any concerns.
Is there an expected timeline for when we can move to an official release? In the meantime, since we have pointed to a personal repo in the past, it is reasonable to do so for this as well (especially since this is already behind some configuration flags).
native/core/src/execution/planner.rs
```diff
  assert_eq!(file_groups.len(), partition_count);

- let object_store_url = ObjectStoreUrl::local_filesystem();
+ let object_store_url = ObjectStoreUrl::parse("hdfs://namenode:9000").unwrap();
```
The url should be available as part of the file path passed in. (see line 1178 above)
Thanks @parthchandra, it is already fixed.
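For illustration, a minimal sketch of deriving the object store URL from the file path itself, as the reviewer suggests (the helper name and fallback behavior are assumptions, not the PR's actual code):

```rust
use datafusion::datasource::object_store::ObjectStoreUrl;
use url::{Position, Url};

// Hypothetical helper: derive the store URL from a fully qualified path
// such as "hdfs://namenode:9000/warehouse/db/t/part-0.parquet".
fn object_store_url_from_path(path: &str) -> ObjectStoreUrl {
    match Url::parse(path) {
        // Keep only `scheme://authority`, which identifies the store.
        Ok(url) => ObjectStoreUrl::parse(&url[..Position::BeforePath])
            .unwrap_or_else(|_| ObjectStoreUrl::local_filesystem()),
        // Relative or plain local paths fall back to the local store.
        Err(_) => ObjectStoreUrl::local_filesystem(),
    }
}
```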
```rust
    session_context: Arc<SessionContext>,
) -> Result<(), ExecutionError> {
    // TODO: read the namenode configuration from file schema or from spark.defaultFS
    let url = Url::try_from("hdfs://namenode:9000").unwrap();
```
```rust
    }
}

#[cfg(not(feature = "hdfs"))]
```
The `hdfs` cargo feature enables conditional compilation, so the HDFS code is only built when it is needed.
```rust
pub(crate) fn register_object_store(
    session_context: Arc<SessionContext>,
) -> Result<(), ExecutionError> {
    let object_store = object_store::local::LocalFileSystem::new();
```
It doesn't have to be only a local file system.
It depends on the features enabled for Comet. LocalFileSystem is the default if no specific features are selected.
The annotation on this method is
#[cfg(not(feature = "hdfs"))]
This allows plugging in other features like S3, etc.
This particular method handles the case where no remote feature is selected, i.e. the local filesystem.
If a feature is selected, conditional compilation will register an object store related to that feature, like HDFS or S3.
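As a rough sketch of how the two feature-gated variants coexist (simplified: the return type here is DataFusion's `Result` rather than Comet's `ExecutionError` so the snippet stands alone, and the HDFS branch body is elided):

```rust
use std::sync::Arc;

use datafusion::datasource::object_store::ObjectStoreUrl;
use datafusion::error::Result;
use datafusion::execution::context::SessionContext;
use object_store::local::LocalFileSystem;
use url::Url;

// Compiled only when Comet is built with `--features hdfs`:
// registers an HDFS-backed object store (body elided in this sketch).
#[cfg(feature = "hdfs")]
pub(crate) fn register_object_store(
    _session_context: Arc<SessionContext>,
) -> Result<ObjectStoreUrl> {
    // ... build the HDFS store from the configured namenode URL ...
    unimplemented!("hdfs registration elided")
}

// Fallback compiled when `hdfs` is NOT enabled: the local filesystem.
#[cfg(not(feature = "hdfs"))]
pub(crate) fn register_object_store(
    session_context: Arc<SessionContext>,
) -> Result<ObjectStoreUrl> {
    let object_store = LocalFileSystem::new();
    let url = Url::parse("file://").expect("static URL is valid");
    session_context
        .runtime_env()
        .register_object_store(&url, Arc::new(object_store));
    Ok(ObjectStoreUrl::local_filesystem())
}
```

Because the two functions share a name but live under mutually exclusive `cfg` attributes, exactly one of them exists in any given build.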
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@              Coverage Diff              @@
##              main     #1359       +/-   ##
=============================================
- Coverage     56.12%   39.32%    -16.81%
- Complexity      976     2081      +1105
=============================================
  Files           119      265       +146
  Lines         11743    61132     +49389
  Branches       2251    12962     +10711
=============================================
+ Hits           6591    24040     +17449
- Misses         4012    32587     +28575
- Partials       1140     4505      +3365
```

☔ View full report in Codecov by Sentry.
@andygrove @parthchandra @mbutrovich @kazuyukitanimura can I have a review please?
parthchandra left a comment
Is there a way to 'mock' hdfs and write a unit test? I suppose using the hdfs support to read a local file should do just as well.
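As a rough illustration of such a test, assuming it lives next to the feature-gated `register_object_store` sketched earlier (the expected URL string is an assumption about DataFusion's `file://` normalization):

```rust
#[cfg(all(test, not(feature = "hdfs")))]
mod tests {
    use std::sync::Arc;

    use datafusion::execution::context::SessionContext;

    #[test]
    fn registers_local_filesystem_by_default() {
        let ctx = Arc::new(SessionContext::new());
        // `register_object_store` is the feature-gated function above.
        let url = super::register_object_store(Arc::clone(&ctx)).unwrap();
        // The default build should hand back the local filesystem URL.
        assert_eq!(url.as_str(), "file:///");
    }
}
```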
Makefile
```diff
  	./mvnw install -Prelease -DskipTests $(PROFILES)

  release:
- 	cd native && RUSTFLAGS="-Ctarget-cpu=native" cargo build --release
+ 	cd native && RUSTFLAGS="$(RUSTFLAGS) -Ctarget-cpu=native" && RUSTFLAGS=$$RUSTFLAGS cargo build --release $(FEATURES_ARG)
```
`$$RUSTFLAGS`? Never seen this pattern before. What does this do?
Makefile syntax is slightly different from bash, from what I learned: to access an environment variable created on the fly, it has to be referenced with `$$`.
The variable is created on the fly to concatenate the release-specific RUSTFLAGS with any flags the user has set.
Thanks. Learnt something new today :)
```rust
    .register_object_store(&url, Arc::new(object_store));

// By default, local FS object store registered
// if `hdfs` feature enabled then HDFS file object store registered
let object_store_url = register_object_store(Arc::clone(&self.session_ctx))?;
```
Should we update this function (get_file_path) as well?
It's currently used by NATIVE_ICEBERG_COMPAT but the goal is to unify it with COMET_DATAFUSION.
That's a good point; to verify it we probably need to read Iceberg from HDFS, which can be done in #1367.
We don't need to wait for the actual Iceberg integration. CometScan will use COMPAT_ICEBERG if the configuration is set (that's how we are able to run the unit tests).
@comphead we can log a follow-up issue to update `get_file_path` if you like.
Thanks @parthchandra, let's create a follow-up ticket. I'd appreciate it if you do it, as I'm afraid I might miss some Iceberg details in the ticket description.
#1407. There's no detail in the PR. Can you assign it to me if possible, and I'll remember to take care of it.
Thanks @parthchandra, the integration test will be addressed as part of #1367. We also need to think about whether it should be a separate flow in CI.
```rust
pub(crate) fn register_object_store(
    session_context: Arc<SessionContext>,
) -> Result<ObjectStoreUrl, ExecutionError> {
    // TODO: read the namenode configuration from file schema or from spark.defaultFS
```
Do we need to register the object store from `native_scan.file_partitions`?
Thanks @wForget, I'm not sure I'm getting it. Do you mean that a better place to register the object store would be inside the `file_partitions` iterator loop?
> do you mean that a better place to register the object store would be inside the `file_partitions` iterator loop?

Yes. Is it possible that native scan paths correspond to multiple object stores, or differ from spark.defaultFS?
For HDFS/S3 the default fs can be taken from the `spark.hadoop.fs.defaultFS` parameter.
Supporting multiple object stores is an interesting idea, however I'm not sure when it can be addressed.
> for HDFS/S3 the default fs can be taken from the `spark.hadoop.fs.defaultFS` parameter.

Sometimes I also access other HDFS namespaces, like:

select * from `parquet`.`hdfs://other-ns:8020/warehouse/db/table`
That is an interesting scenario, I'll add a separate test case for it.
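For illustration, a hypothetical sketch of how per-path registration could discover multiple stores: collect the distinct `scheme://authority` prefixes across the scan paths, so `hdfs://other-ns:8020/...` gets its own store alongside the default fs (the helper name is illustrative, not the PR's code):

```rust
use std::collections::HashSet;

use url::{Position, Url};

// Hypothetical helper: collect each distinct `scheme://authority` prefix
// seen in the scan file paths; each entry would need its own store.
fn distinct_store_urls(paths: &[&str]) -> HashSet<String> {
    paths
        .iter()
        .filter_map(|p| Url::parse(p).ok())
        .map(|u| u[..Position::BeforePath].to_string())
        .collect()
}

fn main() {
    let stores = distinct_store_urls(&[
        "hdfs://namenode:9000/warehouse/db/t/part-0.parquet",
        "hdfs://other-ns:8020/warehouse/db/table/part-0.parquet",
    ]);
    // Two namespaces, so two object stores would have to be registered.
    assert_eq!(stores.len(), 2);
}
```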
Oops, @comphead would you mind merging the latest main into this PR branch in order to resolve the conflict?
# Conflicts:
#	native/core/src/parquet/parquet_support.rs
Resolved, please have another look
kazuyukitanimura left a comment
LGTM unless @parthchandra has more comments
Provide the JRE linker path in `RUSTFLAGS`; the path can vary depending on the system. Typically the JRE linker is part of the installed JDK.

```shell
export JAVA_HOME="/opt/homebrew/opt/openjdk@11"
```
nit: is `JAVA_HOME` still required?
Thanks everyone
feat: add experimental remote HDFS support for native DataFusion reader (apache#1359)

* feat: add experimental remote HDFS support for native DataFusion reader
Which issue does this PR close?
Closes #1337.
Depends on #1368
Rationale for this change
What changes are included in this PR?
How are these changes tested?
Manually, by starting a remote HDFS cluster and running queries against it.