chore: Adding an optional hdfs crate #1377
Conversation
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1377       +/-  ##
=============================================
- Coverage     56.12%   39.06%   -17.06%
- Complexity      976     2071     +1095
=============================================
  Files           119      263      +144
  Lines         11743    60746    +49003
  Branches       2251    12909    +10658
=============================================
+ Hits           6591    23733    +17142
- Misses         4012    32530    +28518
- Partials       1140     4483     +3343
```

☔ View full report in Codecov by Sentry.
# Conflicts:
#	native/core/Cargo.toml
parthchandra left a comment:
Can we have a README that covers any changes we have made (if any), and what versions of hdfs client are supported (and on what platforms)?
Also, if we have made changes, can we add a comment in the code itself to mark the areas where the code has been updated?
In the datafusion-contrib org I see many hdfs crates; which one is best for Comet?
I know it was brought up in the original issue, but just plugging my pure Rust implementation: https://github.com/datafusion-contrib/hdfs-native-object-store. It's already integrated in delta-rs and delta-kernel-rs. Though this arguably could be the one time a JNI-based implementation makes sense, since you're guaranteed to have Java installed, and your classpath is probably already set correctly since you're running Spark. This doesn't seem like something whose implementation should be copied into the Comet repo, as that seems out of scope.
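For context, wiring that crate into DataFusion is small. Here is a minimal sketch, assuming the `HdfsObjectStore::with_url` constructor exposed by recent releases of hdfs-native-object-store; the namenode URL is a placeholder, and this is not the integration code from this PR:

```rust
use std::sync::Arc;

use datafusion::prelude::SessionContext;
use hdfs_native_object_store::HdfsObjectStore;
use url::Url;

// Sketch: register a pure-Rust HDFS store for the hdfs:// scheme so that
// DataFusion scans resolve paths through it (placeholder namenode URL).
fn register_hdfs(ctx: &SessionContext) -> Result<(), Box<dyn std::error::Error>> {
    let store = HdfsObjectStore::with_url("hdfs://namenode:9000")?;
    let url = Url::parse("hdfs://namenode:9000")?;
    ctx.register_object_store(&url, Arc::new(store));
    Ok(())
}
```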
native/hdfs/Cargo.toml (outdated)

```toml
# Example: JAVA_HOME="/opt/homebrew/opt/openjdk@11" cargo build --features=hdfs

[package]
name = "datafusion-objectstore-hdfs"
```
Perhaps this should start with `datafusion-comet-`?
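As the build example in that snippet shows, HDFS support is compiled in only with `--features=hdfs`. A minimal sketch of what such a compile-time feature gate typically looks like on the Rust side; the function below is illustrative, not Comet's actual code:

```rust
// Hypothetical feature gate: with `--features=hdfs` the hdfs:// scheme is
// handled; without it, the scheme is reported as unsupported and the
// crate's libhdfs/JVM dependencies are not built at all.
#[cfg(feature = "hdfs")]
pub fn supports_hdfs_scheme(scheme: &str) -> bool {
    scheme == "hdfs"
}

#[cfg(not(feature = "hdfs"))]
pub fn supports_hdfs_scheme(_scheme: &str) -> bool {
    false
}
```

Keeping the dependency behind a feature means the default build stays lean, which is presumably the point of making the crate optional.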
native/hdfs/Cargo.toml (outdated)

```toml
[package]
name = "datafusion-objectstore-hdfs"
description = "Comet HDFS integration. Initiated by Yanghong Zhong <[email protected]> (https://github.com/datafusion-contrib/datafusion-objectstore-hdfs)"
```
I think it would be better to give credit in https://github.com/apache/datafusion-comet/blob/main/NOTICE.txt.
Hey @Kimahriman, it's nice to see you here. We checked the contrib crates https://github.com/datafusion-contrib/hdfs-native-object-store and https://github.com/datafusion-contrib/datafusion-objectstore-hdfs. I really like the crate you mentioned because it has a smaller memory footprint and no JVM dependency, and therefore no JVM round trips. Can't wait for this crate to grow up so we can use it. For now …
This is a good point.
If there are custom settings you rely on, please file an issue! We've been using this internally a decent amount and haven't seen many issues at all. Also curious how Comet handles custom Hadoop configs for accessing object stores (i.e. …
That's a nice point. I hope the crate matures as quickly as possible, and it's a good idea to create feature requests; I'll do that.
I followed up on some comments; otherwise LGTM.
All of Comet's hdfs interaction is in the Parquet …
Yeah, that's all I was referring to; I was wondering how Hadoop configs would get translated to DataFusion scan/object store configs. I thought maybe the whole thing would still go through JNI for Hadoop file system interactions. I guess you could actually use the JNI-based object store in this PR for any Hadoop file system.
A small question: is it possible to use this crate to access other Hadoop-compatible filesystems? (libhdfs.so can do this)
Right now, Comet is losing the configs passed in to Hadoop, which is something we will address (soon?). Converting those configs to DataFusion configs will have to be done as and when we encounter them. To start with, the configs from hadoop-aws would probably be the first to be handled. I haven't dug any deeper into this yet.
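To make that intent concrete, here is a hedged sketch of the kind of translation being described: mapping hadoop-aws (`fs.s3a.*`) options onto `object_store`'s S3 builder. The `fs.s3a.*` key names are the standard Hadoop ones, but the function itself is illustrative and not part of this PR:

```rust
use std::collections::HashMap;

use object_store::aws::{AmazonS3, AmazonS3Builder};

// Illustrative mapping of hadoop-aws options to object_store settings.
// Only a few common keys are shown; a real implementation would grow
// this as new configs are encountered.
fn s3_from_hadoop_conf(
    bucket: &str,
    conf: &HashMap<String, String>,
) -> object_store::Result<AmazonS3> {
    let mut builder = AmazonS3Builder::new().with_bucket_name(bucket);
    if let Some(v) = conf.get("fs.s3a.access.key") {
        builder = builder.with_access_key_id(v.as_str());
    }
    if let Some(v) = conf.get("fs.s3a.secret.key") {
        builder = builder.with_secret_access_key(v.as_str());
    }
    if let Some(v) = conf.get("fs.s3a.endpoint") {
        builder = builder.with_endpoint(v.as_str());
    }
    builder.build()
}
```

The Hadoop configuration entries are already available on the Spark side, so the lookup would happen there before handing the resulting store to DataFusion.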
@kazuyukitanimura @parthchandra @andygrove, I'd appreciate it if you could have another look.
```yaml
run: |
  apt-get update
  apt-get install -y protobuf-compiler
  apt-get install -y clang
```
So we did not have clang before?
I think we don't, as fs-hdfs previously failed with an error that it could not find the libclang*.so files.
```toml
dependencies = [
    "backtrace",
    "bytes",
    "parking_lot",
```
wondering why this existing dependency has changed
Co-authored-by: Andy Grove <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
Thanks everyone!
Hey, is there a reason for not using hdfs-sys or hdrs? It looks like they support many more versions of HDFS (2.2 to 3.3) in a single crate instead of two.
Thanks @rluvaton. The best alternative I can see right now is https://github.com/datafusion-contrib/hdfs-native-object-store, once its hdfs client parameter support has been extended.
* Adding an optional `hdfs` crate
* Update NOTICE.txt
Co-authored-by: Andy Grove <[email protected]>
* Update NOTICE.txt
Co-authored-by: Andy Grove <[email protected]>
---------
Co-authored-by: Andy Grove <[email protected]>
Which issue does this PR close?
Closes #1368.
Rationale for this change
What changes are included in this PR?
How are these changes tested?