feat: introduce hadoop mini cluster to test native scan on hdfs#1556
feat: introduce hadoop mini cluster to test native scan on hdfs#1556kazuyukitanimura merged 3 commits intoapache:mainfrom
Conversation
| url = { workspace = true } | ||
| parking_lot = "0.12.3" | ||
| datafusion-comet-objectstore-hdfs = { path = "../hdfs", optional = true} | ||
| datafusion-comet-objectstore-hdfs = { path = "../hdfs", optional = true, default-features = false, features = ["hdfs"] } |
There was a problem hiding this comment.
disable try_spawn_blocking to avoid native thread hanging
There was a problem hiding this comment.
@wForget any idea what causes the native thread to hang when try_spawn_blocking is used?
There was a problem hiding this comment.
@wForget any idea what causes the native thread to hang when
try_spawn_blockingis used?
Sorry, I didn’t investigate this issue more thoroughly.
| import org.apache.hadoop.hdfs.client.HdfsClientConfigKeys | ||
| import org.apache.spark.internal.Logging | ||
|
|
||
| trait WithHdfsCluster extends Logging { |
There was a problem hiding this comment.
Let's leave a comment that this was taken from kyuubi
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1556 +/- ##
============================================
+ Coverage 56.12% 58.39% +2.26%
- Complexity 976 977 +1
============================================
Files 119 122 +3
Lines 11743 12217 +474
Branches 2251 2280 +29
============================================
+ Hits 6591 7134 +543
+ Misses 4012 3951 -61
+ Partials 1140 1132 -8 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
|
||
| <dependency> | ||
| <groupId>org.apache.hadoop</groupId> | ||
| <artifactId>hadoop-client-minicluster</artifactId> |
There was a problem hiding this comment.
Do we need this dependency instead? https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-minicluster/3.3.4
Not sure what the difference is as long as both allow us to spin up a miniDFSCluster
There was a problem hiding this comment.
My understanding is that hadoop-client-minicluster has fewer dependencies, and it depends on hadoop-client-runtime which is a shaded hadoop client (to avoid introducing conflicts)
There was a problem hiding this comment.
hadoop-client-minicluster seems to be a fat jar of hadoop mini cluster, so is it more suitable as a dependency for testing?
There was a problem hiding this comment.
Ah, I did not know that. It doesn't matter which one we use then.
| sqlBenchmark.addCase("SQL Parquet - Comet") { _ => | ||
| withSQLConf( | ||
| CometConf.COMET_ENABLED.key -> "true", | ||
| CometConf.COMET_EXEC_ENABLED.key -> "true", |
There was a problem hiding this comment.
CometBenchmarkBase sets COMET_EXEC_ENABLED to false by default, while native scan requires COMET_EXEC_ENABLED=true
There was a problem hiding this comment.
Line 68 is indeed redundant configuration, I will remove it.
kazuyukitanimura
left a comment
There was a problem hiding this comment.
My only last question is whether we should enable native exec for the microbenchmarks
| sqlBenchmark.addCase("SQL Parquet - Comet Native DataFusion") { _ => | ||
| withSQLConf( | ||
| CometConf.COMET_ENABLED.key -> "true", | ||
| CometConf.COMET_EXEC_ENABLED.key -> "true", |
There was a problem hiding this comment.
I am not sure if we should enable COMET_EXEC_ENABLED as it will mix the scan benchmark and exec benchmark
There was a problem hiding this comment.
I am not sure if we should enable
COMET_EXEC_ENABLEDas it will mix the scan benchmark and exec benchmark
It seems difficult to benchmark only scan anyway. If we disable exec conversion, it may introduce the performance loss of ColumnarToRow.
|
Merged thanks @wForget |
…he#1556) ## Which issue does this PR close? Closes apache#1515. ## Rationale for this change test native scan on hdfs ## What changes are included in this PR? introduce hadoop mini cluster to test native scan on hdfs ## How are these changes tested? Successfully run `CometReadHdfsBenchmark` locally (tips: build native enable hdfs: `cd native && cargo build --features hdfs`)
Which issue does this PR close?
Closes #1515.
Rationale for this change
test native scan on hdfs
What changes are included in this PR?
introduce hadoop mini cluster to test native scan on hdfs
How are these changes tested?
Successfully run
CometReadHdfsBenchmarklocally (tips: build native enable hdfs:cd native && cargo build --features hdfs)