feat: introduce hadoop mini cluster to test native scan on hdfs by wForget · Pull Request #1556 · apache/datafusion-comet

wForget · 2025-03-19T08:23:39Z

Which issue does this PR close?

Closes #1515.

Rationale for this change

Test the native scan on HDFS.

What changes are included in this PR?

Introduce a Hadoop mini cluster to test the native scan on HDFS.

How are these changes tested?

Successfully ran `CometReadHdfsBenchmark` locally (tip: build the native library with HDFS enabled: `cd native && cargo build --features hdfs`).

wForget · 2025-03-19T08:24:59Z

native/core/Cargo.toml

 url = { workspace = true }
 parking_lot = "0.12.3"
-datafusion-comet-objectstore-hdfs = { path = "../hdfs", optional = true}
+datafusion-comet-objectstore-hdfs = { path = "../hdfs", optional = true, default-features = false, features = ["hdfs"] }


Disable try_spawn_blocking to avoid the native thread hanging.

@wForget any idea what causes the native thread to hang when try_spawn_blocking is used?

Sorry, I didn’t investigate this issue more thoroughly.

wForget · 2025-03-19T08:26:49Z

spark/src/test/scala/org/apache/comet/WithHdfsCluster.scala

+import org.apache.hadoop.hdfs.client.HdfsClientConfigKeys
+import org.apache.spark.internal.Logging
+
+trait WithHdfsCluster extends Logging {


Mostly copied from Kyuubi.

Let's leave a comment that this was taken from kyuubi

Thanks, added
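
For readers unfamiliar with the approach, the sketch below shows roughly how a test trait can start and stop an in-process HDFS using Hadoop's MiniDFSCluster. It is not the trait added in this PR (that one is adapted from Kyuubi). The name MiniHdfsSketch and its methods are hypothetical, and the sketch assumes a Hadoop mini cluster artifact such as hadoop-client-minicluster is on the test classpath.

```scala
import java.nio.file.Files

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hdfs.MiniDFSCluster

// Hypothetical helper for illustration; not the WithHdfsCluster trait from this PR.
trait MiniHdfsSketch {
  private var cluster: MiniDFSCluster = _

  /** Starts a single-DataNode HDFS inside the current JVM. */
  def startHdfs(): Unit = {
    val conf = new Configuration()
    // Keep NameNode/DataNode data under a fresh temporary directory.
    val baseDir = Files.createTempDirectory("mini-hdfs").toFile
    conf.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, baseDir.getAbsolutePath)
    cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build()
    // Block until the NameNode and DataNode are up.
    cluster.waitActive()
  }

  /** File system handle bound to the mini cluster. */
  def fileSystem: FileSystem = cluster.getFileSystem

  /** URI of the running cluster, e.g. for building hdfs:// paths in a test. */
  def hdfsUri: String = fileSystem.getUri.toString

  /** Shuts the cluster down if it was started. */
  def stopHdfs(): Unit = if (cluster != null) cluster.shutdown()
}
```

A test or benchmark would typically call startHdfs() in its setup, write its input files under a path derived from hdfsUri, and call stopHdfs() in a finally block so the cluster is released even when the run fails.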

codecov-commenter · 2025-03-19T09:09:35Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.39%. Comparing base (f09f8af) to head (cb7d0b2).
Report is 96 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #1556      +/-   ##
============================================
+ Coverage     56.12%   58.39%   +2.26%     
- Complexity      976      977       +1     
============================================
  Files           119      122       +3     
  Lines         11743    12217     +474     
  Branches       2251     2280      +29     
============================================
+ Hits           6591     7134     +543     
+ Misses         4012     3951      -61     
+ Partials       1140     1132       -8

parthchandra · 2025-03-20T21:59:51Z

pom.xml


+      <dependency>
+        <groupId>org.apache.hadoop</groupId>
+        <artifactId>hadoop-client-minicluster</artifactId>


Do we need this dependency instead? https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-minicluster/3.3.4
Not sure what the difference is as long as both allow us to spin up a miniDFSCluster

My understanding is that hadoop-client-minicluster has fewer dependencies, and it depends on hadoop-client-runtime, which is a shaded Hadoop client (to avoid introducing dependency conflicts).

https://github.com/apache/hadoop/blob/trunk/hadoop-client-modules/hadoop-client-minicluster/pom.xml

hadoop-client-minicluster seems to be a fat jar of hadoop mini cluster, so is it more suitable as a dependency for testing?

Ah, I did not know that. It doesn't matter which one we use then.

parthchandra · 2025-03-20T22:04:27Z

spark/src/test/scala/org/apache/spark/sql/benchmark/CometReadBenchmark.scala

        sqlBenchmark.addCase("SQL Parquet - Comet") { _ =>
          withSQLConf(
            CometConf.COMET_ENABLED.key -> "true",
+            CometConf.COMET_EXEC_ENABLED.key -> "true",


Why is this needed?

CometBenchmarkBase sets COMET_EXEC_ENABLED to false by default, while native scan requires COMET_EXEC_ENABLED=true

https://github.com/apache/datafusion-comet/blob/main/spark/src/test/scala/org/apache/spark/sql/benchmark/CometBenchmarkBase.scala#L53-L54

datafusion-comet/spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala

Line 213 in 5224108

s"Full native scan disabled because ${COMET_EXEC_ENABLED.key} disabled"

Line 68 is indeed a redundant configuration; I will remove it.
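
To illustrate the point about defaults, here is a small self-contained model of how a per-case override supersedes a base default and is restored afterwards, which is the role withSQLConf plays in the benchmark. It does not use Spark or Comet. The object ConfOverrideSketch, its withConf helper, the nativeScanAllowed check, and the key spellings are all assumptions made for illustration.

```scala
import scala.collection.mutable

// Hypothetical model of a session configuration; not Spark's or Comet's API.
object ConfOverrideSketch {
  // Assumed key spellings; the base defaults mimic a benchmark base class
  // that leaves exec disabled.
  val CometEnabled = "spark.comet.enabled"
  val CometExecEnabled = "spark.comet.exec.enabled"

  val conf: mutable.Map[String, String] =
    mutable.Map(CometEnabled -> "false", CometExecEnabled -> "false")

  /** Applies the overrides, runs the body, then restores the previous values. */
  def withConf[T](pairs: (String, String)*)(body: => T): T = {
    val previous = pairs.map { case (k, _) => k -> conf.get(k) }
    pairs.foreach { case (k, v) => conf(k) = v }
    try body
    finally previous.foreach {
      case (k, Some(v)) => conf(k) = v
      case (k, None) => conf.remove(k)
    }
  }

  /** Models the rule from the thread: native scan needs both flags set to true. */
  def nativeScanAllowed: Boolean =
    conf(CometEnabled) == "true" && conf(CometExecEnabled) == "true"
}
```

In this model, enabling only the first flag inside withConf still leaves nativeScanAllowed false, which mirrors why the benchmark case has to set COMET_EXEC_ENABLED explicitly.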

kazuyukitanimura

My only remaining question is whether we should enable native exec for the microbenchmarks.

kazuyukitanimura · 2025-03-24T18:14:49Z

spark/src/test/scala/org/apache/spark/sql/benchmark/CometReadBenchmark.scala

        sqlBenchmark.addCase("SQL Parquet - Comet Native DataFusion") { _ =>
          withSQLConf(
            CometConf.COMET_ENABLED.key -> "true",
+            CometConf.COMET_EXEC_ENABLED.key -> "true",


I am not sure if we should enable COMET_EXEC_ENABLED as it will mix the scan benchmark and exec benchmark

It seems difficult to benchmark only the scan anyway. If we disable exec conversion, the plan may incur the overhead of ColumnarToRow.

kazuyukitanimura · 2025-03-25T22:08:36Z

Merged, thanks @wForget

wForget added 2 commits March 21, 2025 10:54

feat: introduce hadoop mini cluster to test native scan on hdfs

fc5d0d2

address comments

3aeea35

wForget force-pushed the hdfs-tet-2 branch from 0264e14 to 3aeea35 on March 21, 2025 03:06

fix style

cb7d0b2

kazuyukitanimura reviewed Mar 21, 2025

kazuyukitanimura approved these changes Mar 24, 2025

kazuyukitanimura merged commit 49fa287 into apache:main Mar 25, 2025
68 checks passed

5 participants
