feat: introduce hadoop mini cluster to test native scan on hdfs #1556

Merged
kazuyukitanimura merged 3 commits into apache:main from wForget:hdfs-tet-2 on Mar 25, 2025

@wForget (Member) commented Mar 19, 2025

Which issue does this PR close?

Closes #1515.

Rationale for this change

Test native scan on HDFS.

What changes are included in this PR?

Introduce a Hadoop mini cluster to test native scan on HDFS.

How are these changes tested?

Ran CometReadHdfsBenchmark successfully locally (tip: build the native library with HDFS enabled: cd native && cargo build --features hdfs)

  url = { workspace = true }
  parking_lot = "0.12.3"
- datafusion-comet-objectstore-hdfs = { path = "../hdfs", optional = true}
+ datafusion-comet-objectstore-hdfs = { path = "../hdfs", optional = true, default-features = false, features = ["hdfs"] }
Member Author

Disable try_spawn_blocking to avoid the native thread hanging.

@wForget any idea what causes the native thread to hang when try_spawn_blocking is used?

Member Author

> @wForget any idea what causes the native thread to hang when try_spawn_blocking is used?

Sorry, I didn’t investigate this issue more thoroughly.

import org.apache.hadoop.hdfs.client.HdfsClientConfigKeys
import org.apache.spark.internal.Logging

trait WithHdfsCluster extends Logging {
Member Author

Mostly copied from Kyuubi.

Contributor

Let's leave a comment noting that this was taken from Kyuubi.

Member Author

Thanks, added
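
For readers unfamiliar with the pattern discussed in this thread, the following is a minimal sketch of a trait that starts and stops an in-process HDFS cluster with Hadoop's MiniDFSCluster. It is not the WithHdfsCluster trait from this PR or the Kyuubi original; the trait name MiniHdfsSketch and its method names are hypothetical, and it assumes a Hadoop minicluster artifact is on the test classpath.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.hdfs.MiniDFSCluster

// Hypothetical sketch: the PR's real trait is WithHdfsCluster and may differ in detail.
trait MiniHdfsSketch {
  private var cluster: MiniDFSCluster = _

  /** Starts a single-DataNode HDFS cluster inside the current JVM. */
  def startHdfs(): Unit = {
    val conf = new Configuration()
    cluster = new MiniDFSCluster.Builder(conf).numDataNodes(1).build()
    cluster.waitActive() // block until the NameNode and DataNode are up
  }

  /** Stops the cluster and deletes its on-disk data directory. */
  def stopHdfs(): Unit = {
    if (cluster != null) {
      cluster.shutdown(true)
      cluster = null
    }
  }

  /** NameNode URI of the running cluster, usable as a path prefix in tests. */
  def hdfsUri: String = cluster.getURI.toString

  /** File system handle for writing test data into the mini cluster. */
  def hdfsFileSystem: FileSystem = cluster.getFileSystem
}
```

A benchmark or test suite could mix this in, call startHdfs() before writing Parquet files under hdfsUri, and call stopHdfs() in its teardown.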

codecov-commenter commented Mar 19, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 58.39%. Comparing base (f09f8af) to head (cb7d0b2).
Report is 96 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1556      +/-   ##
============================================
+ Coverage     56.12%   58.39%   +2.26%     
- Complexity      976      977       +1     
============================================
  Files           119      122       +3     
  Lines         11743    12217     +474     
  Branches       2251     2280      +29     
============================================
+ Hits           6591     7134     +543     
+ Misses         4012     3951      -61     
+ Partials       1140     1132       -8     


<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client-minicluster</artifactId>
Contributor

Do we need this dependency instead? https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-minicluster/3.3.4
I am not sure what the difference is, as long as both allow us to spin up a MiniDFSCluster.

Member Author

@wForget, Mar 21, 2025

My understanding is that hadoop-client-minicluster has fewer dependencies, and that it depends on hadoop-client-runtime, which is a shaded Hadoop client (to avoid introducing conflicts).

Member Author

https://github.com/apache/hadoop/blob/trunk/hadoop-client-modules/hadoop-client-minicluster/pom.xml

hadoop-client-minicluster seems to be a fat jar of the Hadoop mini cluster, so is it more suitable as a test dependency?

Contributor

Ah, I did not know that. It doesn't matter which one we use then.

sqlBenchmark.addCase("SQL Parquet - Comet") { _ =>
  withSQLConf(
    CometConf.COMET_ENABLED.key -> "true",
    CometConf.COMET_EXEC_ENABLED.key -> "true",
Contributor

Why is this needed?

Member Author

CometBenchmarkBase sets COMET_EXEC_ENABLED to false by default, while native scan requires COMET_EXEC_ENABLED=true

https://github.com/apache/datafusion-comet/blob/main/spark/src/test/scala/org/apache/spark/sql/benchmark/CometBenchmarkBase.scala#L53-L54

s"Full native scan disabled because ${COMET_EXEC_ENABLED.key} disabled"

Member Author

Line 68 is indeed a redundant configuration; I will remove it.
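
To illustrate why the per-case override matters, here is a small self-contained sketch of a scoped configuration override: a default of "false" stays in effect except inside the block that explicitly sets "true", and the previous value is restored afterwards. This is not Spark's withSQLConf; the object ScopedConfSketch, its mutable map, and the helper withConf are hypothetical stand-ins, and the key string is assumed to be the one behind CometConf.COMET_EXEC_ENABLED.

```scala
import scala.collection.mutable

object ScopedConfSketch {
  // Assumed key name; the default mirrors the reply above, where the
  // benchmark base class disables Comet exec.
  val ExecEnabledKey = "spark.comet.exec.enabled"

  // Hypothetical stand-in for a session's SQL configuration.
  val conf: mutable.Map[String, String] = mutable.Map(ExecEnabledKey -> "false")

  /** Applies `pairs` while `body` runs, then restores the previous values. */
  def withConf[T](pairs: (String, String)*)(body: => T): T = {
    // Remember the old value (or its absence) for every key we touch.
    val previous = pairs.map { case (k, _) => k -> conf.get(k) }
    pairs.foreach { case (k, v) => conf(k) = v }
    try body
    finally previous.foreach {
      case (k, Some(old)) => conf(k) = old
      case (k, None) => conf.remove(k)
    }
  }
}
```

In this sketch, a benchmark case wrapped in withConf(ExecEnabledKey -> "true") sees the override, while code outside the block still sees the base default.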

Contributor

@kazuyukitanimura left a comment

My only remaining question is whether we should enable native exec for the microbenchmarks.

sqlBenchmark.addCase("SQL Parquet - Comet Native DataFusion") { _ =>
  withSQLConf(
    CometConf.COMET_ENABLED.key -> "true",
    CometConf.COMET_EXEC_ENABLED.key -> "true",
Contributor

I am not sure if we should enable COMET_EXEC_ENABLED, as it will mix the scan benchmark and the exec benchmark.

Member Author

> I am not sure if we should enable COMET_EXEC_ENABLED as it will mix the scan benchmark and exec benchmark

It seems difficult to benchmark only the scan anyway. If we disable exec conversion, it may introduce the performance cost of ColumnarToRow.

@kazuyukitanimura kazuyukitanimura merged commit 49fa287 into apache:main Mar 25, 2025
68 checks passed
@kazuyukitanimura (Contributor)

Merged, thanks @wForget

coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025
Development

Successfully merging this pull request may close these issues.

Enable hdfs test(s) in ci

5 participants