feat: Add Hash Join benchmarks by jonathanc-n · Pull Request #17636 · apache/datafusion

jonathanc-n · 2025-09-17T23:49:01Z

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

jonathanc-n · 2025-09-17T23:49:09Z

cc @2010YOUY01

benchmarks/README.md

adriangb · 2025-09-19T01:21:15Z

benchmarks/src/hj.rs

+    // equality on key + cheap filter to downselect
+    r#"
+        SELECT t1.value, t2.value
+        FROM range(10000) AS t1


I feel it'd be better to generate actual Parquet files. That's a more realistic scenario and would allow testing things like sideways information passing.

jonathanc-n · Sep 19, 2025

I was copying over the format and some docs from the mini benchmark for NLJ. The smaller dfbenches were originally added just to test the pure execution of the hash join for specific comparisons (e.g. perfect hash join, testing new hash algorithms). We could allow benchmarks with data from both range() and Parquet files. If so, we should do a follow-up to add Parquet files for both the HJ and NLJ benches. WDYT @2010YOUY01?

2010YOUY01 · 2025-09-19T05:03:17Z

The scope of this micro-benchmark is indeed limited, but I think it’s better to keep the focus on the join executor and eliminate the cost of Parquet reading. Perhaps we could add some join-focused queries by extending the TPCH or JOB benchmark for realistic scenarios.

adriangb · Sep 19, 2025

Okay, sounds good, let's proceed as is.

2010YOUY01

Thank you, it looks good in general.

For extra bench coverage, I think we can enumerate some workloads (perhaps just modify some existing queries) with additional join filters, given they're also evaluated inside HashJoin executor:
ON (t1.value = t2.value) AND (<expr with col in both t1 and t2>)
e.g.
ON (t1.value = t2.value) AND ((t1.value+t2.value)%10 = 0)
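
To make the suggestion concrete, here is a minimal sketch of what an equality key plus an extra join filter means for a hash join: the equality part drives the hash table lookup, and the remaining predicate is checked on each candidate pair during probing. This is a toy illustration, not DataFusion's HashJoin executor; the names `hash_join_with_filter` and `filtered_match_count` are hypothetical, and it assumes both inputs hold the integers 0..n, in the spirit of the `range(10000)` inputs in the benchmark queries.

```rust
use std::collections::HashMap;

/// Toy inner hash join on a single integer key with an extra (residual)
/// join filter. Hypothetical illustration only; not DataFusion code.
fn hash_join_with_filter<F>(build: &[i64], probe: &[i64], filter: F) -> Vec<(i64, i64)>
where
    F: Fn(i64, i64) -> bool,
{
    // Build phase: map each key to the indices of the build-side rows holding it.
    let mut table: HashMap<i64, Vec<usize>> = HashMap::new();
    for (idx, &key) in build.iter().enumerate() {
        table.entry(key).or_default().push(idx);
    }

    // Probe phase: find equality matches through the hash table, then
    // apply the residual filter to each candidate pair before emitting it.
    let mut out = Vec::new();
    for &p in probe {
        if let Some(rows) = table.get(&p) {
            for &idx in rows {
                let b = build[idx];
                if filter(b, p) {
                    out.push((b, p));
                }
            }
        }
    }
    out
}

/// Mirrors `ON (t1.value = t2.value) AND ((t1.value + t2.value) % 10 = 0)`
/// for two inputs that both hold 0..n (an assumption made for this sketch).
fn filtered_match_count(n: i64) -> usize {
    let t1: Vec<i64> = (0..n).collect();
    let t2: Vec<i64> = (0..n).collect();
    hash_join_with_filter(&t1, &t2, |a, b| (a + b) % 10 == 0).len()
}
```

Under these assumptions, calling `filtered_match_count(10000)` counts the pairs that survive both conditions. Since the keys are equal, the filter reduces to `2 * value % 10 = 0`, so only multiples of 5 pass and the extra predicate acts as a 20% downselect on top of the equality match.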

jonathanc-n · 2025-09-21T04:35:42Z

I added the new queries and fixed the query. Should be good to go!

alamb · 2025-09-25T15:32:56Z

Shall we merge this one? Or are we still waiting to address comments?

jonathanc-n · 2025-09-26T04:33:52Z

I have been busy; I'll try my best to finish this up tomorrow.

jonathanc-n · 2025-09-28T23:10:06Z

@2010YOUY01 I have added the remaining changes. I did not change q3 to use generate_series(0, 9000, 1000) because this would not select any rows.

2010YOUY01 · 2025-09-29T09:41:59Z

Thanks again! An error was introduced in the latest merge of main: ./bench.sh run hj was not able to run. I fixed it in 41c3cdb.

alamb · 2025-09-29T17:17:49Z

🚀

feat: Add Hash Join benchmarks

64f66d9

jonathanc-n mentioned this pull request Sep 17, 2025

[EPIC]: Perfect Hash Join #17635

Closed

fmt

a8f3f72

adriangb reviewed Sep 19, 2025

Update benchmarks/README.md

06e4c63

Co-authored-by: Adrian Garcia Badaracco <[email protected]>

2010YOUY01 approved these changes Sep 19, 2025

Merge branch 'main' into add-hj-benchmarks

ed5dfd4

jonathanc-n added 2 commits September 21, 2025 00:35

add benchmarks

1ff500c

Merge branch 'add-hj-benchmarks' of https://github.com/jonathanc-n/da…

86e01cf

…tafusion into add-hj-benchmarks

2010YOUY01 reviewed Sep 21, 2025

2010YOUY01 reviewed Sep 21, 2025

jonathanc-n added 2 commits September 26, 2025 18:36

Merge branch 'main' into add-hj-benchmarks

1a96386

update selectivities

eb7173d

fix the error introduced when merging main

41c3cdb

2010YOUY01 added this pull request to the merge queue Sep 29, 2025

Merged via the queue into apache:main with commit cc157b8 Sep 29, 2025
28 checks passed

4 participants