feat: Add Hash Join benchmarks #17636

Merged
2010YOUY01 merged 9 commits into apache:main from jonathanc-n:add-hj-benchmarks
Sep 29, 2025

@jonathanc-n
Contributor

Which issue does this PR close?

  • Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@jonathanc-n
Contributor Author

cc @2010YOUY01

// equality on key + cheap filter to downselect
r#"
SELECT t1.value, t2.value
FROM range(10000) AS t1
Contributor

I feel it'd be better to generate actual parquet files. That's a more realistic scenario and would allow testing things like sideways information passing.

Contributor Author

I was copying over the format and some docs from the mini benchmark for nlj. The smaller dfbenches were originally added just to test the pure execution of the hash join for specific comparisons (e.g. perfect hash join, testing new hash algorithms, etc.). We could allow benchmarks with data from both range() and parquet files. If so, we should do a follow-up to add parquet files for both the HJ and NLJ benches. WDYT @2010YOUY01

Contributor

The scope of this micro-benchmark is indeed limited, but I think it’s better to keep the focus on the join executor and eliminate the cost of Parquet reading. Perhaps we could add some join-focused queries by extending the TPCH or JOB benchmark for realistic scenarios.

Contributor

Okay sounds good, let's proceed as is

Co-authored-by: Adrian Garcia Badaracco <[email protected]>
Contributor

@2010YOUY01 left a comment

Thank you, it looks good in general.

For extra bench coverage, I think we can enumerate some workloads (perhaps just by modifying some existing queries) with additional join filters, given that they're also evaluated inside the HashJoin executor:
ON (t1.value = t2.value) AND (<expr with col in both t1 and t2>)
e.g.
ON (t1.value = t2.value) AND ((t1.value+t2.value)%10 = 0)
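To make the selectivity of that example filter concrete, here is a rough sketch, not the code added in this PR. It assumes range(10000) yields the integers 0 through 9999, and the `JOIN` clause, the `QUERY` constant, and the `count_join_rows` helper are all hypothetical names introduced for illustration. The toy hash join below applies the equality on the key first and then the residual filter, which is the shape of work this kind of query asks of the join executor.

```rust
use std::collections::HashMap;

// Hypothetical benchmark-style query. `range(10000)` and the `value` column
// follow the snippet under review; the JOIN clause is an assumption.
const QUERY: &str = r#"
    SELECT t1.value, t2.value
    FROM range(10000) AS t1
    JOIN range(10000) AS t2
      ON (t1.value = t2.value) AND ((t1.value + t2.value) % 10 = 0)
"#;

/// Toy hash join over two integer ranges 0..n (illustrative only).
/// Builds a hash table on one side, probes with the other, and applies the
/// residual join filter to every key-matched pair. Returns the output row count.
fn count_join_rows(n: i64) -> usize {
    // Build phase: key -> build-side rows sharing that key.
    let mut build: HashMap<i64, Vec<i64>> = HashMap::new();
    for v in 0..n {
        build.entry(v).or_default().push(v);
    }

    // Probe phase: equality on the key, then the extra join filter.
    let mut rows = 0;
    for probe in 0..n {
        if let Some(matches) = build.get(&probe) {
            for &b in matches {
                if (probe + b) % 10 == 0 {
                    rows += 1;
                }
            }
        }
    }
    rows
}

fn main() {
    println!("{}", QUERY.trim());
    println!("rows passing the join filter: {}", count_join_rows(10_000));
}
```

Because the keys are equal on every matched pair, the filter reduces to `2 * value % 10 = 0`, so only values divisible by 5 survive. Under the assumption above, that keeps roughly one in five of the key-matched rows.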

@jonathanc-n
Contributor Author

I added the new queries and fixed the query. Should be good to go!

@alamb
Contributor

alamb commented Sep 25, 2025

Shall we merge this one? Or are we still waiting to address comments?

@jonathanc-n
Contributor Author

I have been busy; I'll try my best to finish this up tomorrow.

@jonathanc-n
Contributor Author

@2010YOUY01 I have added the remaining changes. I did not change q3 to use generate_series 0,9000, 1000 because this would not select any rows.

@2010YOUY01
Contributor

> @2010YOUY01 I have added the remaining changes. I did not change q3 to use generate_series 0,9000, 1000 because this would not select any rows.

Thanks again! An error was introduced in the latest merge from main, so ./bench.sh run hj was not able to run. I fixed it in 41c3cdb.

@2010YOUY01 added this pull request to the merge queue Sep 29, 2025
Merged via the queue into apache:main with commit cc157b8 Sep 29, 2025
28 checks passed
@alamb
Contributor

alamb commented Sep 29, 2025

🚀

4 participants