feat: Add Hash Join benchmarks #17636
cc @2010YOUY01
benchmarks/src/hj.rs
Outdated
// equality on key + cheap filter to downselect
r#"
SELECT t1.value, t2.value
FROM range(10000) AS t1
I feel it'd be better to generate actual Parquet files. That's a more realistic scenario and would allow testing things like sideways information passing.
I was copying over the format and some docs from the mini benchmark for NLJ. The smaller dfbench benchmarks were originally added just to test pure hash join execution for specific comparisons (e.g. perfect hash join, testing new hash algorithms). We could allow benchmarks with data from both range() and Parquet files. If so, we should do a follow-up that adds Parquet files for both the HJ and NLJ benches. WDYT @2010YOUY01?
The scope of this micro-benchmark is indeed limited, but I think it’s better to keep the focus on the join executor and eliminate the cost of Parquet reading. Perhaps we could add some join-focused queries by extending the TPCH or JOB benchmark for realistic scenarios.
Okay, sounds good. Let's proceed as is.
Co-authored-by: Adrian Garcia Badaracco <[email protected]>
2010YOUY01 left a comment
Thank you, it looks good in general.
For extra bench coverage, I think we can enumerate some workloads (perhaps just by modifying some existing queries) with additional join filters, since they're also evaluated inside the HashJoin executor:
ON (t1.value = t2.value) AND (<expr with col in both t1 and t2>)
e.g.
ON (t1.value = t2.value) AND ((t1.value+t2.value)%10 = 0)
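To illustrate why such a filter exercises the join itself, here is a minimal standalone sketch of a hash join that applies a residual predicate to each equality match. It is not DataFusion's implementation: the function name `hash_join_with_filter`, the plain `i64` slices standing in for `t1.value` and `t2.value`, and the 20-row inputs are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Toy in-memory hash join (hypothetical helper, not a DataFusion API).
/// Rows match on key equality first; the residual `filter` then sees the
/// values from both sides, like `(t1.value + t2.value) % 10 = 0`.
fn hash_join_with_filter(
    build: &[i64],
    probe: &[i64],
    filter: impl Fn(i64, i64) -> bool,
) -> Vec<(i64, i64)> {
    // Build phase: map each key to the build-side row indices that carry it.
    let mut table: HashMap<i64, Vec<usize>> = HashMap::new();
    for (idx, &key) in build.iter().enumerate() {
        table.entry(key).or_default().push(idx);
    }

    // Probe phase: look up equal keys, then evaluate the residual filter
    // on every candidate (build row, probe row) pair.
    let mut out = Vec::new();
    for &p in probe {
        if let Some(rows) = table.get(&p) {
            for &idx in rows {
                let b = build[idx];
                if filter(b, p) {
                    out.push((b, p));
                }
            }
        }
    }
    out
}

fn main() {
    // Small stand-ins for range(...) inputs; sizes are illustrative only.
    let t1: Vec<i64> = (0..20).collect();
    let t2: Vec<i64> = (0..20).collect();

    // ON (t1.value = t2.value) AND ((t1.value + t2.value) % 10 = 0)
    let rows = hash_join_with_filter(&t1, &t2, |a, b| (a + b) % 10 == 0);
    println!("{:?}", rows);
}
```

Because the predicate reads columns from both inputs, it cannot be pushed below the join and runs once per equality match, so its cost shows up in the join's own timing.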
I added the new queries and fixed the query. Should be good to go!
Shall we merge this one? Or are we still waiting to address comments?
I have been busy; I'll try my best to finish this up tomorrow.
@2010YOUY01 I have added the remaining changes. I did not change q3 to use generate_series(0, 9000, 1000) because this would not select any rows.
Thanks again! There are some errors introduced by the latest merge of main.
🚀