Closed
Labels
enhancement (New feature or request), performance (Make DataFusion faster)
Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Further optimize the hash join algorithm
Describe the solution you'd like
There are a couple of optimizations we could implement:
- Vectorize the row-equality check, which currently uses the `equal_rows` functions. We should be able to speed this up by vectorizing it, and also specialize it for non-null batches. We can probably use the `take` and `eq` kernels here.
- Don't use a `HashMap` but a `Vec` (or similar) with a certain number of buckets (proportional to the number of rows or the expected number of keys on the left side). I tried this, but as it causes many more collisions than we currently have, it causes a big (3x) slowdown, so vectorizing the collision check is a prerequisite.
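To make the two ideas concrete, here is a minimal sketch of how they could fit together. It is not DataFusion code: it uses plain slices of `Option<i64>` instead of Arrow arrays, the `take` and `eq` functions are simplified stand-ins for the Arrow kernels of the same name, and `BucketTable`, `hash`, and `inner_join` are hypothetical names chosen for illustration. The probe phase only collects candidate row pairs per bucket; false positives from collisions are then filtered out in one batched equality pass rather than row by row.

```rust
/// Stand-in for the Arrow `take` kernel: gathers `values[i]` for each index.
fn take(values: &[Option<i64>], indices: &[usize]) -> Vec<Option<i64>> {
    indices.iter().map(|&i| values[i]).collect()
}

/// Stand-in for the Arrow `eq` kernel: element-wise equality.
/// A null on either side never counts as a match (join semantics).
fn eq(left: &[Option<i64>], right: &[Option<i64>]) -> Vec<bool> {
    left.iter()
        .zip(right)
        .map(|(l, r)| matches!((l, r), (Some(a), Some(b)) if a == b))
        .collect()
}

/// Simple deterministic multiplicative hash (an assumption for this sketch).
fn hash(key: i64) -> u64 {
    (key as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

/// Hypothetical `Vec`-based table: each bucket holds build-side row indices.
/// Different keys may share a bucket, so lookups return candidates only.
struct BucketTable {
    buckets: Vec<Vec<usize>>,
}

impl BucketTable {
    fn build(keys: &[Option<i64>], num_buckets: usize) -> Self {
        let mut buckets = vec![Vec::new(); num_buckets];
        for (row, key) in keys.iter().enumerate() {
            // Null keys can never match, so they are not inserted.
            if let Some(k) = key {
                buckets[(hash(*k) % num_buckets as u64) as usize].push(row);
            }
        }
        Self { buckets }
    }

    fn candidates(&self, key: i64) -> &[usize] {
        &self.buckets[(hash(key) % self.buckets.len() as u64) as usize]
    }
}

/// Returns matching (left_row, right_row) pairs for an inner equi-join.
fn inner_join(
    left: &[Option<i64>],
    right: &[Option<i64>],
    num_buckets: usize,
) -> Vec<(usize, usize)> {
    let table = BucketTable::build(left, num_buckets);

    // Phase 1: collect candidate pairs using the bucket index only.
    let mut left_idx = Vec::new();
    let mut right_idx = Vec::new();
    for (r, key) in right.iter().enumerate() {
        if let Some(k) = key {
            for &l in table.candidates(*k) {
                left_idx.push(l);
                right_idx.push(r);
            }
        }
    }

    // Phase 2: resolve collisions with one batched `take` + `eq` pass.
    let mask = eq(&take(left, &left_idx), &take(right, &right_idx));
    left_idx
        .into_iter()
        .zip(right_idx)
        .zip(mask)
        .filter(|(_, is_match)| *is_match)
        .map(|(pair, _)| pair)
        .collect()
}
```

Calling `inner_join(&left, &right, n)` with a small `n` forces many collisions, which is the case the second bullet describes: the result stays the same, but more candidate pairs flow through the batched equality check. A non-null specialization could take `&[i64]` directly and skip the `Option` handling in `eq`.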
Additional context
https://www.cockroachlabs.com/blog/vectorized-hash-joiner/
https://dare.uva.nl/search?identifier=5ccbb60a-38b8-4eeb-858a-e7735dd37487