Skip to content

Vectorize hash join collision check #50

@Dandandan

Description

@Dandandan

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Further optimize the hash join algorithm

Describe the solution you'd like
There are a couple of optimizations we could implement:

  • Vectorize the row-equality check which now uses the equal_rows functions. We should be able to speed this up by vectorizing this, and also specialize it for handling non-null batches too. We probably can utilize the kernels take and eq here.
  • Don't use a Hashmap but a Vec (or similar) with a certain amount of buckets (proportional to the number of rows or the expected number of keys in the left side). I tried this, but as it causes much more collisions than we have currently, it causes a big (3x) slowdown, so vectorizing the collision check is a prerequisite.

Additional context

https://www.cockroachlabs.com/blog/vectorized-hash-joiner/
https://dare.uva.nl/search?identifier=5ccbb60a-38b8-4eeb-858a-e7735dd37487

Metadata

Metadata

Assignees

Labels

enhancementNew feature or requestperformanceMake DataFusion faster

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions