Assorted ideas to slightly improve JOINs

Only ideas that are rather easy to implement.

1. Add `IColumn::shrinkToFit` method. It will remove overallocation of columns and save memory (for hash join) up to 2x. Use this method in `HashJoin` unconditionally.

2. For CROSS JOIN (nested loops): compress blocks in memory if there is large amount of data. Uncompress while joining (repeatedly for every iteration of the outer loop). The implementation is very easy after #20168.

3. For CROSS JOIN (nested loops): write blocks to tmp directory in Native format (similar to external sorting and external aggregation) if there is large amount of data. Read many times while joining.

4. Compress blocks for hash join in memory. While joining, maintain LRU cache of uncompressed blocks. Can work good if JOIN is skewed, otherwise questionable.

5. If the amount of data is large, serialize all records on disk in `RowBinary` format and keep offsets in hash table (we will have 8 bytes per record + key size + hash table overhead instead of keeping all data). While joining, do batch reads with AIO and also maintain small LRU hash table in memory. The performance can be decent (1 million IOPS on modern SSD).

6. If the amount of data is large, replace HashJoin to SSDCacheDictionary (the performance of SSDCacheDictionary assumed to be decent).

7. Represent the data structure for right hand side of JOIN or IN as a table for key-value requests. When doing distributed JOIN, instead of usual (broadcast or shuffle) algorithms, do lookup requests over network. Right hand side of distributed JOIN can be represented as a special kind of distributed table with local cache of lookup results. Applicability is limited but can be good for some scenarios: large rhs table but small subset of keys are JOINed.

8. For INNER and RIGHT JOIN try to use the set of keys of rhs table as an index for lhs table, similar to IN.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Assorted ideas to slightly improve JOINs #21047

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Assorted ideas to slightly improve JOINs #21047

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions