"OR" operator in "ON" section for join by ilejn · Pull Request #21320 · ClickHouse/ClickHouse

ilejn · 2021-02-28T20:54:53Z

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Queries with JOIN ON support OR

Detailed description / Documentation draft:

...

By adding documentation, you'll allow users to try your new feature immediately, not when someone else will have time to document it later. Documentation is necessary for all features that affect user experience in any way. You can add brief documentation draft above, or add documentation right into your patch as Markdown files in docs folder.

If you are doing this for the first time, it's recommended to read the lightweight Contributing to ClickHouse Documentation guide first.

Information about CI checks: https://clickhouse.tech/docs/en/development/continuous-integration/

ilejn · 2021-02-28T21:00:11Z

This is Work In Progress, not ready to merge, too early to run tests.
Implements #17612

Current state

Plenty of debug messages, comments, and other clutter remain; it is too early to remove them, sorry.
Stateless tests fail after merging with recent master (LowCardinality is problematic); they worked before, and I hope to fix this soon.
Issues with FULL/RIGHT join (should not be serious).
The deduplication engine is not suitable for production (see unordered_set known_rows); the problem is rows for which more than one disjunct is true. I plan to use the newly added HashTable::offsetInternal to make it O(1).
Extra loops in joinRightColumns; I plan to have two versions, one for a single map and one for several maps.
One join method for all disjuncts (fallback to Type::hashed). The only reason is the intention to have all functions pre-instantiated, but I think it is OK.

I would appreciate suggestions.

akuzm · 2021-03-05T01:09:40Z

@vdimir Maybe you would be interested in this.

vdimir · 2021-06-15T14:20:40Z

@ilejn could you please give an update on the current status, any difficulties, and plans?

Also, feel free to ask for help.

ilejn · 2021-06-15T14:53:29Z

> @ilejn could you please give an update on the current status, any difficulties, and plans?
>
> Also, feel free to ask for help.

@vdimir , thank you for your care.

I do not see test failures related to the change (let me know if I am wrong), so I plan to

remove most of Traces
remove auxiliary comments and other garbage
create a description of what is done (and what is not)
remove WIP marker
ask for review

hopefully this week.

ilejn · 2021-06-17T08:25:21Z

HashJoin::RightTableData::maps is now std::vector<MapsVariant>, not just a single MapsVariant. All MapsVariant elements have the same type: it is possible to have several maps for, e.g., Type::String or for Type::UInt32, but not a mix of the two.
If the types differ, everything is converted to Type::hashed.

To prevent duplicates when more than one disjunct is true, KnownRowsHolder is introduced; it relies on the uniqueness of std::pair<const Block*, DB::RowRef::SizeT>.

The small optimization in TableJoin::getRequiredRightKeys is turned off when there are multiple disjuncts, because the idea of "using the value from the left table because of equality" does not work well.
DNF::process is called in TreeRewriter::analyzeSelect to convert the condition to disjunctive normal form.
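
A minimal sketch of the idea described in this comment, not the actual HashJoin code: one hash map per disjunct, each left row probes every map in turn, and a set of already emitted right rows suppresses duplicates when several disjuncts are true for the same pair. The names Row, DisjunctMap, buildMaps, and joinLeftRow are hypothetical, the condition is hard-coded as `l.a = r.a OR l.b = r.b`, and a plain row index stands in for std::pair<const Block*, DB::RowRef::SizeT>.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <unordered_map>
#include <vector>

// Hypothetical two-column row; stands in for a row of a table block.
struct Row
{
    uint32_t a;
    uint32_t b;
};

// One map per disjunct: key value -> indices of right rows with that key.
using DisjunctMap = std::unordered_multimap<uint32_t, size_t>;

// Build maps for "ON l.a = r.a OR l.b = r.b": maps[0] is keyed by a, maps[1] by b.
std::vector<DisjunctMap> buildMaps(const std::vector<Row> & right)
{
    std::vector<DisjunctMap> maps(2);
    for (size_t i = 0; i < right.size(); ++i)
    {
        maps[0].emplace(right[i].a, i);
        maps[1].emplace(right[i].b, i);
    }
    return maps;
}

// Return indices of right rows matching one left row, each at most once.
std::vector<size_t> joinLeftRow(const Row & left, const std::vector<DisjunctMap> & maps)
{
    std::set<size_t> known_rows; // right rows already emitted for this left row
    std::vector<size_t> result;
    const uint32_t keys[2] = {left.a, left.b};
    for (size_t d = 0; d < maps.size(); ++d)
    {
        auto [begin, end] = maps[d].equal_range(keys[d]);
        for (auto it = begin; it != end; ++it)
            if (known_rows.insert(it->second).second) // true only on first sight
                result.push_back(it->second);
    }
    return result;
}
```

With right rows {1,10}, {1,20}, {2,10}, {3,30} and a left row {1,10}, the first right row satisfies both disjuncts but is returned only once.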

ilejn · 2021-07-05T08:23:19Z

This is still not perfect, but the tests pass, so I think we can start reviewing and discussing it.
Another reason: ongoing development in master makes merges more and more painful.

vdimir

Thank you for impressive work!

I left some comments to discuss. I may also dig a bit deeper into the code (especially HashJoin.cpp) and add something more, but at first look it's good.

My main points are:

It would be good to add comments to some tricky places (like DNF, joinRightColumns, KnownRowsHolder).
Let's try to introduce a structure that represents one join key (which can be compound: several expressions connected with AND, the same as a disjunct) and work with a vector of it, not a vector of vectors. I suppose it will make the code more readable.
We may need to add some corner cases to the tests, such as cases that are expected to return an error, tests for Storage/Dict join, and test cases with NULLs.

Also, while reading the code I made minor style changes; you can cherry-pick or merge the commits from my branch https://github.com/vdimir/ClickHouse/tree/ADQM-138

vdimir · 2021-07-06T13:53:05Z

src/Interpreters/HashJoin.cpp

Can we omit it, since it is required only for optimization, as you said in the comment? Will everything work correctly without it?

The original idea is that if the join condition is ON T1.A = T2.A, it is pointless to keep both T1.A and T2.A in memory. If a row exists in the result set, we can use the value of T1.A instead of T2.A.
This does not work if we have OR (in T1.A = T2.A OR T1.B = T2.B it is not clear which equality is true), so we need all columns.
The original behavior is preserved because the goal was a zero-cost enhancement, although I think that from a functional point of view it is OK to get rid of this optimization completely.
Let me know if I should test whether that is really true.
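
The argument can be illustrated with a small sketch (hypothetical Row type and function names, not ClickHouse code): with a single equality every matched pair has equal A values, so the left value can stand in for the right one, while with OR a pair may match through B alone.

```cpp
#include <cassert>

// Hypothetical row with two key columns A and B.
struct Row
{
    int a;
    int b;
};

// ON T1.A = T2.A
bool matchesSingleKey(const Row & t1, const Row & t2)
{
    return t1.a == t2.a;
}

// ON T1.A = T2.A OR T1.B = T2.B
bool matchesWithOr(const Row & t1, const Row & t2)
{
    return t1.a == t2.a || t1.b == t2.b;
}

// Reusing T1.A in place of T2.A is only safe when the two values are equal.
bool canReuseLeftA(const Row & t1, const Row & t2)
{
    return t1.a == t2.a;
}
```

For t1 = {1, 5} and t2 = {2, 5} the OR condition matches through B, yet T2.A (2) differs from T1.A (1), so the right column has to be kept.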

vdimir · 2021-07-13T09:44:45Z

src/Interpreters/CollectJoinOnKeysVisitor.cpp

Why raw pointers and not ASTPtr?

Will be fixed.

The excuse is "why bother using ASTPtr since we use raw pointers to identify nodes anyway"?
So, it makes sense to avoid using raw pointers e.g. in TableJoin::addDisjunct, but I have no idea how.

vdimir · 2021-07-13T09:51:30Z

src/Interpreters/TableJoin.cpp

What is the purpose of this method? We do not actually add the AST passed to the function anywhere, which is confusing.

The purpose of this method is to create a new disjunct if the node passed to the function is known to be a child of a previously discovered OR.
Notes: (1) the tree is already converted to DNF at this point, (2) OR can have more than two children.

Imagine that we have a tree
A
|------B
|------C
|------D
where A is OR and B, C, and D are its children.
Traversing the tree, we go top to bottom, and we call setDisjuncts when A is hit to memorize B, C, and D.
When B is hit, we actually create a new disjunct.
Same for C.
Same for D.

I'll think about how to simplify the algorithm.
It seems that the function names are not perfect, and I am open to suggestions.
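
The traversal described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code from TableJoin or CollectJoinOnKeysVisitor: Node, makeNode, and DisjunctCollector are hypothetical, only the method names setDisjuncts and addDisjunct are borrowed from the discussion, and the tree is assumed to be in flattened DNF already (an OR at the top whose children are ANDs or single equalities).

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical minimal AST node: "or", "and", or a leaf such as "t1.a = t2.a".
struct Node
{
    std::string name;
    std::vector<std::shared_ptr<Node>> children;
};
using NodePtr = std::shared_ptr<Node>;

NodePtr makeNode(std::string name, std::vector<NodePtr> children = {})
{
    return std::make_shared<Node>(Node{std::move(name), std::move(children)});
}

struct DisjunctCollector
{
    // Children of OR nodes seen so far, identified by address.
    std::unordered_set<const Node *> or_children;
    // Each disjunct is the list of equality leaves it contains.
    std::vector<std::vector<std::string>> disjuncts;

    // Called when an OR node is hit: memorize its children.
    void setDisjuncts(const Node & or_node)
    {
        for (const auto & child : or_node.children)
            or_children.insert(child.get());
    }

    // Called for every node: start a new disjunct if the node is a known OR child.
    void addDisjunct(const Node & node)
    {
        if (or_children.count(&node))
            disjuncts.emplace_back();
    }

    // Top-to-bottom traversal.
    void visit(const Node & node)
    {
        addDisjunct(node);
        if (node.name == "or")
            setDisjuncts(node);
        if (node.children.empty())
        {
            if (disjuncts.empty())
                disjuncts.emplace_back(); // no OR at all: a single disjunct
            disjuncts.back().push_back(node.name);
        }
        for (const auto & child : node.children)
            visit(*child);
    }
};
```

For an OR with three children, one of which is an AND of two equalities, visit produces three disjuncts, the first holding both equalities of the AND.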

vdimir · 2021-07-13T10:17:21Z

src/Interpreters/TableJoin.h

Should key_names_right.size() == 1 hold when this method is called? Maybe add an assert or a check with LOGICAL_ERROR?

vdimir · 2021-07-13T10:31:50Z

src/Interpreters/TreeRewriter.cpp

Why do we check function->children.size()?

As I understand from ASTFunction::clone, a function can have up to 3 children: arguments, parameters, and something related to window functions (not our case).

ClickHouse/src/Parsers/ASTFunction.cpp

Lines 101 to 116 in fc783be

    ASTPtr ASTFunction::clone() const
    {
        auto res = std::make_shared<ASTFunction>(*this);
        res->children.clear();

        if (arguments) { res->arguments = arguments->clone(); res->children.push_back(res->arguments); }
        if (parameters) { res->parameters = parameters->clone(); res->children.push_back(res->parameters); }

        if (window_definition)
        {
            res->window_definition = window_definition->clone();
            res->children.push_back(res->window_definition);
        }

        return res;
    }

Is it correct to say that we deal with or/and functions here, which always have only one child? Do we need to throw an error if a function with more than one child is met?

I think the assumption "we need to throw an error if a function with more than one child is met" is correct. Well spotted!
I planned to do it but did not dare.
We need this check to survive garbage requests (e.g. generated by the fuzzer), for example an OR without any children at all.
I did not add an error because the problem is not new: it is reproducible with ANDs, and ClickHouse somehow processes such queries without complaint.
So, yes, I confirm that we have a problem here.

vdimir · 2021-07-13T11:52:44Z

src/Interpreters/HashJoin.cpp

If we use Type::hashed in the case of multiple disjuncts, can we create a separate branch for it here?

I am not sure I understand why that is helpful.
Is the intention to avoid vectors when they are not needed?

vdimir · 2021-07-13T11:55:46Z

src/Interpreters/HashJoin.cpp

Suggested change

-template<>
-class KnownRowsHolder<true>
+/// Used to prevent duplication if more than one disjunct is true
+template<>
+class KnownRowsHolder<true>

vdimir · 2021-07-13T11:56:52Z

src/Interpreters/HashJoin.cpp

Does it mean that we can have at most 16 disjuncts? I didn't see a check for it; maybe I missed it.

Not at all.
We memorize rows from the left (I think) side of the join to avoid duplicates. If we have "true OR true OR true" for a particular pair of rows, we do not want to have this pair three times.
"16" is the threshold at which we switch from an array to a set.
There is a test that covers it.

Do we actually need this dynamic switching? Does it really affect performance?

Frankly speaking, I do not know.
If all conditions (disjuncts) produce a multitude of matches for some rows, using a set makes a lot of sense. Imagine we get a million matches from the first disjunct and the same number from the second one.
If the number of matches is always small, a set will probably be slower than an array.
Complexity is O(x^2) for the array and O(x*log(x)) for the set.
I believe it is possible to make up examples where either choice would be suboptimal; that is why I introduced this dynamic switch.
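
A rough sketch of such a dynamic switch, assuming a hypothetical KnownRows class rather than the real KnownRowsHolder: up to 16 rows are kept in a fixed array and checked by linear search, and the 17th distinct row moves everything into a std::set. Block is an empty stand-in for DB::Block, and the method names add and usesSet are invented for the example.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>
#include <utility>

struct Block {}; // empty stand-in for DB::Block

class KnownRows
{
public:
    using Row = std::pair<const Block *, size_t>; // (block, row number)
    static constexpr size_t MAX_LINEAR = 16;      // threshold for switching to a set

    // Returns true if the row was not known before and has now been recorded.
    bool add(const Row & row)
    {
        if (set_holder)
            return set_holder->insert(row).second;

        auto end = linear.begin() + count;
        if (std::find(linear.begin(), end, row) != end)
            return false; // duplicate found by linear search

        if (count < MAX_LINEAR)
        {
            linear[count++] = row;
            return true;
        }

        // The array is full: move everything into a set and continue there.
        set_holder = std::make_unique<std::set<Row>>(linear.begin(), end);
        return set_holder->insert(row).second;
    }

    bool usesSet() const { return set_holder != nullptr; }

private:
    std::array<Row, MAX_LINEAR> linear{};
    size_t count = 0;
    std::unique_ptr<std::set<Row>> set_holder;
};
```

The array path costs one linear scan per insertion, which is where the O(x^2) total comes from; once the set takes over, each insertion is logarithmic.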

vdimir · 2021-07-13T11:57:45Z

src/Interpreters/HashJoin.cpp

It is not clear what this is for; the names are obscure.

"Linear" and "Log" both refer to algorithmic complexity: we are switching from linear search to a std::set based approach.
Do you think I should replace "Linear" and "Log" with "Array" and "Set"? Or with "Small" and "Large"?

To me, Array and Set seem more obvious.

vdimir · 2021-07-13T12:57:19Z

src/Interpreters/HashJoin.cpp

Maybe it's better to make the row loop the inner one?
Also, the body is too long; is it possible to split it into functions?

I do not think it is possible to change the order of the loops: we try disjuncts one by one for every row.
The body of the loop is definitely too heavy. It looks like I tried to split it but did not manage to; it is a rather hot part of the code and it is easy to kill performance.
I'll think about it.

vdimir · 2021-07-22T07:36:26Z

Merge conflicts caused by #24420; I'm going to resolve them myself.

ilejn · 2021-07-22T14:41:45Z

@vdimir, what is the best way to proceed? May I create a branch with the suggested changes?

vdimir · 2021-07-23T16:15:18Z

> @vdimir, what is the best way to proceed? May I create a branch with the suggested changes?

I suppose yes. Sorry, I still haven't looked at the conflicts. If the changes don't intersect with many conflicting fragments, it's OK.

vdimir · 2021-07-30T10:12:50Z

Sorry for the delay; it's harder to merge the new feature into this branch than it seems. A new feature from master adds support for conditional expressions in the ON section, so the new logic should be added carefully.

ilejn · 2021-07-30T10:20:06Z

> Sorry for the delay; it's harder to merge the new feature into this branch than it seems. A new feature from master adds support for conditional expressions in the ON section, so the new logic should be added carefully.

Nothing to be sorry about.

Actually, I am busy with the same thing. Not done yet: I broke something during the merge and not all tests pass.
Looking into this.

ilejn · 2021-07-31T22:08:53Z

Managed to have it working.
I'll do some cleanup and push to the new branch by Monday morning.

ilejn · 2021-08-02T08:30:59Z

@vdimir , see https://github.com/arenadata/ClickHouse/tree/ADQM-138_review_1

ilejn · 2021-08-09T08:07:07Z

Tests pass.
We have a conflict again, but it looks trivial. Fixing.

vdimir · 2021-08-09T10:49:32Z

> Tests pass.
> We have a conflict again, but it looks trivial. Fixing.

Great! The only thing I am concerned about is the coexistence of JoinUsedFlags and BlockWithFlags for the same purpose. But I still have no idea how to unify them.

ilejn · 2021-08-09T11:16:11Z

My nearest plans: recheck array join and the join engine.

Regarding JoinUsedFlags vs BlockWithFlags.

It is fairly common for CH to have coexisting algorithms and choose the best one at runtime.
I am, say, 80% sure that it is a good idea to switch to BlockWithFlags completely. I actually performed some very rough performance evaluations and did not observe any difference.
I did not dare to get rid of JoinUsedFlags because of performance considerations (or, to put it another way, to avoid fighting with the performance tests).
I am not the one to decide whether it is possible to make further optimizations once ORs are in master, but I would definitely feel more comfortable if we took this approach.


CLAassistant · 2021-09-28T11:12:27Z

All committers have signed the CLA.

sevirov · 2021-09-30T16:57:57Z

Internal documentation ticket: DOCSUP-15714

robot-clickhouse added the doc-alert and pr-feature labels Feb 28, 2021

akuzm added the can be tested label Mar 5, 2021

vdimir self-assigned this Mar 5, 2021

den-crane mentioned this pull request Apr 2, 2021

Query with JOIN ON does not support OR #22509

Closed

ilejn force-pushed the ADQM-138 branch from 0d64e86 to aeb8e9a Compare April 5, 2021 19:27

ilejn force-pushed the ADQM-138 branch from bf870cb to bef33be Compare June 25, 2021 12:09

ilejn force-pushed the ADQM-138 branch from bef33be to d3b136a Compare July 4, 2021 22:10

ilejn changed the title ~~WIP "OR" operator in "ON" section for join~~ "OR" operator in "ON" section for join Jul 5, 2021

vdimir reviewed Jul 13, 2021


vdimir mentioned this pull request Jul 30, 2021

JOIN condition optimization #26928

Open

This was referenced Aug 3, 2021

Some optimizations for constant conditions in JOIN ON #27021

Merged

Support join on constant #25894

Merged

ilejn force-pushed the ADQM-138 branch from d3b136a to 31ad026 Compare August 3, 2021 21:07

vdimir and others added 22 commits September 28, 2021 14:11

fix

0a9a028

fix rebase collisions in ORs in JOIN

300eb50

Minor changes related to JOIN ON ORs

71b6c94

Use table_join->getAllNames in HashJoin.cpp

f8e8f6d

Add join_on_or_long.sql

212ba1b

Not implemented for asof and auto join with multiple ORs

3b35ab6

optimizeDisjuncts in ORs in JOIN

637ff19

necessary test changes for optimizeDisjuncts in ORs in JOIN

4c043a0

crash fix, style fixes, ASTs moved out of TableJoin in ORs in JOIN

8057e05

minor merge mistakes fixed in ORs in JOIN

fa6c2a6

MAX_ORS, checkStackSize and beautification per review in ORs in JOIN

78ad6bf

MAX_DISJUNCTS instead of MAX_ORS in ORs in JOIN

6daef66

checkStackSize moved to the top of DNF::distributed in ORs in JOIN

aa4751a

rebase collisions fixed in ORs in JOIN

29b911f

DNF bugfix in ORs in JOIN

17e6cfb

Do not allow in optimizeClauses conditions for different table joined…

760a92c

… via OR

bypass filer conditions in DNF in ORs in JOIN (part 1)

336b2a4

bypass filer conditions in DNF in ORs in JOIN (part 2)

bbd548e

compatible filter conditions, fixes and new tests in ORs in JOIN

626bfdf

fix bug found by PVS in ORs in JOIN

1dc7fc5

get rid of DNF and related features in ORs in JOIN

7ebc16c

minor fixes in ORs in JOIN

d67bc0b

ilejn force-pushed the ADQM-138 branch from 1f0de00 to d67bc0b Compare September 28, 2021 11:12

vdimir approved these changes Sep 28, 2021


vdimir merged commit 27f0f9f into ClickHouse:master Sep 29, 2021

ilejn mentioned this pull request Oct 12, 2021

"OR" operator in "ON" section for join #17612

Closed

UnamedRus mentioned this pull request Nov 12, 2021

JOIN ON with OR conditions to check for NULL values Altinity/tableau-connector-for-clickhouse#14

Closed

kirillikoff mentioned this pull request Dec 26, 2021

DOCSUP-15714: OR operator in ON section for JOIN #33197

Merged

