"OR" operator in "ON" section for join#21320
Conversation
|
This is Work In Progress, not ready to merge, too early to run tests. Current state
I appreciate suggestion. |
|
@vdimir Maybe you would be interested in this. |
|
@ilejn could you please give update about current status, any difficulties, plans? Also, feel free to ask some help. |
@vdimir , thank you for your care. I do not see test failures related to the change (let me know if I am wrong), so I plan to
hopefully this week. |
|
To prevent duplication if more than one disjunct it true The small optimization we have in |
|
The thing still not perfect, while tests are passed so I think we can start reviewing and discussing it. |
There was a problem hiding this comment.
Thank you for impressive work!
I left some comments to discuss. Also maybe I'll dig into code (especially HashJoin.cpp) a bit dipper and I'll add something more, but for the first look it's good.
My main points are:
- It's good to add comments to some tricky places (like
DNF,joinRightColumns,KnownRowsHolder) - Lets try to introduce some structure to deal with one join key (that can be compound: several expressions connected with
and, same as disjunct) and work with vector of it, not vector of vector. I suppose it'll make code more readable. - Maybe need to add some corner cases to test, like some cases that expected to return error, tests for Storage/Dict join, test cases with nulls.
Also while reading code I've made minor style changes you can cherry-pick/merge my commits from my branch https://github.com/vdimir/ClickHouse/tree/ADQM-138
src/Interpreters/HashJoin.cpp
Outdated
There was a problem hiding this comment.
We can omit it because it required only for optimization, as you said in comment? Does everything will work correctly without it?
There was a problem hiding this comment.
The original idea is if join condition is ON T1.A = T2.A it is pointless to keep in memory both T1.A and T2.A. If a row exists in resultset, we can use value of T1.A instead of T2.A.
It does not work if we have OR (in T1.A = T2.A or T1.B = T2.B it is not clear which equality is true) and we need all columns.
The original behavior is preserved because the goal was zero cost enhancement although I think that from functional point of view it is Ok to get rid of this optimization completely.
Reconfirm if I should test if it is really true.
There was a problem hiding this comment.
The excuse is "why bother using ASTPtr since we use raw pointers to identify nodes anyway"?
So, it makes sense to avoid using raw pointers e.g. in TableJoin::addDisjunct, but I have no idea how.
src/Interpreters/TableJoin.cpp
Outdated
There was a problem hiding this comment.
What the purpose of this method? Actually we are not adding AST passed to function to any where, it's confusing
There was a problem hiding this comment.
The purpose of this method is to actually create new disjunct if the node passed to the function is known as a child of previously discovered OR.
Notes: (1) tree is already converted to DNF at this point, (2) OR can have more than two children.
Imagine, that we have a tree
A------B
| ------C
| ------D
where A is OR and B, C, and D are it's children.
Traversing tree, we are going top to bottom, and we call setDisjuncts when A is hit to memorize B, C and D.
When B is hit, we actually create new disjunct.
Same for C.
Same for D.
I'll think how to facilitate the algorithm.
It seems that function names are not perfect and I am open for suggestions.
src/Interpreters/TableJoin.h
Outdated
There was a problem hiding this comment.
key_names_right.size() == 1 should hold on calling this method? Maybe add assert or check with LOGICAL_ERROR ?
src/Interpreters/TreeRewriter.cpp
Outdated
There was a problem hiding this comment.
Why do we check function->children.size() ?
As I understand from ASTFunction::clone function can have up to 3 children: arguments, parameters and something related to window functions (not out case).
ClickHouse/src/Parsers/ASTFunction.cpp
Lines 101 to 116 in fc783be
Is it correct to say that we deals with or/and functions here, that always have only one child ? Do we need to throw error if function with more that one child met?
There was a problem hiding this comment.
I think that the assumption "we need to throw error if function with more that one child met" is correct. Well spotted!
I planned to do it but did not dare.
We need this check to survive in case of garbage requests (e.g. generated by fuzzer), for example OR without children at all.
I did not add throwing error because the problem is not new, it is reproducible with ANDs, and ClickHouse somehow processes such queries without a complain.
So, yes, I confirm that we have a problem here.
src/Interpreters/HashJoin.cpp
Outdated
There was a problem hiding this comment.
If in case of miltiple disjuncts we will use Type::hashed, can we create separate branch for it here?
There was a problem hiding this comment.
Not sure that I understand why it is helpful.
The intention is to avoid vectors if they are not needed?
src/Interpreters/HashJoin.cpp
Outdated
There was a problem hiding this comment.
| template<> | |
| class KnownRowsHolder<true> | |
| /// Used prevent duplication if more than one disjunct it true | |
| template<> | |
| class KnownRowsHolder<true> |
src/Interpreters/HashJoin.cpp
Outdated
There was a problem hiding this comment.
Does it mean that we can have 16 disjuncts at max? I didn't see check for it, maybe I missed it
There was a problem hiding this comment.
Not at all.
We are memorizing rows from the left (I think) side of join to avoid duplicates. If we have "true OR true OR true" for a particular pair of rows, we do not want to have this pair three times.
"16" is a threshold we switch from array to set.
There is a test that covers it.
There was a problem hiding this comment.
Do we actually need this dynamic switching? Does it really affects performance?
There was a problem hiding this comment.
Frankly speaking, I do not know.
If all conditions (disjuncts) produce multitude of matches for some rows, using set makes a lot of sense. Imagine, we've got million of matches from the first disjunct anf same number from the second one.
If number of matches is always small, set will be probably slower than array.
Complexity is O(x^2) for array and O(x*log(x)) for set.
I believe that it is possible to make up examples where either choice would be suboptimal, that's why I've introduced this dynamic switch.
src/Interpreters/HashJoin.cpp
Outdated
There was a problem hiding this comment.
Not clear what is for, obscure names
There was a problem hiding this comment.
"Linear" and "Log" are both about algorithm complexity, we are switching from linear search to std::set based approach.
Do you think that I should substitute "Linear" and "Log" by "Array" and "Set"? Or by "Small" and "Large"?
There was a problem hiding this comment.
As for me Array and Set seems more obvious.
src/Interpreters/HashJoin.cpp
Outdated
There was a problem hiding this comment.
Maybe it's better to make row loop inner?
Also body is too long, is it possible to split it into functions?
There was a problem hiding this comment.
I do not think that it is possible to change "order" of loops, we try disjuncts one by one for every row.
The body of the loop is definitely too heavy. Looks like I tried to split it but did not manage, it is rather 'hot' part of the code and it is easy to kill performance.
I'll think about it.
|
Merge conflicts caused by #24420, I'm going to resolve it by myself |
|
@vdimir, what is the best way to go further? May I create a branch with suggested changes? |
I suppose yes. Sorry, I still haven't look at conflicts. If changes won't intersect with many conflicted fragments it's ok |
|
Sorry for delay, it's harder to merge new feature into this branch than it seems. New feature from master adds support of conditional expressions in ON section, so new logic should be added carefully. |
Nothing to be sorry about. Actually I am busy with the same. Not done yet. I've broken something during merge and not all tests are ok. |
|
Managed to have it working. |
|
Tests are passed. |
Great! The only thing I concerned about is co-existing |
|
My nearest plans: recheck array join and join engine. Regarding JoinUsedFlags vs BlockWithFlags.
|
|
Internal documentation ticket: DOCSUP-15714 |
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Queries with JOIN ON support OR
Detailed description / Documentation draft:
...
By adding documentation, you'll allow users to try your new feature immediately, not when someone else will have time to document it later. Documentation is necessary for all features that affect user experience in any way. You can add brief documentation draft above, or add documentation right into your patch as Markdown files in docs folder.
If you are doing this for the first time, it's recommended to read the lightweight Contributing to ClickHouse Documentation guide first.
Information about CI checks: https://clickhouse.tech/docs/en/development/continuous-integration/