Skip to content

"OR" operator in "ON" section for join#21320

Merged
vdimir merged 46 commits intoClickHouse:masterfrom
arenadata:ADQM-138
Sep 29, 2021
Merged

"OR" operator in "ON" section for join#21320
vdimir merged 46 commits intoClickHouse:masterfrom
arenadata:ADQM-138

Conversation

@ilejn
Copy link
Copy Markdown
Contributor

@ilejn ilejn commented Feb 28, 2021

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Queries with JOIN ON support OR

Detailed description / Documentation draft:

...

By adding documentation, you'll allow users to try your new feature immediately, not when someone else will have time to document it later. Documentation is necessary for all features that affect user experience in any way. You can add brief documentation draft above, or add documentation right into your patch as Markdown files in docs folder.

If you are doing this for the first time, it's recommended to read the lightweight Contributing to ClickHouse Documentation guide first.

Information about CI checks: https://clickhouse.tech/docs/en/development/continuous-integration/

@robot-clickhouse robot-clickhouse added doc-alert pr-feature Pull request with new product feature labels Feb 28, 2021
@ilejn
Copy link
Copy Markdown
Contributor Author

ilejn commented Feb 28, 2021

This is Work In Progress, not ready to merge, too early to run tests.
Implements #17612

Current state

  • plenty of debug messages/comment/other nasty things, it is too early to get rid of them, sorry.
  • stateless tests are not OK after merge with recent master (LowCardinality things are problematic), worked before, hope to have it fixed soon.
  • issues with full/right join (should not be serious)
  • not suitable for production deduplication engine (see unordered_set known_rows) - the problem here is rows with more than one true disjuncts. I plan to use newly born HashTable::offsetInternal to have it O(1)
  • extra loops in joinRightColumns - plan to have two versions for one map and for several maps
    one join method for all disjuncts (fallback to Type::hashed). Although the only reason is intention to have all functions pre-instantiated, I think it is Ok

I appreciate suggestion.

@akuzm
Copy link
Copy Markdown
Contributor

akuzm commented Mar 5, 2021

@vdimir Maybe you would be interested in this.

@vdimir
Copy link
Copy Markdown
Member

vdimir commented Jun 15, 2021

@ilejn could you please give update about current status, any difficulties, plans?

Also, feel free to ask some help.

@ilejn
Copy link
Copy Markdown
Contributor Author

ilejn commented Jun 15, 2021

@ilejn could you please give update about current status, any difficulties, plans?

Also, feel free to ask some help.

@vdimir , thank you for your care.

I do not see test failures related to the change (let me know if I am wrong), so I plan to

  • remove most of Traces
  • remove auxiliary comments and other garbage
  • create a description of what is done (and what is not)
  • remove WIP marker
  • ask for review

hopefully this week.

@ilejn
Copy link
Copy Markdown
Contributor Author

ilejn commented Jun 17, 2021

HashJoin::RightTableData::maps is now std::vector<MapsVariant>, not just a single MapsVariant. All MapsVariant have the same type, it is possible to have several maps for e.g. Type::String or for Type::UInt32, while not together.
If types are different everything is converted to Type::hashed.

To prevent duplication if more than one disjunct it true KnownRowsHolder based on std::pair<const Block*, DB::RowRef::SizeT> uniqueness is introduced.

The small optimization we have in TableJoin::getRequiredRightKeys turned off if multiple disjuncts because the idea of "using value from left table because of equality" does not work well.
DNF::process is called in TreeRewriter::analyzeSelect to convert to Disjunctive normal form.

@ilejn ilejn changed the title WIP "OR" operator in "ON" section for join "OR" operator in "ON" section for join Jul 5, 2021
@ilejn
Copy link
Copy Markdown
Contributor Author

ilejn commented Jul 5, 2021

The thing still not perfect, while tests are passed so I think we can start reviewing and discussing it.
Another reason - some development in master, that makes merges more and more painful.

Copy link
Copy Markdown
Member

@vdimir vdimir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for impressive work!

I left some comments to discuss. Also maybe I'll dig into code (especially HashJoin.cpp) a bit dipper and I'll add something more, but for the first look it's good.

My main points are:

  • It's good to add comments to some tricky places (like DNF, joinRightColumns, KnownRowsHolder)
  • Lets try to introduce some structure to deal with one join key (that can be compound: several expressions connected with and, same as disjunct) and work with vector of it, not vector of vector. I suppose it'll make code more readable.
  • Maybe need to add some corner cases to test, like some cases that expected to return error, tests for Storage/Dict join, test cases with nulls.

Also while reading code I've made minor style changes you can cherry-pick/merge my commits from my branch https://github.com/vdimir/ClickHouse/tree/ADQM-138

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can omit it because it required only for optimization, as you said in comment? Does everything will work correctly without it?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original idea is if join condition is ON T1.A = T2.A it is pointless to keep in memory both T1.A and T2.A. If a row exists in resultset, we can use value of T1.A instead of T2.A.
It does not work if we have OR (in T1.A = T2.A or T1.B = T2.B it is not clear which equality is true) and we need all columns.
The original behavior is preserved because the goal was zero cost enhancement although I think that from functional point of view it is Ok to get rid of this optimization completely.
Reconfirm if I should test if it is really true.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why raw pointers, not ASTPtr ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will be fixed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The excuse is "why bother using ASTPtr since we use raw pointers to identify nodes anyway"?
So, it makes sense to avoid using raw pointers e.g. in TableJoin::addDisjunct, but I have no idea how.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What the purpose of this method? Actually we are not adding AST passed to function to any where, it's confusing

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of this method is to actually create new disjunct if the node passed to the function is known as a child of previously discovered OR.
Notes: (1) tree is already converted to DNF at this point, (2) OR can have more than two children.

Imagine, that we have a tree
A------B
| ------C
| ------D
where A is OR and B, C, and D are it's children.
Traversing tree, we are going top to bottom, and we call setDisjuncts when A is hit to memorize B, C and D.
When B is hit, we actually create new disjunct.
Same for C.
Same for D.

I'll think how to facilitate the algorithm.
It seems that function names are not perfect and I am open for suggestions.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

key_names_right.size() == 1 should hold on calling this method? Maybe add assert or check with LOGICAL_ERROR ?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we check function->children.size() ?

As I understand from ASTFunction::clone function can have up to 3 children: arguments, parameters and something related to window functions (not out case).

ASTPtr ASTFunction::clone() const
{
auto res = std::make_shared<ASTFunction>(*this);
res->children.clear();
if (arguments) { res->arguments = arguments->clone(); res->children.push_back(res->arguments); }
if (parameters) { res->parameters = parameters->clone(); res->children.push_back(res->parameters); }
if (window_definition)
{
res->window_definition = window_definition->clone();
res->children.push_back(res->window_definition);
}
return res;
}

Is it correct to say that we deals with or/and functions here, that always have only one child ? Do we need to throw error if function with more that one child met?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the assumption "we need to throw error if function with more that one child met" is correct. Well spotted!
I planned to do it but did not dare.
We need this check to survive in case of garbage requests (e.g. generated by fuzzer), for example OR without children at all.
I did not add throwing error because the problem is not new, it is reproducible with ANDs, and ClickHouse somehow processes such queries without a complain.
So, yes, I confirm that we have a problem here.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If in case of miltiple disjuncts we will use Type::hashed, can we create separate branch for it here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure that I understand why it is helpful.
The intention is to avoid vectors if they are not needed?

Comment on lines 969 to 1042
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
template<>
class KnownRowsHolder<true>
/// Used prevent duplication if more than one disjunct it true
template<>
class KnownRowsHolder<true>

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it mean that we can have 16 disjuncts at max? I didn't see check for it, maybe I missed it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at all.
We are memorizing rows from the left (I think) side of join to avoid duplicates. If we have "true OR true OR true" for a particular pair of rows, we do not want to have this pair three times.
"16" is a threshold we switch from array to set.
There is a test that covers it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually need this dynamic switching? Does it really affects performance?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frankly speaking, I do not know.
If all conditions (disjuncts) produce multitude of matches for some rows, using set makes a lot of sense. Imagine, we've got million of matches from the first disjunct anf same number from the second one.
If number of matches is always small, set will be probably slower than array.
Complexity is O(x^2) for array and O(x*log(x)) for set.
I believe that it is possible to make up examples where either choice would be suboptimal, that's why I've introduced this dynamic switch.

Comment on lines 977 to 1055
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not clear what is for, obscure names

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Linear" and "Log" are both about algorithm complexity, we are switching from linear search to std::set based approach.
Do you think that I should substitute "Linear" and "Log" by "Array" and "Set"? Or by "Small" and "Large"?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for me Array and Set seems more obvious.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

Comment on lines 1132 to 1207
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's better to make row loop inner?
Also body is too long, is it possible to split it into functions?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think that it is possible to change "order" of loops, we try disjuncts one by one for every row.
The body of the loop is definitely too heavy. Looks like I tried to split it but did not manage, it is rather 'hot' part of the code and it is easy to kill performance.
I'll think about it.

@vdimir
Copy link
Copy Markdown
Member

vdimir commented Jul 22, 2021

Merge conflicts caused by #24420, I'm going to resolve it by myself

@ilejn
Copy link
Copy Markdown
Contributor Author

ilejn commented Jul 22, 2021

@vdimir, what is the best way to go further? May I create a branch with suggested changes?

@vdimir
Copy link
Copy Markdown
Member

vdimir commented Jul 23, 2021

@vdimir, what is the best way to go further? May I create a branch with suggested changes?

I suppose yes. Sorry, I still haven't look at conflicts. If changes won't intersect with many conflicted fragments it's ok

@vdimir
Copy link
Copy Markdown
Member

vdimir commented Jul 30, 2021

Sorry for delay, it's harder to merge new feature into this branch than it seems. New feature from master adds support of conditional expressions in ON section, so new logic should be added carefully.

@ilejn
Copy link
Copy Markdown
Contributor Author

ilejn commented Jul 30, 2021

Sorry for delay, it's harder to merge new feature into this branch than it seems. New feature from master adds support of conditional expressions in ON section, so new logic should be added carefully.

Nothing to be sorry about.

Actually I am busy with the same. Not done yet. I've broken something during merge and not all tests are ok.
Looking into this.

@ilejn
Copy link
Copy Markdown
Contributor Author

ilejn commented Jul 31, 2021

Managed to have it working.
I'll do some cleanup and push to the new branch by Monday morning.

@ilejn
Copy link
Copy Markdown
Contributor Author

ilejn commented Aug 2, 2021

@ilejn
Copy link
Copy Markdown
Contributor Author

ilejn commented Aug 9, 2021

Tests are passed.
We have a conflict again, but it looks trivial. Fixing.

@vdimir
Copy link
Copy Markdown
Member

vdimir commented Aug 9, 2021

Tests are passed.
We have a conflict again, but it looks trivial. Fixing.

Great! The only thing I concerned about is co-existing JoinUsedFlags and BlockWithFlags for same purpose. But I still have not idea how to unify it.

@ilejn
Copy link
Copy Markdown
Contributor Author

ilejn commented Aug 9, 2021

My nearest plans: recheck array join and join engine.

Regarding JoinUsedFlags vs BlockWithFlags.

  1. It is a sort of common for CH to have coexisting algorithms and choose the best one in runtime.
  2. I am, say, 80% sure that it is a good idea to switch to BlockWithFlags completely. Actually I performed some very rough performance evaluations and haven't observed any difference.
  3. I did not dare to get rid of JoinUsedFlags because of performance considerations (or, to put it in other words, to not fight with performance tests)
  4. I am not the one to make a decision if it is possible to make further optimizations when ORs are in master, but I definitely would fill more comfortable if we take this approach.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Sep 28, 2021

CLA assistant check
All committers have signed the CLA.

@vdimir vdimir merged commit 27f0f9f into ClickHouse:master Sep 29, 2021
@sevirov
Copy link
Copy Markdown
Contributor

sevirov commented Sep 30, 2021

Internal documentation ticket: DOCSUP-15714

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants