Rewrite IN (subquery) so that it can be executed as JOIN instead of CreatingSets#83991
Rewrite IN (subquery) so that it can be executed as JOIN instead of CreatingSets#83991
Conversation
3743db6 to
80832c4
Compare
aa5a628 to
68b1e3c
Compare
a5d861e to
4ad68dc
Compare
e148585 to
4be2a7f
Compare
4be2a7f to
63c89b5
Compare
| { | ||
| auto & in_second_argument = function_in_arguments_nodes[1]; | ||
|
|
||
| if (in_second_argument->as<QueryNode>()) |
There was a problem hiding this comment.
You can't check if it's a query until you resolve this node. Example:
WITH
t as (select number from numbers(10)
SELECT *
FROM numbers(20)
WHERE number in tThe second argument here will be IdentifierNode.
There was a problem hiding this comment.
added resolving the node here
| internal_exists_subquery->getProjection().getNodes().push_back(std::make_shared<IdentifierNode>(Identifier{unique_column_name})); | ||
| internal_exists_subquery->getJoinTree() = std::move(subquery_node); | ||
|
|
||
| /// SELECT 1 FROM (SELECT * AS _unique_name_ FROM subquery) WHERE a = _unique_name_ LIMIT 1 |
There was a problem hiding this comment.
added support for subqueries returning mutiple columns
| (function_name == "in" || function_name == "notIn") && | ||
| scope.context->getSettingsRef()[Setting::rewrite_in_to_join]) | ||
| { | ||
| if (!scope.context->getSettingsRef()[Setting::allow_experimental_correlated_subqueries]) |
There was a problem hiding this comment.
Maybe fallback to CreatingSets instead of exception? Both options are okay for me
There was a problem hiding this comment.
I think it's better not to silently fallback here because it might hide errors that way
|
Does this transformation work when the key There are ~90 million matching rows in the projection, and But the rest of the query is very fast because the part offsets are included in primary index analysis, and can quickly filter parts and granules. then the query takes even longer, and it appears to perform a full table scan. Is there some way to eliminate the lag of CreatingSets, but also utilize the primary key for part offsets? Then, at any size, it would be fast to join against the right side of |
I think you can try to play with this sub-query to speed it up. Does it also take 9 sec when run separately?
|
The subquery only takes ~2 seconds to complete. Then, after it's complete, the CreatingSetsTransformation takes an additional ~9 seconds to transform the result into a set for use with
Does this work for the primary index too? Notably I am not utilizing any skip indexes in the main query, just the special part offset columns. At any rate, it seems there are two separate issues:
|
| /// SELECT * AS _unique_name_ FROM subquery | ||
| auto internal_exists_subquery = std::make_shared<QueryNode>(Context::createCopy(scope.context)); | ||
| internal_exists_subquery->setIsSubquery(true); | ||
| internal_exists_subquery->getProjection().getNodes().push_back(std::make_shared<IdentifierNode>(Identifier{unique_column_name})); |
There was a problem hiding this comment.
In the future can be replaced with ColumnNode
|
|
||
| auto & copy_of_in_first_parameter = function_in_arguments_nodes[0]; | ||
|
|
||
| auto subquery_projection = std::make_shared<IdentifierNode>(Identifier{unique_column_name}); |
|
Just double checking, is this transformation going to affect queries like my earlier comment, which leverage the primary key analysis between running the subquery and the main query? If it will have a negative impact, then is the transformation optional? |
|
@EmeraldShift This transformation is optional and is disabled by default, it is controlled by |
The implementation rewrites
x IN subqueryto
EXISTS (SELECT 1 FROM (SELECT * AS _unique_name_ FROM subquery) WHERE x = _unique_name_ LIMIT 1)and the
EXISTexpression is rewritten into JOIN by de-correlation logic.Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
...
Documentation entry for user-facing changes