-
Notifications
You must be signed in to change notification settings - Fork 8.3k
Describe how identifiers in SELECT queries are resolved #23194
Description
Query analysis in ClickHouse is more complicated than it is in standard SQL due to the following extensions:
- aliases can be defined and used in any part of the query;
- expressions can be used in any part of the query;
- there is support for complex columns with subcolumns (although standard SQL also has some sort of it);
- there is ARRAY JOIN;
- lambda functions and higher order functions.
These extensions are very convenient for writing SQL queries we don't regret of them :)
But it is not convenient from implementation standpoint - many things are in different places in code and are underspecified. It makes further development difficult.
Let's try to describe how identifiers in SELECT queries are resolved.
Then we will have a chance to implement it in more consistent, easy to maintain fashion.
I will try to describe how it works right now and how it is supposed to work after upcoming extensions.
The text below is only for ClickHouse experts.
- Identifier can correspond to a column in a table in section FROM:
SELECT column FROM table
- Identifier can be unqualified, qualified by table or table alias:
SELECT table.column FROM table
SELECT t.column FROM table AS t
- A table can be qualified with database name:
SELECT database.table.column FROM database.table
- A question: should we allow to qualify column identifier with default database if it's not qualified in table identifier?
SELECT database.table.column FROM table
- In case of multiple tables in a query, unqualified identifer will be searched in every table. If it is found in only one table, it will be resolved to this table. In case of ambiguity, exception should be thrown.
SELECT t1.x, t2.x, y FROM (SELECT x, y) AS t1, (SELECT x, z) AS t2
- If query does not contain FROM section, it behaves the same way as if
FROM system.oneis present.
SELECT 1;
SELECT dummy;
SELECT one.dummy;
SELECT system.one.dummy;
This query also works:
SELECT *
But we may remove the support for asterisk expansion in case if section FROM is omitted.
This query works in current version of ClickHouse, but we are not proud of this fact:
SELECT ***
- Identifier can consist of multiple components and can be resolved to a column or a subcolumn of arbitrarily nested path or a table.
SELECT * FROM t WHERE x IN table;
SELECT * FROM t WHERE x IN db.table;
SELECT * FROM t WHERE dictHas(table, x);
SELECT * FROM t WHERE joinGet(db.table, x);
In this example, x is resolved to a column in table t, and table, db.table and t are resolved to tables.
SELECT nest.key.subkey FROM t
In this example, nest.key.subkey is subcolumn of complex column nest. Ambiguity is possible. In case of ambiguity, exception should be thrown.
- Table expression in FROM section can contain arbitrary number of ARRAY JOIN or LEFT ARRAY JOIN expressions.
On the left is a table and on the right is a column of Array type.
SELECT * FROM (SELECT [1, 2, 3] AS arr) ARRAY JOIN arr
A column to array join can be specified by identifier (arr in the example above) or as immediate array (literal):
SELECT * FROM table ARRAY JOIN [1, 2, 3] AS arr
If it is specified as literal, alias is required. A question: we may not require alias.
A column to array join can be resolved to complex column (nested data structure) as well:
CREATE TEMPORARY TABLE test1 (a String, nest Nested(x String, y String));
SELECT a, nest.* FROM test1 ARRAY JOIN nest;
SELECT a, n.* FROM test1 ARRAY JOIN nest AS n;
Nested data type is a syntactic sugar for Array(Tuple(...)):
CREATE TEMPORARY TABLE test1 (a String, nest Array(Tuple(x String, y String)));
SELECT a, nest.* FROM test1 ARRAY JOIN nest;
SELECT a, n.* FROM test1 ARRAY JOIN nest AS n;
If alias is specified for the right hand side of array join: ARRAY JOIN arr AS joined, the joined alias will reference to array elements and the original name arr will reference to the original value.
If alias is not specified for the right hand side of array join: ARRAY JOIN arr, the original name arr (can be qualified or not) will reference to the array elements and the original arrays cannot be referenced.
SELECT table.nest FROM table ARRAY JOIN nest
- Aliases can be specified for every expression in a query. Within the boundaries of one SELECT query/subquery, alias is globally visible, even before it's declaration.
SELECT 1 AS x, x, x + 1;
SELECT x, x + 1, 1 AS x;
SELECT x, 1 + (2 + (3 AS x));
- Aliases defined outside of subquery are not visible in subqueries (but see below).
Unless they are specified in WITH clause and enable_global_with_statement is set.
SELECT 123 AS x FROM (SELECT a, x FROM (SELECT 1 AS a, 2 AS b)); -- invalid query, identifier `x` in subquery is not found.
SELECT 123 AS x, (SELECT x) AS y; -- does not work even for scalar subqueries.
SELECT 123 AS x, 123 IN (SELECT x); -- does not work even for IN subquery either.
SET enable_global_with_statement = 1; -- this is by default, BTW.
WITH 123 AS x SELECT 1 FROM (SELECT a, x FROM (SELECT 1 AS a, 2 AS b)); -- ok, identifier `x` in subquery is defined in WITH section.
-- here we refer to table `t` (defined as subquery) three times, one of them inside another scalar subquery.
WITH t AS (SELECT 1) SELECT t, (SELECT * FROM t) FROM t;
We want to change it in the following way:
- aliases and columns from the outside will be visible inside subqueries;
- ambiguity will be resolved by scoping - the nearest name to the current scope is selected;
- columns in current scope are preferred for disambiguation to aliases from the outer scope;
- this is needed to allow correlated subqueries;
- the visibility of outer aliases inside FROM subqueries is mostly pointless but still usable for scalar values;
- Aliases defined in subquery are not visible outside of subquery (but see clarification below).
SELECT x FROM (SELECT y FROM table WHERE (1 AS x) = y) AS t; -- throws, because x is not visible outside.
SELECT t.x FROM (SELECT * FROM (SELECT 1 AS x) AS t); -- throws, because the table name `t` is not visible outside.
SELECT x FROM (SELECT * FROM (SELECT 1 AS x) AS t); -- ok.
A question: should we allow to refer to aliases from subqueries by the names qualified by subquery name?
SELECT t.x FROM (SELECT a, a + (1 AS x) AS b) t
- Only the names of SELECT expression (a.k.a projection names, correlation names) are "exported" outside as table columns:
SELECT t.x FROM (SELECT 1 AS x) AS t;
SELECT t.x FROM (SELECT x FROM tbl) AS t;
SELECT x FROM (SELECT x FROM tbl) AS t;
SELECT tbl.x FROM (SELECT x FROM tbl) AS t; -- this is wrong, the `tbl` name is not exported
SELECT t2.x FROM (SELECT x FROM tbl AS t2) AS t; -- this is also wrong, the `t2` alias is not exported
Nothing is "exported" from scalar subqueries or IN subqueries, subquery expressions:
SELECT x, (SELECT 1 AS x); -- does not work, `x` is not visible;
SELECT x IN (SELECT 1 AS x); -- does not work either;
SELECT x IN (SELECT 1 AS x) FROM (SELECT 1 AS x); -- this will work, but keep in mind that there are two different `x`.
- Aliases can be ambiguous with column names. In case of ambiguity inside the boundaries of one query/subquery, it is resolved according to the value of setting
prefer_column_name_to_alias.
SELECT x + 1 AS x, x FROM (SELECT 1 AS x);
This query will return either 2, 1 or 2, 2 depending on the value of prefer_column_name_to_alias.
SELECT x, x + 1 AS x FROM (SELECT 1 AS x);
The order of alias usage and alias definition does not matter (this may be a subject to change).
- If more than one alias with the same name are defined inside the boundaries of one query/subquery and these aliases refers to semantically different expressions, exception should be thrown.
SELECT 1 AS x, 2 AS x; -- throws exception.
- Identifier can appear as a formal parameter in a lambda function:
SELECT arrayMap(x -> x + 1, arr)
Here x is a formal parameter.
Parameters to lambda functions are scoped. Their names can be ambiguous to other identifiers. In case of ambiguity, parameter of lambda function is preferred. The nearest to the current scope parameter is preferred.
SELECT x, arrayMap((x, y) -> x[1] + y + arrayFirst(x -> x != y, x), arr) FROM (SELECT 1 AS x, [([1, 2], 3), ([4, 5], 6)] AS arr);
This query disambiguated as follows:
SELECT x1, arrayMap((x2, y2) -> x2[1] + y2 + arrayFirst(x3 -> x3 != y2, x2), arr) FROM (SELECT 1 AS x1, [([1, 2], 3), ([4, 5], 6)] AS arr);
In expression x -> expr(x) the scope of x is expr(x). So, the expression arrayMap(x -> x, x) should be read as arrayMap(x1 -> x1, x2).
- Aliases should refer to identifiers by their semantical meaning, not by the name or syntactic construction.
If alias referes to an expression that is using a name that is unavailable in the place where alias is substituted, exception should be thrown.
SELECT arrayMap(x -> [y * 2, (x + 1) AS y, 1 AS z], arr), y;
The last y refers to an expression that depends on x that is formal parameter of lambda function and is not available out of its scope.
SELECT arrayMap(x -> [y * 2, (x + 1) AS y, 1 AS z], arr), z;
This will work as alias z is not scoped.
SELECT arrayMap(x -> (x + 1) AS y, arr1), arrayMap(x -> (x || 'hello') AS y, arr);
This should throw exception as aliases y refers to different expressions.
The scope of formal parameter x is not related to any aliases that are using x.
- If identifier can be resolved to a table or a column in a table, ambiguity is resolved in the following way:
In right hand side of ARRAY JOIN it is resolved to a column.
In other parts of FROM it is resolved to a table.
In all other places it is resolved to a column. This includes right hand side of IN operator, dictGet, joinGet.
SELECT x FROM x ARRAY JOIN x WHERE x IN x
This is disambiguated as follows:
SELECT col FROM table ARRAY JOIN col WHERE col IN col
Note: we can make more cases when we simply throw exception in case of ambiguity.
- In case of ambiguity with subcolumns of complex columns, exception should be thrown.
The previous rule can be theoretically applied to subcolumns of complex columns as well:
CREATE TABLE db.tbl
(
col String,
db Nested
(
tbl Nested
(
col String
)
)
)
ENGINE = Memory;
SELECT db.tbl.col FROM db.tbl
In this example, db.tbl.col can be resolved to the name of subcolumn of the Nested column.
But it will be much better to simply throw exception in all cases of ambiguity with subcolumns.
- Asterisk or qualified asterisk can be specified in any place where a list of expressions is possible.
SELECT * FROM table; -- ok
SELECT f(*) FROM table; -- ok
SELECT * + * FROM table; -- ok, though looks weird
SELECT * GROUP BY *; -- ok
SELECT * ORDER BY *; -- not ok as every component of ORDER BY may contain ASC/DESC and COLLATE; though can be supported in some sense
SELECT * WHERE *; -- not ok as WHERE contains single expression, not a list
- Unqualified asterisk is expanded to all the columns from all tables in section FROM in current scope; without virtual (hidden) columns; without ALIAS and MATERIALIZED columns unless it's changed by the setting
asterisk_include_materialized_columns,asterisk_include_alias_columns.
In case of ambiguous column names in tables, asterisk is expanded to an identifier qualified by table name (and if table name is ambiguous, also by database name). In case of no ambiguity, unqualified name is used.
SELECT * FROM (SELECT a) AS t, (SELECT b) AS u;
-- equivalent to:
SELECT a, b FROM (SELECT a) AS t, (SELECT b) AS u;
SELECT * FROM (SELECT a) AS t, (SELECT a) AS u;
-- equivalent to:
SELECT t.a, u.a FROM (SELECT a) AS t, (SELECT a) AS u;
SELECT * FROM (SELECT a) AS db1.t, (SELECT a) AS db2.t;
-- equivalent to:
SELECT db1.t.a, db2.t.a FROM (SELECT a) AS db1.t, (SELECT a) AS db2.t;
- Qualified asterisk is expanded to all the columns from a table (with the same exceptions as for unqualified asterisk) or all subcolumns of a complex column or subcolumn of a table.
A question. Should we allow asterisk qualified by database only?
SELECT db.* FROM db.t1, db.t2; -- it's in question whether this should work.
SELECT nest.* FROM table; -- expansion of complex column.
SELECT nest FROM table; -- you can select all complex column at once as a single tuple / array of tuples.
For compatibility reasons, asterisk expansion also expands all complex columns recursively, so they are flattened into multiple columns. This maybe a subject for adjustment by settings.
CREATE TABLE t
(
x String,
nest Nested
(
a String,
b String
)
) ENGINE = Memory;
SELECT * FROM t;
-- equivalent to:
SELECT x, nest.* FROM t;
-- equivalent to:
SELECT x, nest.a, nest.b FROM t;
- Qualified asterisk as well as named tuple access operator can be applied to expressions.
SELECT generateTuple().1; -- simple tuple access operator
SELECT generateTuple().hello; -- named tuple or complex column access operator - can be applied to Nested type as well as Array of named Tuple
SELECT generateTuple().*; -- expansion of a tuple or complex column with asterisk
SELECT CAST(('hello', 1) AS Tuple(name String, count UInt32)).*
- Untuple expression is an alternative way to write asterisk expansion.
SELECT untuple(CAST(('hello', 1) AS Tuple(name String, count UInt32))); -- will give two columns `name` and `count`.
- Asterisk expansion can be modified by column transformers expressions.
SELECT * EXCEPT('hello|world');
SELECT t.* EXCEPT(hello, world);
SELECT db.t.* REPLACE(x + 1 AS x);
Column transformers can be chained:
SELECT * EXCEPT(hello) REPLACE(x + 1 AS x);
COLUMNSexpression or qualifiedCOLUMNSexpression can be used in place of asterisk or qualified asterisk. It is processed similarly but with the specified filter.
SELECT COLUMNS('^test_') FROM table;
SELECT t.COLUMNS('^test_') FROM t, u;
SELECT t.COLUMNS('^test_') EXCEPT (test_hello, test_world) FROM t, u;
The filter in form of regular expression is applied to unqualified names from the corresponding tables.
- Expressions with identical "correlation names" inside subqueries are not allowed.
SELECT * FROM (SELECT x, x FROM (SELECT 1 AS x)); -- will throw;
SELECT x FROM (SELECT x, x FROM (SELECT 1 AS x)); -- will throw;
SELECT 1 FROM (SELECT x, x FROM (SELECT 1 AS x)); -- will throw;
- Unnamed expressions inside subqueries are automatically named in implementation specific way.
SELECT `plus(1, 2)` FROM (SELECT 1 + 2); -- this works for now but not guaranteed.
A question. We can leave this automatic names only for outermost query and simply not allow to refer unnamed expressions from subqueries.
- Identifiers can be resolved to functions. Lambda expressions can be aliased. (proposal)
SELECT arrayMap(plus, arr1, arr2);
SELECT x -> x + 1 AS fun, arrayMap(fun, arr);
This is needed to allow SQL UDFs.
Aggregate functions can be found as well:
SELECT arrayReduce(sum, arr);
-- currently works only this way:
SELECT arrayReduce('sum', arr);
In case of ambiguity, function is less preferred than alias or column name or table name.
- Aliases specified inside table function expression are not visible outside of table function expression.
SELECT x FROM mysql(('host' AS x) || '123', port, db, table); -- does not work as `x` is not visible outside.
- Table may have ALIAS columns.
These columns can be referred as usual columns, but they are expanded during query analysis (for better possibilities of optimization).
CREATE TEMPORARY TABLE t
(
x UInt8,
y ALIAS x + 1
);
SELECT y FROM t;
If an expression of ALIAS column defines other aliases, these aliases are only visible inside this expression.
CREATE TEMPORARY TABLE t
(
x UInt8,
y ALIAS ((x + 1) AS z) + 1
);
SELECT x, y, z FROM t; -- does not work, because `z` is not visible.
-- this is correct:
CREATE TEMPORARY TABLE t
(
x UInt8,
y ALIAS z + 1,
z ALIAS x + 1
);
ALIAS columns may refer other ALIAS columns in the same table. The order of ALIAS columns does not matter for that.