-
Notifications
You must be signed in to change notification settings - Fork 8.3k
Projection (RFC) #14730
Description
Projections are like materialized views, but defined in part-level. It provides consistency guarantees along with automatic usage in queries. This RFC proposes a way to implement projections in ClickHouse.
Projection Grammars:
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [compression_codec] [TTL expr1],
...
PROJECTION projection_name_1 (SELECT <COLUMN LIST EXPR> [WHERE] [GROUP BY] [ORDER BY]),
...
) ENGINE = engine ...
-
ALTER TABLE [db.]table ADD PROJECTION name AS SELECT <COLUMN LIST EXPR> [WHERE] [GROUP BY] [ORDER BY]; -
ALTER TABLE [db.]table DROP PROJECTION name; -
ALTER TABLE [db.]table MATERIALIZE PROJECTION name IN PARTITION partition_name; -
ALTER TABLE [db.]table CLEAR PROJECTION name IN PARTITION partition_name;
Projection Query:
A projection query is what defines a projection. It has the following grammar: SELECT <COLUMN LIST EXPR> [WHERE] [GROUP BY] [ORDER BY]. It implicitly selects data from the parent table. Aliases might be disabled first for simplicity.
Projection Storage:
Projections are stored inside the part directory. It's similar to an index but contains an subdirectory which stores an anonymous MergeTree table's part. The table is induced by the definition query of the projection. If there is a GROUP BY clause, the underlying storage engine becomes AggregatedMergeTree, and all aggregate functions are converted to either AggregateFunction or SimpleAggregateFunction. If there is an ORDER BY clause, the MergeTree table will use it as its primary key expression. During the merge process, the projection part will be merged via its storage's merge routine. The checksum of the parent table's part will combine the projection's part. Other maintenance jobs are similar to skip indices.
Replication
Metadata in ZooKeeper might be updated to replicate the newly added DDL actions. GET_PART will be extended to send and receive projection subdirectories.
Query Routing Process:
-
Check if the projection contains all the needed columns and rows.
-
If it's an aggregated projection, also check if it has the right columns inside the
GROUP BYclause along with required aggregate functions. -
If it's an sorted projection, also check how many granules will be selected by the KeyCondition.
-
Select the best feasible match.
-
The query pipeline which uses projections will be different from the one that uses the original parts. if the projection is absent in some parts, we can add the pipeline to "project" it on the fly.
Projection Analysis and Settings:
-
EXPLAIN PROJECTION should show which projection is selected, perhaps we can also add other information
-
Introduce some settings to disable creation of expensive or meaningless projections.
-
Record the resource consumption when inserting and merging projections.
Questions:
-
Is it possible to have old instances running when doing a grayscale upgrade?
-
Shall we have different projections on different replicas (heterogenous)? If so, what information should be stored in ZooKeeper?
-
What qualifies the best feasible projection match? Is it possible to select multiple different projections in one query?
-
Last but not least. Is "Projection" a good name for this feature?