Skip to content

Refreshable Materialized Views #33919

@alexey-milovidov

Description

@alexey-milovidov

Use Case

Automate periodic imports of data into ClickHouse. Automate batch jobs for data transformation.

Requirements

It should run a query with some schedule and atomically replace a table with the result.

Schedule can be aligned to calendar like "3 AM every Thursday" or not aligned like "30 seconds after previous import has been finished".

It can contain some concurrency limits like "don't do more than 5 jobs of this type concurrently". Limits should work within named "concurrency group". The job also can have numeric priority to control what to run first when resources are available.

If it was not possible to run the job due to concurrency limits or if server was not up, the setting should control if it needs to run as soon as possible or skip to the next update time if more than some ratio of time before next update already passed.

The settings for randomization to avoid DoS of the sources like "spread the next update time around three minutes".

Error recovery: do up to N retries with some initial timeout and exponential backoff saturated up to another timeout.

Chaining and dependencies between the jobs: run a job only after other specified jobs are finished.

System table for introspection of these jobs and system commands to temporary stop scheduling, or stop or pause already running job, or run the job out of order.

It should be possible to run clickhouse-server with all the jobs disabled before manual command.

Out of scope

Incremental data import into partitions by time. Batch jobs with multiple queries.
We can think how to extend this mechanism to something like "every day do import and replace partitions for three previous days".

Coordination of jobs in a cluster. (Should be easily extended with schedule in Keeper).

Backing up old tables after refresh (e.g. store N backups in renamed tables), can be easily implemented.

Proposed solution

CREATE MATERIALIZED VIEW name

REFRESH 

    EVERY 1 WEEK /* on Mondays */
    | EVERY 1 WEEK OFFSET 2 DAY 3 HOUR /* Wednedsday 3 AM */
    | EVERY 30 SECOND /* on 00 and 30 second every minute */
    | AFTER 30 SECOND /* unaligned, 30 seconds after previous update has been finished */
    
    RANDOMIZE FOR 3 MINUTE /* spread uniformly random plus minus 1.5 minutes */
    
DEPENDS ON other_view1, other_view2...  /* run after subsequent finish of these views, only for AFTER */
    
SETTINGS /* fine settings for a refresh process */
    max_concurrency = 3,
    concurrency_group = 'import_from_mysql',
    priority = 1,
    max_retries = 5,
    backoff_min_timeout = 10,
    backoff_max_timeout = 600,
    backoff_exponent = 2,
    skip_update_after_seconds = 1000,
    skip_update_after_ratio = 0.5

ENGINE = /* how to store data */

AS SELECT /* what query to run, possibly including the SETTINGS for the query */
SELECT * FROM system.view_refreshes;
SYSTEM REFRESH VIEW name;    /* ask for refresh out of schedule */
SYSTEM STOP|START VIEW name; /* do not run this view on schedule */
SYSTEM CANCEL|PAUSE|START VIEW name; /* for currently running job */

ALTER, DROP, RENAME, DETACH, ATTACH work trivially for these views.
ALTER is similar to CREATE and simply changes everything (the underlying table itself will be replaced on next update).

The underlying table has the same name as the view (the view and its data are represented with the same name in the database).

Additional context

With the help of Atomic databases, table replace is lock-free.

Refreshable materialized views will simplify managing some dictionaries. For example, we can create a view to import data from MySQL to EmbeddedRocksDB table and then create a direct dictionary on top of this table.

Note: this task is not hard to implement, but it contains many minor details and gotchas.

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions