Materialized CTE #61086

canhld94 · 2024-03-08T10:11:55Z

Changelog category (leave one):

New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Allow a common table expression to be materialized as temporary table

Syntax (inspired by Postgres): WITH t AS MATERIALIZED (subquery).
The implement is basic now, mostly influenced by implementation of GLOBAL JOIN.

Only support materializing CTE at query level (sub-query CTEs will not be materialized)
Temporary table can have Memory (default) or Join engine.

Close #53449

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

Signed-off-by: Duc Canh Le <[email protected]>

robot-clickhouse-ci-1 · 2024-03-08T15:13:19Z

This is an automated comment for commit 790ba94 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Check name	Description	Status
CI running	A meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR	⏳ pending
ClickHouse build check	Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often has enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process	⏳ pending
Fast test	Normally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here	❌ failure
Mergeable Check	Checks if all other necessary checks are successful	❌ failure
Style check	Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report	❌ failure

Successful checks

Check name	Description	Status
A Sync	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
PR Check	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success

Signed-off-by: Duc Canh Le <[email protected]>

adamshugar · 2024-03-16T21:47:50Z

Just want to say this feature is incredibly necessary! We have CTEs being re-run all over the place, but it would be super cumbersome and error-prone to extract them all into temp tables. Keep up the awesome work, really excited to see this merged :)

Signed-off-by: Duc Canh Le <[email protected]>

baptistejamin · 2024-03-22T12:48:26Z

Awesome work!

Resolve conflicts Signed-off-by: Duc Canh Le <[email protected]>

Signed-off-by: Duc Canh Le <[email protected]>

…eeded for index Signed-off-by: Duc Canh Le <[email protected]>

Signed-off-by: Duc Canh Le <[email protected]>

…ted query Signed-off-by: Duc Canh Le <[email protected]>

Signed-off-by: Duc Canh Le <[email protected]>

…e for index analysis Signed-off-by: Duc Canh Le <[email protected]>

…subquery Signed-off-by: Duc Canh Le <[email protected]>

canhld94 · 2024-04-18T08:18:12Z

Hi @kitaisreal @KochetovNicolai can you help to take a look at this PR? Briefly about the implementation (same for analyzer and non-analyzer):

CTEs with MATERIALIZED are collected (currently at AST level) and remove from the query AST, corresponding temporary tables will be registered in query context.
Interpreter builds main query pipeline
Interpreter add a delayed pipeline to materialize the temporary tables from CTEs before executing the main pipeline; an example of query pipeline:

EXPLAIN PIPELINE
WITH t1 AS MATERIALIZED (SELECT number AS c FROM numbers(5)), t2 AS MATERIALIZED (SELECT number AS c FROM numbers(5))
SELECT * FROM t1
UNION ALL
SELECT * FROM t2;

(DelayedMaterializingCTEs)
DelayedPorts 4 → 2
  (MaterializingCTE)
  MaterializingCTETransform
    (Expression)
    ExpressionTransform
      (ReadFromSystemNumbers)
      NumbersRange 0 → 1
  (MaterializingCTE)
  MaterializingCTETransform
    (Expression)
    ExpressionTransform
      (ReadFromSystemNumbers)
      NumbersRange 0 → 1
  (Union)
    (Expression)
    ExpressionTransform
      (ReadFromMemoryStorage)
    (Expression)
    ExpressionTransform
      (ReadFromMemoryStorage)

Table from CTE can be materialized on-demand needed at query analyzer time (e.g. for index, compute scalars..); in this case it will not contribute to the delayed pipeline.
The downside of this approach is: materialized CTE doesn't have scope, so it cannot be in sub-query. CTE in subquery, even with MATERIALIZED keyword, will always be substituted.
The underlying table engine for CTE can be Memory (default), Join, or Set

WDYT about this? For us, this is sufficient (and I think this is sufficient for most users) but I can make a more sophisticated implementation (with proper scope for materialized CTE) if needed.

Resolve conflicts Signed-off-by: Duc Canh Le <[email protected]>

adamshugar · 2024-04-18T17:08:28Z

@canhld94 This is amazing work - just wanted to speak to our use case that would be made a lot more straightforward with properly scoped materialized CTEs.

We have a query builder / BI tool type web dashboard for analyzing cloud usage and cost, with which the user can construct complicated queries. Relevant features include "compare to a prior period" and "aggregate cost across multiple usage types or cloud services". So in our backend, the constructed queries look structurally something like this (although they can get significantly more nested):

WITH prior_period AS ( 
    WITH cost_s3 AS ( ... ), cost_ec2 AS ( ... )
    SELECT ...
),
curr_period AS (
    WITH cost_s3 AS ( ... ), cost_ec2 AS ( ... )
    SELECT ...
)
SELECT ...

We intentionally construct one very complex query rather than multiple sequential queries for a couple reasons:

By doing everything in a single query, we support atomicity/transactions, depending on other ClickHouse settings
Significant reduction in overall query time AND increase in reliability (relative to storing intermediate results in temporary in memory tables)
Significantly less code complexity (relative to intermediate temp in-mem tables)
Pulling out intermediate results into memory of our backend service is untenable given the scale at which we operate

To make our use case work with this materialized CTE implementation, we could in theory extract all CTEs at every level of nesting to the top level and prefix them (e.g. prior_period_cost_s3). However this would be an extensive refactor that in my opinion goes against the natural "composability" of our BI tool's interface/structure.

Maybe I'm overlooking something or there's an alternative approach we could take with our query builder?

canhld94 · 2024-04-22T11:27:54Z

@adamshugar yes, it is possible to implement properly scoped CTE with new analyzer. But I hardly see any case that materialized cte in deep subquery will bring benefit.
In your example:

WITH prior_period AS ( 
    WITH cost_s3 AS ( ... ), cost_ec2 AS ( ... )
    SELECT ...
),
curr_period AS (
    WITH cost_s3 AS ( ... ), cost_ec2 AS ( ... )
    SELECT ...
)
SELECT ...

It looks like you only need prior_period and curr_period to be materialized?

adamshugar · 2024-04-24T21:36:04Z

@canhld94 Apologies, I should've given more detail. In our case:

WITH prior_period AS ( 
    WITH cost_s3 AS ( ... ), cost_ec2 AS ( ... ),
        total_cost_transform_1 AS ( ... ), [total_cost_transform_1 uses both cost_s3 and cost_ec2]
        total_cost_transform_2 AS ( ... ), [total_cost_transform_2 uses both cost_s3 and cost_ec2]
    SELECT ... [UNIONs both total cost transforms]
),
curr_period AS (
    WITH cost_s3 AS ( ... ), cost_ec2 AS ( ... ),
        total_cost_transform_1 AS ( ... ), [total_cost_transform_1 uses both cost_s3 and cost_ec2]
        total_cost_transform_2 AS ( ... ), [total_cost_transform_2 uses both cost_s3 and cost_ec2]
    SELECT ... [UNIONs both total cost transforms]
)
SELECT ...

This may seem a little contrived, and I'm happy to share some more concrete examples. But the queries can get quite complex and I don't want to add too much irrelevant information.

Distilled to its essence, our use case is: you have an expensive-to-compute CTE (e.g. cost_s3 or cost_ec2) that itself is the input to multiple other CTEs/subqueries (e.g. total_cost_transform_1, total_cost_transform_2), making the expensive-to-compute CTE a great candidate for MATERIALIZED. The problem is that all of this is wrapped in a larger CTE (e.g. prior_period or curr_period) that can't be extracted into its own query for the reasons stated in my earlier comment.

Please let me know if I'm misunderstanding something.

Regardless, just wanted to offer an example in support of the scoped version. But we can make it work either way, and what you've built already is massively helpful. 🙏

canhld94 · 2024-04-25T10:44:25Z

@canhld94 Apologies, I should've given more detail. In our case:
WITH prior_period AS ( 
    WITH cost_s3 AS ( ... ), cost_ec2 AS ( ... ),
        total_cost_transform_1 AS ( ... ), [total_cost_transform_1 uses both cost_s3 and cost_ec2]
        total_cost_transform_2 AS ( ... ), [total_cost_transform_2 uses both cost_s3 and cost_ec2]
    SELECT ... [UNIONs both total cost transforms]
),
curr_period AS (
    WITH cost_s3 AS ( ... ), cost_ec2 AS ( ... ),
        total_cost_transform_1 AS ( ... ), [total_cost_transform_1 uses both cost_s3 and cost_ec2]
        total_cost_transform_2 AS ( ... ), [total_cost_transform_2 uses both cost_s3 and cost_ec2]
    SELECT ... [UNIONs both total cost transforms]
)
SELECT ...
This may seem a little contrived, and I'm happy to share some more concrete examples. But the queries can get quite complex and I don't want to add too much irrelevant information.

Distilled to its essence, our use case is: you have an expensive-to-compute CTE (e.g. cost_s3 or cost_ec2) that itself is the input to multiple other CTEs/subqueries (e.g. total_cost_transform_1, total_cost_transform_2), making the expensive-to-compute CTE a great candidate for MATERIALIZED. The problem is that all of this is wrapped in a larger CTE (e.g. prior_period or curr_period) that can't be extracted into its own query for the reasons stated in my earlier comment.

Please let me know if I'm misunderstanding something.

Regardless, just wanted to offer an example in support of the scoped version. But we can make it work either way, and what you've built already is massively helpful. 🙏

@adamshugar Thanks for the example, I understand your point.
I agree that properly scoped CTE will be helpful in your case, and in general it will make the feature easier to use for more users. Nevertheless, I want to hear some opinions from the core team first. We need to have a trade-off for usability and code complexity.
cc @alexey-milovidov as well.

Resolve conflicts Signed-off-by: Duc Canh Le <[email protected]>

canhld94 · 2024-05-01T11:16:13Z

Just an update, after checking code today I think it's possible to implement materialized CTE with properly scope using new analyzer. Also it's possible to implement it as an optimisation: even without MATERIALIZED keyword, if a CTE is referred more than once then we can materialize it.

I will start working on this soon. Nevertheless, if you only need materialised CTE at query level, this PR will work w/o issue (we're running it now).

- Decouple CTE from QueryNode and UnionNode, a CTE will be represented in query tree by a CTENode with one child is the QueryNode or UnionNode - Add MaterializeCTEPass that will materialize the CTENode - TODO: add CTEScanStep as a query plan step and CTESource Signed-off-by: Duc Canh Le <[email protected]>

marbemac · 2024-05-15T13:38:09Z

Just wanted to echo @adamshugar's comments. This change would be so powerful. My use case would also benefit greatly from scoped CTEs. I think I'm in a similar boat as @adamshugar - the product we're working on dynamically builds up queries for the end users, and we use CTEs to break the query up into parts which are referenced in several places, including other CTEs.

Also it's possible to implement it as an optimisation: even without MATERIALIZED keyword, if a CTE is referred more than once then we can materialize it.

Quick thought here - while the super majority of cases would likely benefit from materialized CTEs, there's at least one use case where I would not want the CTE to be materialized: when working with any of the random functions, for example to create mock data. IMHO automatically materializing is also just riskier in general.

@canhld94 does the latest support scoped CTEs? Eager to try it out and see how it impacts our use cases.

marbemac · 2024-05-16T23:32:09Z

Is the latest commit expected to be working? Built things locally to experiment, and I get the following error:

WITH all_traits AS MATERIALIZED
  (
    SELECT number FROM numbers(100)
  )
SELECT * FROM all_traits

Error: ERROR at Line 1: : DB::Exception: Table expression all_traits AS MATERIALIZED (SELECT number FROM numbers(100)) data must be initialized. In scope WITH all_traits AS MATERIALIZED (SELECT number FROM numbers(100)) SELECT * FROM all_traits

Adding Memory:

WITH all_traits AS MATERIALIZED
  (
    SELECT number FROM numbers(100)
  ) MEMORY
SELECT * FROM all_traits

Error: DB::Exception: Syntax error: failed at position 77 ('MEMORY') (line 4, col 5): MEMORY SELECT * FROM all_traits. Expected one of: ENGINE, token, Comma, FROM, SELECT

Thanks for working on this one, I really think it could be a great feature.

jledentu · 2024-10-08T08:40:00Z

Hi,

Any news @canhld94 ? Thank you for working on this 🙏 , this PR seems to be so promising! My use case is similar to @adamshugar's: our engine dynamically generates multiple CTEs in order to calculate formulas from various metrics, some CTEs potentially being used in other CTEs. When the number of CTEs is low, ClickHouse is highly faster than our previous Postgres implementation. But for queries containing a large number of CTEs used several times, queries are considerably slower (1s -> 10s sometimes).

I think that this PR would help a lot.

jirislav · 2024-11-29T07:38:19Z

Very helpful @canhld94, thanks! What needs to happen to have this upstream?

From what I can see:

Use the new analyzer. Is this neccesary?
Rebase (the branch is more than 19k commits behind).
Get attention of someone able to do code review and merge it. Could you @divanik please help us out with finding someone? 🙏🏼

The PR itself is so big I would suggest implementing the below mentioned points as separate PRs:

Supporting sub-query CTE materialization.
Use optimization to automatically decide on CTE materialization.

marbemac · 2025-01-27T18:40:44Z

Love that this is included in the 2025 roadmap @alexey-milovidov! Any idea re priority / amount of effort left beyond this PR? @canhld94 are you guys using this PR or a version of this pr at ahrefs by any chance, is it still important to you?

canhld94 · 2025-01-28T02:44:53Z

Love that this is included in the 2025 roadmap @alexey-milovidov! Any idea re priority / amount of effort left beyond this PR? @canhld94 are you guys using this PR or a version of this pr at ahrefs by any chance, is it still important to you?

@marbemac Yes, we're using this feature in our prod and it's really cool. Nevertheless, because we want it to work with both old and new analyzer, the implementation still has some limitations, e.g. materialized CTE does't work in sub-query.

We're migrating to new analyzer, and it should be done before end of Q1. After that we have plan to re-work this feature in new analyzer. So, yes, we will try to make it to upstream (we're 1-man army team now, so it's a bit busy tbh).

clickhouse-gh · 2025-09-09T13:18:43Z

Dear @novikd, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

marbemac · 2025-09-10T15:30:04Z

hi @canhld94 - just circling back around to see if you guys are still using this? and wondering if it'll make it in in 2025 🙏

canhld94 added 2 commits March 8, 2024 10:02

initial implementaion

094bdf4

Signed-off-by: Duc Canh Le <[email protected]>

executing CTEs in parallel, allow custom table engine

c04f5ff

Signed-off-by: Duc Canh Le <[email protected]>

yariks5s added the can be tested Allows running workflows for external contributors label Mar 8, 2024

robot-clickhouse-ci-1 added the pr-feature Pull request with new product feature label Mar 8, 2024

canhld94 added 4 commits March 11, 2024 06:46

fixing bugs, add Set to custom CTE engines

be5af5c

Signed-off-by: Duc Canh Le <[email protected]>

fix failing CI tests

22eae18

Signed-off-by: Duc Canh Le <[email protected]>

suport using materialized cte table in joinGet

0941913

Signed-off-by: Duc Canh Le <[email protected]>

add test and initial document

45326dd

Signed-off-by: Duc Canh Le <[email protected]>

canhld94 added 2 commits March 19, 2024 14:47

fix uninitialized subquery depth

216f783

Signed-off-by: Duc Canh Le <[email protected]>

try fix style check

853fd14

Signed-off-by: Duc Canh Le <[email protected]>

canhld94 mentioned this pull request Mar 20, 2024

Add multiquery support to HTTP API #61608

Open

canhld94 added 10 commits April 1, 2024 09:56

Merge branch 'master' into materialzied_cte

f26376c

Resolve conflicts Signed-off-by: Duc Canh Le <[email protected]>

wip

35aea8a

Signed-off-by: Duc Canh Le <[email protected]>

single pipeline initial working impl

0b5b737

Signed-off-by: Duc Canh Le <[email protected]>

materialize future tables during query plan optimization if they're n…

ee0211d

…eeded for index Signed-off-by: Duc Canh Le <[email protected]>

implement materialized cte for analyzer

e247255

Signed-off-by: Duc Canh Le <[email protected]>

auto detect when to send materialized CTE to remote shard in distribu…

d1d6c89

…ted query Signed-off-by: Duc Canh Le <[email protected]>

fix for distributed query with analyzer

8af643f

Signed-off-by: Duc Canh Le <[email protected]>

add more tests

354a2f1

Signed-off-by: Duc Canh Le <[email protected]>

fix build

3d5bde3

Signed-off-by: Duc Canh Le <[email protected]>

fix style

a685d32

Signed-off-by: Duc Canh Le <[email protected]>

canhld94 force-pushed the materialzied_cte branch from 4c3ba37 to a685d32 Compare April 1, 2024 13:21

canhld94 added 3 commits April 2, 2024 02:37

fixing stateless test

7ad610d

Signed-off-by: Duc Canh Le <[email protected]>

add new setting change

05d2bea

Signed-off-by: Duc Canh Le <[email protected]>

Merge branch 'master' into materialzied_cte

460d4b9

canhld94 changed the title ~~WIP: Materialized CTE~~ Materialized CTE Apr 2, 2024

use proper cast

41fec7a

Signed-off-by: Duc Canh Le <[email protected]>

canhld94 added 4 commits April 4, 2024 02:51

fix explain query segment fault

21287c9

Signed-off-by: Duc Canh Le <[email protected]>

Merge branch 'master' into materialzied_cte

9a4da29

fix duplicate building set when cte is used in both where and prewher…

20a6e53

…e for index analysis Signed-off-by: Duc Canh Le <[email protected]>

materialize CTE at query analysis if it's needed for evaluate scalar …

f34f8cb

…subquery Signed-off-by: Duc Canh Le <[email protected]>

canhld94 force-pushed the materialzied_cte branch from 0671b27 to f34f8cb Compare April 8, 2024 10:28

Merge branch 'master' into materialzied_cte

46878b2

Resolve conflicts Signed-off-by: Duc Canh Le <[email protected]>

canhld94 mentioned this pull request Apr 19, 2024

Aggregation function that respects the order of data #62777

Closed

Merge branch 'master' into materialzied_cte

10bf5bf

Resolve conflicts Signed-off-by: Duc Canh Le <[email protected]>

brian-mann mentioned this pull request Jun 15, 2024

Allow support for either multiple queries or more direct control of the agent to ensure queries are sent via the same socket session ClickHouse/clickhouse-js#283

Closed

canhld94 mentioned this pull request Nov 27, 2024

Implement optional Common Table Expression Materialization #72015

Closed

alexey-milovidov mentioned this pull request Dec 31, 2024

Roadmap 2025 #74046

Closed

76 tasks

novikd self-assigned this Aug 4, 2025

clickhouse-gh bot unassigned novikd Sep 9, 2025

alexey-milovidov mentioned this pull request Dec 31, 2025

ClickHouse roadmap 2026 (draft) #93288

Open

72 tasks

novikd self-assigned this Jan 9, 2026

Materialized CTE #61086

Are you sure you want to change the base?

Materialized CTE #61086

Uh oh!

Conversation

canhld94 commented Mar 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Documentation entry for user-facing changes

Uh oh!

robot-clickhouse-ci-1 commented Mar 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

adamshugar commented Mar 16, 2024

Uh oh!

baptistejamin commented Mar 22, 2024

Uh oh!

canhld94 commented Apr 18, 2024

Uh oh!

adamshugar commented Apr 18, 2024

Uh oh!

canhld94 commented Apr 22, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

adamshugar commented Apr 24, 2024

Uh oh!

canhld94 commented Apr 25, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

canhld94 commented May 1, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

marbemac commented May 15, 2024

Uh oh!

marbemac commented May 16, 2024

Uh oh!

jledentu commented Oct 8, 2024

Uh oh!

jirislav commented Nov 29, 2024

Uh oh!

marbemac commented Jan 27, 2025

Uh oh!

canhld94 commented Jan 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

clickhouse-gh bot commented Sep 9, 2025

Uh oh!

marbemac commented Sep 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

canhld94 commented Mar 8, 2024 •

edited

Loading

robot-clickhouse-ci-1 commented Mar 8, 2024 •

edited

Loading

canhld94 commented Apr 22, 2024 •

edited

Loading

canhld94 commented Apr 25, 2024 •

edited

Loading

canhld94 commented May 1, 2024 •

edited

Loading

canhld94 commented Jan 28, 2025 •

edited

Loading