Skip to content

Conversation

@canhld94
Copy link
Contributor

@canhld94 canhld94 commented Mar 8, 2024

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

  • Allow a common table expression to be materialized as temporary table

Syntax (inspired by Postgres): WITH t AS MATERIALIZED (subquery).
The implement is basic now, mostly influenced by implementation of GLOBAL JOIN.

  • Only support materializing CTE at query level (sub-query CTEs will not be materialized)
  • Temporary table can have Memory (default) or Join engine.

Close #53449

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

@yariks5s yariks5s added the can be tested Allows running workflows for external contributors label Mar 8, 2024
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-feature Pull request with new product feature label Mar 8, 2024
@robot-clickhouse-ci-1
Copy link
Contributor

robot-clickhouse-ci-1 commented Mar 8, 2024

This is an automated comment for commit 790ba94 with description of existing statuses. It's updated for the latest CI running

❌ Click here to open a full report in a separate page

Check nameDescriptionStatus
CI runningA meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR⏳ pending
ClickHouse build checkBuilds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often has enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process⏳ pending
Fast testNormally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here❌ failure
Mergeable CheckChecks if all other necessary checks are successful❌ failure
Style checkRuns a set of checks to keep the code style clean. If some of tests failed, see the related log from the report❌ failure
Successful checks
Check nameDescriptionStatus
A SyncThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
PR CheckThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success

@adamshugar
Copy link

Just want to say this feature is incredibly necessary! We have CTEs being re-run all over the place, but it would be super cumbersome and error-prone to extract them all into temp tables. Keep up the awesome work, really excited to see this merged :)

Signed-off-by: Duc Canh Le <[email protected]>
@baptistejamin
Copy link

Awesome work!

canhld94 added 10 commits April 1, 2024 09:56
Resolve conflicts

Signed-off-by: Duc Canh Le <[email protected]>
Signed-off-by: Duc Canh Le <[email protected]>
Signed-off-by: Duc Canh Le <[email protected]>
Signed-off-by: Duc Canh Le <[email protected]>
Signed-off-by: Duc Canh Le <[email protected]>
@canhld94 canhld94 changed the title WIP: Materialized CTE Materialized CTE Apr 2, 2024
Signed-off-by: Duc Canh Le <[email protected]>
@canhld94
Copy link
Contributor Author

Hi @kitaisreal @KochetovNicolai can you help to take a look at this PR? Briefly about the implementation (same for analyzer and non-analyzer):

  • CTEs with MATERIALIZED are collected (currently at AST level) and remove from the query AST, corresponding temporary tables will be registered in query context.
  • Interpreter builds main query pipeline
  • Interpreter add a delayed pipeline to materialize the temporary tables from CTEs before executing the main pipeline; an example of query pipeline:
EXPLAIN PIPELINE
WITH t1 AS MATERIALIZED (SELECT number AS c FROM numbers(5)), t2 AS MATERIALIZED (SELECT number AS c FROM numbers(5))
SELECT * FROM t1
UNION ALL
SELECT * FROM t2;

(DelayedMaterializingCTEs)
DelayedPorts 4 → 2
  (MaterializingCTE)
  MaterializingCTETransform
    (Expression)
    ExpressionTransform
      (ReadFromSystemNumbers)
      NumbersRange 0 → 1
  (MaterializingCTE)
  MaterializingCTETransform
    (Expression)
    ExpressionTransform
      (ReadFromSystemNumbers)
      NumbersRange 0 → 1
  (Union)
    (Expression)
    ExpressionTransform
      (ReadFromMemoryStorage)
    (Expression)
    ExpressionTransform
      (ReadFromMemoryStorage)
  • Table from CTE can be materialized on-demand needed at query analyzer time (e.g. for index, compute scalars..); in this case it will not contribute to the delayed pipeline.
  • The downside of this approach is: materialized CTE doesn't have scope, so it cannot be in sub-query. CTE in subquery, even with MATERIALIZED keyword, will always be substituted.
  • The underlying table engine for CTE can be Memory (default), Join, or Set

WDYT about this? For us, this is sufficient (and I think this is sufficient for most users) but I can make a more sophisticated implementation (with proper scope for materialized CTE) if needed.

Resolve conflicts

Signed-off-by: Duc Canh Le <[email protected]>
@adamshugar
Copy link

@canhld94 This is amazing work - just wanted to speak to our use case that would be made a lot more straightforward with properly scoped materialized CTEs.

We have a query builder / BI tool type web dashboard for analyzing cloud usage and cost, with which the user can construct complicated queries. Relevant features include "compare to a prior period" and "aggregate cost across multiple usage types or cloud services". So in our backend, the constructed queries look structurally something like this (although they can get significantly more nested):

WITH prior_period AS ( 
    WITH cost_s3 AS ( ... ), cost_ec2 AS ( ... )
    SELECT ...
),
curr_period AS (
    WITH cost_s3 AS ( ... ), cost_ec2 AS ( ... )
    SELECT ...
)
SELECT ...

We intentionally construct one very complex query rather than multiple sequential queries for a couple reasons:

  1. By doing everything in a single query, we support atomicity/transactions, depending on other ClickHouse settings
  2. Significant reduction in overall query time AND increase in reliability (relative to storing intermediate results in temporary in memory tables)
  3. Significantly less code complexity (relative to intermediate temp in-mem tables)
  4. Pulling out intermediate results into memory of our backend service is untenable given the scale at which we operate

To make our use case work with this materialized CTE implementation, we could in theory extract all CTEs at every level of nesting to the top level and prefix them (e.g. prior_period_cost_s3). However this would be an extensive refactor that in my opinion goes against the natural "composability" of our BI tool's interface/structure.

Maybe I'm overlooking something or there's an alternative approach we could take with our query builder?

@canhld94
Copy link
Contributor Author

canhld94 commented Apr 22, 2024

@adamshugar yes, it is possible to implement properly scoped CTE with new analyzer. But I hardly see any case that materialized cte in deep subquery will bring benefit.
In your example:

WITH prior_period AS ( 
    WITH cost_s3 AS ( ... ), cost_ec2 AS ( ... )
    SELECT ...
),
curr_period AS (
    WITH cost_s3 AS ( ... ), cost_ec2 AS ( ... )
    SELECT ...
)
SELECT ...

It looks like you only need prior_period and curr_period to be materialized?

@adamshugar
Copy link

@canhld94 Apologies, I should've given more detail. In our case:

WITH prior_period AS ( 
    WITH cost_s3 AS ( ... ), cost_ec2 AS ( ... ),
        total_cost_transform_1 AS ( ... ), [total_cost_transform_1 uses both cost_s3 and cost_ec2]
        total_cost_transform_2 AS ( ... ), [total_cost_transform_2 uses both cost_s3 and cost_ec2]
    SELECT ... [UNIONs both total cost transforms]
),
curr_period AS (
    WITH cost_s3 AS ( ... ), cost_ec2 AS ( ... ),
        total_cost_transform_1 AS ( ... ), [total_cost_transform_1 uses both cost_s3 and cost_ec2]
        total_cost_transform_2 AS ( ... ), [total_cost_transform_2 uses both cost_s3 and cost_ec2]
    SELECT ... [UNIONs both total cost transforms]
)
SELECT ...

This may seem a little contrived, and I'm happy to share some more concrete examples. But the queries can get quite complex and I don't want to add too much irrelevant information.

Distilled to its essence, our use case is: you have an expensive-to-compute CTE (e.g. cost_s3 or cost_ec2) that itself is the input to multiple other CTEs/subqueries (e.g. total_cost_transform_1, total_cost_transform_2), making the expensive-to-compute CTE a great candidate for MATERIALIZED. The problem is that all of this is wrapped in a larger CTE (e.g. prior_period or curr_period) that can't be extracted into its own query for the reasons stated in my earlier comment.

Please let me know if I'm misunderstanding something.

Regardless, just wanted to offer an example in support of the scoped version. But we can make it work either way, and what you've built already is massively helpful. 🙏

@canhld94
Copy link
Contributor Author

canhld94 commented Apr 25, 2024

@canhld94 Apologies, I should've given more detail. In our case:

WITH prior_period AS ( 
    WITH cost_s3 AS ( ... ), cost_ec2 AS ( ... ),
        total_cost_transform_1 AS ( ... ), [total_cost_transform_1 uses both cost_s3 and cost_ec2]
        total_cost_transform_2 AS ( ... ), [total_cost_transform_2 uses both cost_s3 and cost_ec2]
    SELECT ... [UNIONs both total cost transforms]
),
curr_period AS (
    WITH cost_s3 AS ( ... ), cost_ec2 AS ( ... ),
        total_cost_transform_1 AS ( ... ), [total_cost_transform_1 uses both cost_s3 and cost_ec2]
        total_cost_transform_2 AS ( ... ), [total_cost_transform_2 uses both cost_s3 and cost_ec2]
    SELECT ... [UNIONs both total cost transforms]
)
SELECT ...

This may seem a little contrived, and I'm happy to share some more concrete examples. But the queries can get quite complex and I don't want to add too much irrelevant information.

Distilled to its essence, our use case is: you have an expensive-to-compute CTE (e.g. cost_s3 or cost_ec2) that itself is the input to multiple other CTEs/subqueries (e.g. total_cost_transform_1, total_cost_transform_2), making the expensive-to-compute CTE a great candidate for MATERIALIZED. The problem is that all of this is wrapped in a larger CTE (e.g. prior_period or curr_period) that can't be extracted into its own query for the reasons stated in my earlier comment.

Please let me know if I'm misunderstanding something.

Regardless, just wanted to offer an example in support of the scoped version. But we can make it work either way, and what you've built already is massively helpful. 🙏

@adamshugar Thanks for the example, I understand your point.
I agree that properly scoped CTE will be helpful in your case, and in general it will make the feature easier to use for more users. Nevertheless, I want to hear some opinions from the core team first. We need to have a trade-off for usability and code complexity.
cc @alexey-milovidov as well.

Resolve conflicts

Signed-off-by: Duc Canh Le <[email protected]>
@canhld94
Copy link
Contributor Author

canhld94 commented May 1, 2024

Just an update, after checking code today I think it's possible to implement materialized CTE with properly scope using new analyzer. Also it's possible to implement it as an optimisation: even without MATERIALIZED keyword, if a CTE is referred more than once then we can materialize it.

I will start working on this soon. Nevertheless, if you only need materialised CTE at query level, this PR will work w/o issue (we're running it now).

- Decouple CTE from QueryNode and UnionNode, a CTE will be represented in query tree by a CTENode with one child is the QueryNode or UnionNode

- Add MaterializeCTEPass that will materialize the CTENode

- TODO: add CTEScanStep as a query plan step and CTESource

Signed-off-by: Duc Canh Le <[email protected]>
@marbemac
Copy link

Just wanted to echo @adamshugar's comments. This change would be so powerful. My use case would also benefit greatly from scoped CTEs. I think I'm in a similar boat as @adamshugar - the product we're working on dynamically builds up queries for the end users, and we use CTEs to break the query up into parts which are referenced in several places, including other CTEs.

Also it's possible to implement it as an optimisation: even without MATERIALIZED keyword, if a CTE is referred more than once then we can materialize it.

Quick thought here - while the super majority of cases would likely benefit from materialized CTEs, there's at least one use case where I would not want the CTE to be materialized: when working with any of the random functions, for example to create mock data. IMHO automatically materializing is also just riskier in general.


@canhld94 does the latest support scoped CTEs? Eager to try it out and see how it impacts our use cases.

@marbemac
Copy link

Is the latest commit expected to be working? Built things locally to experiment, and I get the following error:

WITH all_traits AS MATERIALIZED
  (
    SELECT number FROM numbers(100)
  )
SELECT * FROM all_traits

Error: ERROR at Line 1: : DB::Exception: Table expression all_traits AS MATERIALIZED (SELECT number FROM numbers(100)) data must be initialized. In scope WITH all_traits AS MATERIALIZED (SELECT number FROM numbers(100)) SELECT * FROM all_traits

Adding Memory:

WITH all_traits AS MATERIALIZED
  (
    SELECT number FROM numbers(100)
  ) MEMORY
SELECT * FROM all_traits

Error: DB::Exception: Syntax error: failed at position 77 ('MEMORY') (line 4, col 5): MEMORY SELECT * FROM all_traits. Expected one of: ENGINE, token, Comma, FROM, SELECT

Thanks for working on this one, I really think it could be a great feature.

@jledentu
Copy link

jledentu commented Oct 8, 2024

Hi,

Any news @canhld94 ? Thank you for working on this 🙏 , this PR seems to be so promising! My use case is similar to @adamshugar's: our engine dynamically generates multiple CTEs in order to calculate formulas from various metrics, some CTEs potentially being used in other CTEs. When the number of CTEs is low, ClickHouse is highly faster than our previous Postgres implementation. But for queries containing a large number of CTEs used several times, queries are considerably slower (1s -> 10s sometimes).

I think that this PR would help a lot.

@jirislav
Copy link
Contributor

Very helpful @canhld94, thanks! What needs to happen to have this upstream?

From what I can see:

  1. Use the new analyzer. Is this neccesary?
  2. Rebase (the branch is more than 19k commits behind).
  3. Get attention of someone able to do code review and merge it. Could you @divanik please help us out with finding someone? 🙏🏼

The PR itself is so big I would suggest implementing the below mentioned points as separate PRs:

  1. Supporting sub-query CTE materialization.
  2. Use optimization to automatically decide on CTE materialization.

@alexey-milovidov alexey-milovidov mentioned this pull request Dec 31, 2024
76 tasks
@marbemac
Copy link

Love that this is included in the 2025 roadmap @alexey-milovidov! Any idea re priority / amount of effort left beyond this PR? @canhld94 are you guys using this PR or a version of this pr at ahrefs by any chance, is it still important to you?

@canhld94
Copy link
Contributor Author

canhld94 commented Jan 28, 2025

Love that this is included in the 2025 roadmap @alexey-milovidov! Any idea re priority / amount of effort left beyond this PR? @canhld94 are you guys using this PR or a version of this pr at ahrefs by any chance, is it still important to you?

@marbemac Yes, we're using this feature in our prod and it's really cool. Nevertheless, because we want it to work with both old and new analyzer, the implementation still has some limitations, e.g. materialized CTE does't work in sub-query.

We're migrating to new analyzer, and it should be done before end of Q1. After that we have plan to re-work this feature in new analyzer. So, yes, we will try to make it to upstream (we're 1-man army team now, so it's a bit busy tbh).

@novikd novikd self-assigned this Aug 4, 2025
@clickhouse-gh
Copy link

clickhouse-gh bot commented Sep 9, 2025

Dear @novikd, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

@marbemac
Copy link

hi @canhld94 - just circling back around to see if you guys are still using this? and wondering if it'll make it in in 2025 🙏

@novikd novikd self-assigned this Jan 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

WITH cte AS [NOT] MATERIALIZED (SELECT ...) SELECT * FROM cte

9 participants