-
Notifications
You must be signed in to change notification settings - Fork 8k
Materialized CTE #61086
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Materialized CTE #61086
Conversation
Signed-off-by: Duc Canh Le <[email protected]>
Signed-off-by: Duc Canh Le <[email protected]>
|
This is an automated comment for commit 790ba94 with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page
Successful checks
|
Signed-off-by: Duc Canh Le <[email protected]>
Signed-off-by: Duc Canh Le <[email protected]>
Signed-off-by: Duc Canh Le <[email protected]>
Signed-off-by: Duc Canh Le <[email protected]>
|
Just want to say this feature is incredibly necessary! We have CTEs being re-run all over the place, but it would be super cumbersome and error-prone to extract them all into temp tables. Keep up the awesome work, really excited to see this merged :) |
Signed-off-by: Duc Canh Le <[email protected]>
Signed-off-by: Duc Canh Le <[email protected]>
|
Awesome work! |
Resolve conflicts Signed-off-by: Duc Canh Le <[email protected]>
Signed-off-by: Duc Canh Le <[email protected]>
Signed-off-by: Duc Canh Le <[email protected]>
…eeded for index Signed-off-by: Duc Canh Le <[email protected]>
Signed-off-by: Duc Canh Le <[email protected]>
…ted query Signed-off-by: Duc Canh Le <[email protected]>
Signed-off-by: Duc Canh Le <[email protected]>
Signed-off-by: Duc Canh Le <[email protected]>
Signed-off-by: Duc Canh Le <[email protected]>
Signed-off-by: Duc Canh Le <[email protected]>
4c3ba37 to
a685d32
Compare
Signed-off-by: Duc Canh Le <[email protected]>
Signed-off-by: Duc Canh Le <[email protected]>
Signed-off-by: Duc Canh Le <[email protected]>
Signed-off-by: Duc Canh Le <[email protected]>
…e for index analysis Signed-off-by: Duc Canh Le <[email protected]>
…subquery Signed-off-by: Duc Canh Le <[email protected]>
0671b27 to
f34f8cb
Compare
|
Hi @kitaisreal @KochetovNicolai can you help to take a look at this PR? Briefly about the implementation (same for analyzer and non-analyzer):
WDYT about this? For us, this is sufficient (and I think this is sufficient for most users) but I can make a more sophisticated implementation (with proper scope for materialized CTE) if needed. |
Resolve conflicts Signed-off-by: Duc Canh Le <[email protected]>
|
@canhld94 This is amazing work - just wanted to speak to our use case that would be made a lot more straightforward with properly scoped materialized CTEs. We have a query builder / BI tool type web dashboard for analyzing cloud usage and cost, with which the user can construct complicated queries. Relevant features include "compare to a prior period" and "aggregate cost across multiple usage types or cloud services". So in our backend, the constructed queries look structurally something like this (although they can get significantly more nested): We intentionally construct one very complex query rather than multiple sequential queries for a couple reasons:
To make our use case work with this materialized CTE implementation, we could in theory extract all CTEs at every level of nesting to the top level and prefix them (e.g. Maybe I'm overlooking something or there's an alternative approach we could take with our query builder? |
|
@adamshugar yes, it is possible to implement properly scoped CTE with new analyzer. But I hardly see any case that materialized cte in deep subquery will bring benefit. It looks like you only need |
|
@canhld94 Apologies, I should've given more detail. In our case: This may seem a little contrived, and I'm happy to share some more concrete examples. But the queries can get quite complex and I don't want to add too much irrelevant information. Distilled to its essence, our use case is: you have an expensive-to-compute CTE (e.g. Please let me know if I'm misunderstanding something. Regardless, just wanted to offer an example in support of the scoped version. But we can make it work either way, and what you've built already is massively helpful. 🙏 |
@adamshugar Thanks for the example, I understand your point. |
Resolve conflicts Signed-off-by: Duc Canh Le <[email protected]>
|
Just an update, after checking code today I think it's possible to implement materialized CTE with properly scope using new analyzer. Also it's possible to implement it as an optimisation: even without I will start working on this soon. Nevertheless, if you only need materialised CTE at query level, this PR will work w/o issue (we're running it now). |
- Decouple CTE from QueryNode and UnionNode, a CTE will be represented in query tree by a CTENode with one child is the QueryNode or UnionNode - Add MaterializeCTEPass that will materialize the CTENode - TODO: add CTEScanStep as a query plan step and CTESource Signed-off-by: Duc Canh Le <[email protected]>
|
Just wanted to echo @adamshugar's comments. This change would be so powerful. My use case would also benefit greatly from scoped CTEs. I think I'm in a similar boat as @adamshugar - the product we're working on dynamically builds up queries for the end users, and we use CTEs to break the query up into parts which are referenced in several places, including other CTEs.
Quick thought here - while the super majority of cases would likely benefit from materialized CTEs, there's at least one use case where I would not want the CTE to be materialized: when working with any of the random functions, for example to create mock data. IMHO automatically materializing is also just riskier in general. @canhld94 does the latest support scoped CTEs? Eager to try it out and see how it impacts our use cases. |
|
Is the latest commit expected to be working? Built things locally to experiment, and I get the following error: WITH all_traits AS MATERIALIZED
(
SELECT number FROM numbers(100)
)
SELECT * FROM all_traitsError: Adding Memory: WITH all_traits AS MATERIALIZED
(
SELECT number FROM numbers(100)
) MEMORY
SELECT * FROM all_traitsError: Thanks for working on this one, I really think it could be a great feature. |
|
Hi, Any news @canhld94 ? Thank you for working on this 🙏 , this PR seems to be so promising! My use case is similar to @adamshugar's: our engine dynamically generates multiple CTEs in order to calculate formulas from various metrics, some CTEs potentially being used in other CTEs. When the number of CTEs is low, ClickHouse is highly faster than our previous Postgres implementation. But for queries containing a large number of CTEs used several times, queries are considerably slower (1s -> 10s sometimes). I think that this PR would help a lot. |
|
Very helpful @canhld94, thanks! What needs to happen to have this upstream? From what I can see:
The PR itself is so big I would suggest implementing the below mentioned points as separate PRs:
|
|
Love that this is included in the 2025 roadmap @alexey-milovidov! Any idea re priority / amount of effort left beyond this PR? @canhld94 are you guys using this PR or a version of this pr at ahrefs by any chance, is it still important to you? |
@marbemac Yes, we're using this feature in our prod and it's really cool. Nevertheless, because we want it to work with both old and new analyzer, the implementation still has some limitations, e.g. materialized CTE does't work in sub-query. We're migrating to new analyzer, and it should be done before end of Q1. After that we have plan to re-work this feature in new analyzer. So, yes, we will try to make it to upstream (we're 1-man army team now, so it's a bit busy tbh). |
|
Dear @novikd, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself. |
|
hi @canhld94 - just circling back around to see if you guys are still using this? and wondering if it'll make it in in 2025 🙏 |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Syntax (inspired by Postgres):
WITH t AS MATERIALIZED (subquery).The implement is basic now, mostly influenced by implementation of
GLOBAL JOIN.Memory(default) orJoinengine.Close #53449
Documentation entry for user-facing changes