Use query hash instead of string representation to handle the sample block cache#40065
Use query hash instead of string representation to handle the sample block cache#40065Algunenano wants to merge 13 commits intoClickHouse:masterfrom
Conversation
|
I had to add the alias to the hash because it was detecting queries as equal that aren't. A sample of something that breaks (in current code) because of the hash not including aliases is the scalar subquery cache:
The new one is the correct one, the old one is incorrectly reusing the first scalar result. It's a really odd corner case, but I guess it could happen in real queries. |
|
Now |
|
Do we still need the sample block cache? |
The query from the benchmark:
So I'd say yes, at least for now. |
e3ebdf3 to
8afd571
Compare
|
It seems there are several other places that relied on the |
|
While looking at the failing tests I've detected some other queries that were working incorrectly because the alias was ignored as part of the node hash. I've added some tests for those queries. The problem with the failed tests arose from, you guessed it, introducing the alias in the hash, as some parts of the code were implicitly relying on that behaviour. To workaround this I've added some helper methods to calculate other hashes were needed: get the hash of the contents of a subquery (ignoring its alias, id or cte name) to prepare and cache sets, and ignore the alias of functions when processing and optimizing queries based on constraints. I'm thinking that maybe the alias shouldn't be part of the node and be a node on itself, but it's likely better to delay massive changes around that until #23194 is clearer. In any case, the performance seems still good: master (8.454 QPS) vs PR (9.552 QPS) |
|
Although the query analyzer is being worked on, I still think this PR is worth including: it improves the performance of the sample block cache (and that is not going away I think) and fixes several bugs in the query interpretation. |
|
@kitaisreal said that Sample Block Cache is entirely unneeded if Analyzer is used. Ok, let's continue on this PR, but I did not review it yet... |
|
@Algunenano no need to continue this pr. |
Great!
I don't consider profile events are not only for clients, but also for debugging. Clients doesn't need to know anything about ZK either, but we do when investigating the behaviour of the query. Anyway, I'm closing the PR and I'll keep (already were) an eye on the analyzer PR to see that the issues found here are fixed by the analyzer too. |
|
@Algunenano thank you! |

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Start pushing to speed up query interpretation, which has been degrading over time. Related to #39996
In this case, change how the sample block cache works to stop using the full query string and instead use the query hash. This is done because the call to
getTreeHashis much faster thanformatImpland should be equivalent. In my perf analysis of large queries the query serialization was taking almost 10% of the time:And once implemented I see an almost 10% improvement.
MASTER:
CHANGES:
I also added a couple of profile events because that's better than adding logs.
Marking it as draft for now because although I expect this to be equivalent I'm not 100% (almost but not 100%) confident all operations done during query interpretation will be properly reflected in the query hash. Let's see what the test think.