Sequence match quick check#27729
Conversation

> That's a great improvement, thanks a lot for your work! Can you add a performance test with some queries that have become faster?

> Performance test was stopped automatically because of marker check failure. Please merge master into your branch (some test failures were fixed). The code LGTM and after looking at the results of performance tests I will merge this PR.

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
Changelog category (leave one):
Changelog entry:
Introducing two checks in `sequenceMatch` and `sequenceCount` that allow for early exit when some deterministic part of the sequence pattern is missing from the events list. This change unlocks many queries that would previously fail due to reaching the operations cap, and generally speeds up the pipeline.

Detailed description
As of August 2021, ClickHouse queries containing the function `sequenceMatch` are among the slowest queries at Contentsquare, timing out sporadically. The current design of `sequenceMatch` has two major limitations: the backtracking algorithm can require an exponential number of steps, and it aborts once the operations cap is reached, so affected queries fail outright.

Most of the time when we use `sequenceMatch`, it comes after a `GROUP BY` operation, and generally only a couple of percent of groups match. Making sure that the operations limit is not reached is a challenge on its own. Thus, we propose two simple checks for the presence of deterministic fragments of the sequence pattern. These unlock many queries that would previously fail, speed up the execution of `sequenceMatch` when the answer is negative, and don't incur a large overhead when the answer is positive, making regular queries faster in general:

1. Check for presence of conditions in events lists: a check run at the start of `sequenceMatch` and `sequenceCount` that verifies whether all conditions occurring in the sequence pattern are present in the events list. The information about the conditions met in an events list is stored and maintained efficiently along with the list itself. This allows for fast rejection of sequences lacking some token from the sequence pattern. Furthermore, in these cases the events list does not need to be sorted.

2. Check for deterministic fragments: a check run before the backtracking algorithm is invoked in `sequenceMatch` and `sequenceCount`. Conceptually, the pattern is first split into deterministic fragments separated by nondeterministic parts (time constraints and the Kleene star `.*`), and then a search for the deterministic fragments is performed. Each deterministic fragment must appear after the previous one in the events list. If some deterministic fragment is not found this way, `sequenceMatch` will surely fail, so the algorithm can exit at this point. Nondeterministic parts are ignored during this check.

Technical considerations
Check 1:

With each data structure maintaining an events list (`AggregateFunctionSequenceMatchData`), we store a bitset recording the conditions met in that events list. When adding a single event or merging two lists of events, the condition bitsets of the groups being merged are OR-ed together to form the bitset of the merged list.

Currently, whenever two lists of events are merged, each of them is sorted if necessary, and then the two sorted lists are merged linearly in place. Check 1 allows these sorts to be omitted whenever it detects that some condition from the sequence pattern is not present in the events list, improving the overall performance.
Check 2:

This check has time complexity O(n * min(n, m) + m), where n is the length of the events list and m is the length of the sequence pattern, and it operates in constant space.

The backtracking approach can require an exponential number of steps and has space complexity O(m); however, the algorithm stops when a limit of operations is met. To avoid scanning the whole list of events in cases where the backtracking algorithm would not be able to do so, a limit on the number of events examined is also applied to this check, with the same cap as the limit of operations. This guarantees that for negative answers the number of operations performed by the new approach is never larger than it was for the original code. For queries with positive answers, the number of operations required with the new approach is at most twice as large as for the original code.
Evaluation
We conducted experiments with small manual tables and the `hits_v1` dataset. We found that a lot of previously locked queries can now finish, and that regular queries got faster (usually by 10% or more). We obtained our results on a Dell XPS 15 with an Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz, using `clickhouse-benchmark`, and confirmed the general trends with perf sampling and `clickhouse-flamegraph`.

The original code can reach the 1,000,000 operations cap for a database and query pattern as small as 36 rows and 19 pattern tokens:
The original algorithm backtracks a lot for this example, even though the last token in the pattern, (?3), never occurs in the events list. The modified approach with the extra checks deals with this example easily.

For the `hits_v1` dataset, example queries fail for the same reason despite having a pattern with only 3 tokens, when one of the conditions (`Age = -1`) is never met in the table. It is actually hard to find a query of this form for `hits_v1` that would not fail due to exceeding the cap with the original code. One `GROUP BY` that works is one for `EventTime`:

```sql
SELECT COUNT(*) FROM datasets.hits_v1 GROUP BY EventTime HAVING sequenceMatch('(?1)(?t<1)(?2)')(EventTime, Age >= 0, Age = -1);
```
This is an example of a pathological query known to always return no matches, since `Age` is never negative. For this particular query the modified approach is more than 25% faster.

An example of a regular query that reaches the operations cap with the original code, but passes with the modified code:

```sql
SELECT ClientIP FROM datasets.hits_v1 WHERE notEmpty(URLCategories) GROUP BY ClientIP HAVING sequenceMatch('(?1)(?t<=10000)(?2)')(EventTime, hasAny(URLCategories, [9, 14837]), hasAny(URLCategories, [106]), hasAny(URLCategories, [3676]));
```

For a fairer performance comparison, we experimented with queries with different proportions of positive answers (~1-50%), and we found them all to be faster. Note that here we need to restrict ourselves to queries which finish without reaching the operations cap with the original code.
For example, with time constraints:

```sql
SELECT 1 FROM datasets.hits_v1 WHERE RefererCategories != [] GROUP BY ClientIP, RequestNum HAVING sequenceMatch('(?1)(?t>1000)(?3)')(EventTime, hasAny(RefererCategories, [9]), hasAny(RefererCategories, [3849, 2, 3, 4, 5, 6, 7]), hasAll(RefererCategories, [1, 9]), hasAny(RefererCategories, [1, 2326, 5496]))
```

or

```sql
SELECT 1 FROM datasets.hits_v1 WHERE RefererCategories != [] GROUP BY ClientIP, RequestNum HAVING sequenceMatch('(?1)(?t<10000)(?2)')(EventTime, hasAny(RefererCategories, [3849, 2, 3, 4, 5, 6, 7]), hasAny(RefererCategories, [1, 2]))
```

and with no time constraints:

```sql
SELECT 1 FROM datasets.hits_v1 WHERE RefererCategories != [] GROUP BY ClientIP, RequestNum HAVING sequenceMatch('(?1)(?3)(?1)(?3)')(EventTime, hasAny(RefererCategories, [9]), hasAny(RefererCategories, [3849, 2, 3, 4, 5, 6, 7]), hasAll(RefererCategories, [1, 9]), hasAny(RefererCategories, [1, 2326, 5496]))
```

or

```sql
SELECT 1 FROM datasets.hits_v1 WHERE RefererCategories != [] GROUP BY ClientIP, RequestNum HAVING sequenceMatch('(?1)(?2)(?1)(?2)(?1)')(EventTime, hasAny(RefererCategories, [3849, 2, 3, 4, 5, 6, 7]), hasAny(RefererCategories, [1, 2]))
```

were all more than 10% faster with the modified approach. The numbers in the `RefererCategories` conditions were chosen ad hoc for these examples.

We haven't found queries of the form
`SELECT 1 FROM datasets.hits_v1 WHERE ... GROUP BY ... HAVING sequenceMatch ...` which would be noticeably slower with the modified approach. For example, the modified approach is less than 1% slower for a query with only positive answers, such as:

```sql
SELECT 1 FROM datasets.hits_v1 WHERE RefererCategories != [] GROUP BY ClientIP, RequestNum HAVING sequenceMatch('(?1)')(EventTime, Age >= 0, Age = -1)
```

The obtained performance gains are in line with our flame graph inspection: we heavily reduce the cost of `backtrackingMatch` while introducing only a small overhead with our checks.

While Check 1 and Check 2 play a similar role in allowing an early exit from `sequenceMatch`, and most of the performance gain can be obtained with Check 1 alone, in our experiments the code with both checks is usually a little faster than the code with Check 1 only, as Check 2 prevents some potentially costly pessimistic cases that Check 1 does not detect. This is why we decided to propose the introduction of both checks.
Thanks for your attention, let us know what you think!
Jakub Kuklis
Contentsquare