Sequence match quick check#27729

Merged
Avogar merged 13 commits into ClickHouse:master from ContentSquare:sequenceMatchQuickCheck
Aug 30, 2021

Conversation

@jkuklis (Contributor) commented Aug 16, 2021

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Performance Improvement

Changelog entry:
Introduces two checks in sequenceMatch and sequenceCount that allow for an early exit when some deterministic part of the sequence pattern is missing from the events list. This unlocks many queries that previously failed by reaching the operations cap, and generally speeds up the pipeline.

Detailed description

As of August 2021, ClickHouse queries containing the sequenceMatch function are among the slowest queries at Contentsquare, sporadically timing out. The current design of sequenceMatch has two major limitations:

  • whenever constraints on the time between events are used (the case for most Contentsquare queries), a backtracking algorithm is run, which might need an exponential number of operations,
  • a cap on the number of backtracking operations is employed, preventing many queries, even very simple ones, from finishing.

Most of the time when we use sequenceMatch it comes after a GROUP BY operation, and generally only a couple of percent of the groups match. Making sure that the operations limit is not reached is a challenge on its own. Thus, we propose two simple checks for the presence of deterministic fragments of the sequence pattern. These checks unlock many queries that previously failed, speed up the execution of sequenceMatch when the answer is negative, and incur only a small overhead when the answer is positive, making regular queries faster in general:

  1. Check for the presence of pattern conditions in the events list - a check run at the start of sequenceMatch and sequenceCount that verifies whether every condition occurring in the sequence pattern is present in the events list. The information about the conditions met in an events list is stored and maintained efficiently alongside the list itself. This allows for fast rejection of events lists lacking some token of the sequence pattern; moreover, in these cases the events list does not even need to be sorted.

  2. Check for deterministic fragments - a check run before the backtracking algorithm in sequenceMatch and sequenceCount. Conceptually, the pattern is first split into deterministic fragments separated by nondeterministic parts (time constraints and the Kleene star .*), and then the deterministic fragments are searched for in the events list. Each subsequent deterministic fragment must be found after the previous one; if some deterministic fragment is not found this way, sequenceMatch is guaranteed to fail, so the algorithm can exit at this point. Nondeterministic parts are ignored during this check.
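To make the pattern structure concrete, here is a minimal Python sketch of splitting a sequenceMatch pattern string into tokens. This is a hypothetical model, not ClickHouse's actual C++ parser: the function and token names are illustrative, and real patterns may allow additional time-constraint operators not handled here.

```python
import re

# Matches event conditions (?N), time constraints (?t OP N), and the Kleene star .*
TOKEN_RE = re.compile(r"\(\?(\d+)\)|\(\?t([<>]=?)(\d+)\)|\.\*")

def tokenize(pattern):
    tokens, pos = [], 0
    while pos < len(pattern):
        m = TOKEN_RE.match(pattern, pos)
        if m is None:
            raise ValueError(f"unexpected token at position {pos}")
        if m.group(1) is not None:        # event condition (?N)
            tokens.append(("cond", int(m.group(1))))
        elif m.group(2) is not None:      # time constraint (?t OP N), nondeterministic
            tokens.append(("time", m.group(2), int(m.group(3))))
        else:                             # Kleene star .*, nondeterministic
            tokens.append(("any",))
        pos = m.end()
    return tokens
```

For example, tokenize("(?1)(?t>0)(?2).*(?3)") yields one "cond" token per event condition, with the time constraint and the star marked as nondeterministic separators.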

Technical considerations

Check 1:

With each data structure for maintaining events lists (AggregateFunctionSequenceMatchData), we store a bitset with the conditions met in that events list. When adding a single event or merging two lists of events, the conditions for groups to merge are OR-ed together to form the conditions for the merged list.

Currently, whenever two lists of events are merged, each is sorted if necessary, and the two sorted lists are then merged linearly in place. Check 1 makes it possible to skip these sorts whenever it detects that some condition from the sequence pattern is absent from the events list, improving overall performance.
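The core of Check 1 can be sketched as follows. This is a simplified Python model of the idea behind the C++ AggregateFunctionSequenceMatchData change; the class and function names are illustrative, assuming each condition is represented as a bit position in an integer mask.

```python
class SequenceMatchState:
    """Toy model of an events-list aggregation state with a conditions bitset."""

    def __init__(self):
        self.events = []          # (timestamp, condition_bitmask) pairs, unsorted
        self.conditions_met = 0   # bit i is set iff condition i matched some event

    def add(self, timestamp, condition_mask):
        self.events.append((timestamp, condition_mask))
        self.conditions_met |= condition_mask        # maintained alongside the list

    def merge(self, other):
        self.events.extend(other.events)
        self.conditions_met |= other.conditions_met  # O(1) OR on merge

def may_match(state, pattern_mask):
    # Fast rejection: if some condition required by the pattern never occurred
    # in the events list, the match surely fails, and the potentially costly
    # sorting and backtracking can be skipped entirely.
    return (state.conditions_met & pattern_mask) == pattern_mask
```

For instance, a state that only ever saw conditions 1 and 2 is rejected in O(1) for any pattern that also requires condition 3, without sorting its events.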

Check 2:

This check has time complexity O(n * min(n,m) + m), where n is the length of the events list and m is the length of the sequence pattern, and operates in constant space.

The backtracking approach can require an exponential number of steps and has space complexity O(m). However, the algorithm stops when the operations limit is reached. To avoid scanning the whole events list in situations where the backtracking algorithm would not have been able to do so either, an events limit is also used for this check, with the same cap as the operations limit. This guarantees that for negative answers the number of operations is not larger for the new approach than it used to be for the original code.

For queries with positive answers, the number of operations required with the new approach is at most twice as large as for the original code.

Evaluation

We conducted experiments with small hand-made tables and the hits_v1 dataset. We found that many previously blocked queries can now finish and that regular queries got faster (usually by 10+%). We obtained our results on a Dell XPS 15 with an Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz, using clickhouse-benchmark, and confirmed the general trends with perf sampling and clickhouse-flamegraph.

The original code can reach the 1,000,000 (1kk) operations cap for a database + query pattern as small as 36 rows and 19 pattern tokens:

CREATE TABLE t (time UInt8, number UInt8) ENGINE = Memory;
INSERT INTO t VALUES (1, 1), (2, 2), (3, 1), (4, 2), (5, 1), (6, 2), (7, 1), (8, 2), (9, 1), (10, 2), (11, 1), (12, 2), (13, 1), (14, 2), (15, 1), (16, 2), (17, 1), (18, 2), (19, 1), (20, 2), (21, 1), (22, 2), (23, 1), (24, 2), (25, 1), (26, 2), (27, 1), (28, 2), (29, 1), (30, 2), (31, 1), (32, 2), (33, 1), (34, 2), (35, 1), (36, 2);
SELECT sequenceMatch('(?1)(?t>0)(?1)(?t>0)(?1)(?t>0)(?1)(?t>0)(?1)(?t>0)(?1)(?t>0)(?1)(?t>0)(?1)(?t>0)(?1)(?t>0)(?3)')(time, number = 1, number = 2, number = 3) FROM t;

The original algorithm backtracks a lot on this example, even though the last token in the pattern, (?3), never occurs in the events list. The modified approach with the extra checks handles this example easily.

For the hits_v1 dataset, here are example queries that fail for the same reason, despite the pattern having only 3 tokens, because one of the conditions (Age = -1) is never met in the table:

SELECT COUNT(*) FROM datasets.hits_v1 GROUP BY ClientIP HAVING sequenceMatch('(?1)(?t<1)(?2)')(EventTime, Age >= 0, Age = -1);
SELECT COUNT(*) FROM datasets.hits_v1 GROUP BY UserID HAVING sequenceMatch('(?1)(?t<1)(?2)')(EventTime, Age >= 0, Age = -1);

It is actually hard to find a query of this form for hits_v1 that would not fail by exceeding the cap with the original code. One GROUP BY that works is on EventTime:

SELECT COUNT(*) FROM datasets.hits_v1 GROUP BY EventTime HAVING sequenceMatch('(?1)(?t<1)(?2)')(EventTime, Age >= 0, Age = -1);

This is an example of a pathological query known to always return no matches, since Age is never negative. For this particular query the modified approach is >25% faster.

An example of a regular query that reaches the operations cap with the original code, but passes with the modified code:

SELECT ClientIP FROM datasets.hits_v1 WHERE notEmpty(URLCategories) GROUP BY ClientIP HAVING sequenceMatch('(?1)(?t<=10000)(?2)')(EventTime, hasAny(URLCategories, [9, 14837]), hasAny(URLCategories, [106]), hasAny(URLCategories, [3676]));

For a fairer performance comparison, we experimented with queries with different shares of positive answers (~1-50%), and found all of them to be faster. Note that here we must restrict ourselves to queries that finish without reaching the operations cap with the original code.

For example:

(with time constraints)

SELECT 1 FROM datasets.hits_v1 WHERE RefererCategories != [] GROUP BY ClientIP, RequestNum HAVING sequenceMatch('(?1)(?t>1000)(?3)')(EventTime, hasAny(RefererCategories, [9]), hasAny(RefererCategories, [3849, 2, 3, 4, 5, 6, 7]), hasAll(RefererCategories, [1, 9]), hasAny(RefererCategories, [1, 2326, 5496]))

or

SELECT 1 FROM datasets.hits_v1 WHERE RefererCategories != [] GROUP BY ClientIP, RequestNum HAVING sequenceMatch('(?1)(?t<10000)(?2)')(EventTime, hasAny(RefererCategories, [3849, 2, 3, 4, 5, 6, 7]), hasAny(RefererCategories, [1, 2]))

or

(no time constraints)

SELECT 1 FROM datasets.hits_v1 WHERE RefererCategories != [] GROUP BY ClientIP, RequestNum HAVING sequenceMatch('(?1)(?3)(?1)(?3)')(EventTime, hasAny(RefererCategories, [9]), hasAny(RefererCategories, [3849, 2, 3, 4, 5, 6, 7]), hasAll(RefererCategories, [1, 9]), hasAny(RefererCategories, [1, 2326, 5496]))

or

SELECT 1 FROM datasets.hits_v1 WHERE RefererCategories != [] GROUP BY ClientIP, RequestNum HAVING sequenceMatch('(?1)(?2)(?1)(?2)(?1)')(EventTime, hasAny(RefererCategories, [3849, 2, 3, 4, 5, 6, 7]), hasAny(RefererCategories, [1, 2]))

were all >10% faster with the modified approach. The numbers in the RefererCategories conditions were chosen ad hoc for these examples.

We haven't found queries of the form SELECT 1 FROM datasets.hits_v1 WHERE ... GROUP BY ... HAVING sequenceMatch ... that would be noticeably slower with the modified approach. For example, the modified approach is less than 1% slower on a query with only positive answers, such as

SELECT 1 FROM datasets.hits_v1 WHERE RefererCategories != [] GROUP BY ClientIP, RequestNum HAVING sequenceMatch('(?1)')(EventTime, Age >= 0, Age = -1)

The obtained performance gains are in line with our flame graph inspection: we heavily reduce the cost of backtrackingMatch while introducing only a small overhead with our checks.

While Check 1 and Check 2 play a similar role in allowing an early exit from sequenceMatch, and most of the performance gain can be obtained with Check 1 alone, in our experiments the code with both checks is usually slightly faster than the code with Check 1 only: Check 2 prevents some potentially costly pessimistic cases that Check 1 does not detect. This is why we decided to propose introducing both checks.

Thanks for your attention, let us know what you think!

Jakub Kuklis
Contentsquare

@robot-clickhouse robot-clickhouse added the pr-performance Pull request with some performance improvements label Aug 16, 2021
@Avogar Avogar self-assigned this Aug 17, 2021
@Avogar (Member) commented Aug 20, 2021

That's a great improvement, thanks a lot for your work! Can you add a performance test with some queries that have become faster?
You can read how to create a performance test here:
https://github.com/ClickHouse/ClickHouse/blob/master/tests/performance/README.md
If you need any help with adding a performance test, let me know.

@jkuklis (Contributor, Author) commented Aug 26, 2021

Hi!
I added a performance test with the same queries I mentioned in the PR description and with the hits_100m_single dataset. It turned out to be too slow for the PR check, so I changed the dataset to hits_10m_single; however, now I see that the PR performance tests are stuck at 66 out of 220.
Is it possible to rerun the checks (or, even better, only the performance tests)? Could you help me with that? Or shall I just push a dummy commit to rerun them?
Also, other than the performance test, is this PR all fine?
Thanks!

@ClickHouse ClickHouse deleted a comment from mergify bot Aug 26, 2021
@Avogar (Member) commented Aug 26, 2021

The performance test was stopped automatically because of a marker check failure. Please merge master into your branch (some test failures were fixed). The code LGTM; after looking at the results of the performance tests, I will merge this PR.

@jkuklis (Contributor, Author) commented Aug 30, 2021

The tests have finished; seemingly everything went well. All the sequenceMatch queries from the performance test got faster, with the ones presented in the PR having speedups from 1.086x to 1.402x (and one 4.413x outlier).

[Screenshot: performance test results, 2021-08-30]

@filimonov (Contributor) commented:

Related
#26095
#4004
