Sequence match quick check#27729
Conversation

> That's a great improvement, thanks a lot for your work! Can you add a performance test with some queries that have become faster?

> Performance test was stopped automatically because of marker check failure. Please merge master into your branch (some test failures were fixed). The code LGTM and after looking at the results of performance tests I will merge this PR.

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
Changelog category (leave one):
Changelog entry:
Introducing two checks in `sequenceMatch` and `sequenceCount` that allow for early exit when some deterministic part of the sequence pattern is missing from the events list. This change unlocks many queries that would previously fail due to reaching the operations cap, and generally speeds up the pipeline.

Detailed description
As of August 2021, ClickHouse queries containing the function `sequenceMatch` are among the slowest queries at Contentsquare, timing out sporadically. The current design of `sequenceMatch` has two major limitations: the backtracking algorithm can require an exponential number of steps, and it aborts once the operations cap is reached, so affected queries fail outright.

Most of the time when we use `sequenceMatch`, it comes after a `GROUP BY` operation, and generally only a couple of percent of groups match. Making sure that the operations limit is not reached is a challenge on its own. Thus, we propose two simple checks for the presence of deterministic fragments of the sequence pattern. These unlock many queries that would previously fail, speed up the execution of `sequenceMatch` when the answer is negative, and don't incur a large overhead when the answer is positive, making regular queries faster in general:

1. Check for presence of conditions in events lists: a check run at the start of `sequenceMatch` and `sequenceCount` that verifies whether all conditions occurring in the sequence pattern are present in the events list. The information about the conditions met in an events list is stored and maintained efficiently along with the list itself. This allows for fast rejection of sequences lacking some token from the sequence pattern. Furthermore, in these cases the events list does not need to be sorted.

2. Check for deterministic fragments: a check run before the backtracking algorithm is invoked in `sequenceMatch` and `sequenceCount`. Conceptually, the pattern is first split into deterministic fragments separated by nondeterministic parts (time constraints and the Kleene star `.*`), and then a search for the deterministic fragments is performed. Each deterministic fragment must appear after the previous one in the events list. If some deterministic fragment is not found this way, `sequenceMatch` will surely fail, so the algorithm can exit at this point. Nondeterministic parts are ignored during this check.

Technical considerations
Check 1:

With each data structure maintaining an events list (`AggregateFunctionSequenceMatchData`), we store a bitset recording the conditions met in that events list. When adding a single event or merging two lists of events, the condition bitsets of the groups being merged are OR-ed together to form the bitset of the merged list.

Currently, whenever two lists of events are merged, each of them is sorted if necessary, and then the two sorted lists are merged linearly in place. Check 1 allows these sorts to be omitted whenever it detects that some condition from the sequence pattern is not present in the events list, improving the overall performance.
Check 2:

This check has time complexity O(n * min(n, m) + m), where n is the length of the events list and m is the length of the sequence pattern, and it operates in constant space.

The backtracking approach can require an exponential number of steps and has space complexity O(m); however, the algorithm stops when a limit of operations is met. To avoid scanning the whole list of events in cases where the backtracking algorithm would not be able to do so, a limit on the number of events examined is also applied to this check, with the same cap as the limit of operations. This guarantees that for negative answers the number of operations performed by the new approach is never larger than it was for the original code. For queries with positive answers, the number of operations required with the new approach is at most twice as large as for the original code.
Evaluation
We conducted experiments with small manual tables and the `hits_v1` dataset. We found that a lot of previously locked queries can now finish, and that regular queries got faster (usually by 10% or more). We obtained our results on a Dell XPS 15 with an Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz, using `clickhouse-benchmark`, and confirmed the general trends with perf sampling and `clickhouse-flamegraph`.

The original code can reach the 1,000,000 operations cap for a database and query pattern as small as 36 rows and 19 pattern tokens:
The original algorithm backtracks a lot for this example, even though the last token in the pattern, (?3), never occurs in the events list. The modified approach with the extra checks deals with this example easily.

For the `hits_v1` dataset, example queries fail for the same reason despite having a pattern with only 3 tokens, when one of the conditions (`Age = -1`) is never met in the table. It is actually hard to find a query of this form for `hits_v1` that would not fail due to exceeding the cap with the original code. One `GROUP BY` that works is one for `EventTime`:

```sql
SELECT COUNT(*) FROM datasets.hits_v1 GROUP BY EventTime HAVING sequenceMatch('(?1)(?t<1)(?2)')(EventTime, Age >= 0, Age = -1);
```
This is an example of a pathological query known to always return no matches, since `Age` is never negative. For this particular query the modified approach is more than 25% faster.

An example of a regular query that reaches the operations cap with the original code, but passes with the modified code:

```sql
SELECT ClientIP FROM datasets.hits_v1 WHERE notEmpty(URLCategories) GROUP BY ClientIP HAVING sequenceMatch('(?1)(?t<=10000)(?2)')(EventTime, hasAny(URLCategories, [9, 14837]), hasAny(URLCategories, [106]), hasAny(URLCategories, [3676]));
```

For a fairer performance comparison, we experimented with queries with different proportions of positive answers (~1-50%), and we found them all to be faster. Note that here we need to restrict ourselves to queries which finish without reaching the operations cap with the original code.
For example, with time constraints:

```sql
SELECT 1 FROM datasets.hits_v1 WHERE RefererCategories != [] GROUP BY ClientIP, RequestNum HAVING sequenceMatch('(?1)(?t>1000)(?3)')(EventTime, hasAny(RefererCategories, [9]), hasAny(RefererCategories, [3849, 2, 3, 4, 5, 6, 7]), hasAll(RefererCategories, [1, 9]), hasAny(RefererCategories, [1, 2326, 5496]))
```

or

```sql
SELECT 1 FROM datasets.hits_v1 WHERE RefererCategories != [] GROUP BY ClientIP, RequestNum HAVING sequenceMatch('(?1)(?t<10000)(?2)')(EventTime, hasAny(RefererCategories, [3849, 2, 3, 4, 5, 6, 7]), hasAny(RefererCategories, [1, 2]))
```

and with no time constraints:

```sql
SELECT 1 FROM datasets.hits_v1 WHERE RefererCategories != [] GROUP BY ClientIP, RequestNum HAVING sequenceMatch('(?1)(?3)(?1)(?3)')(EventTime, hasAny(RefererCategories, [9]), hasAny(RefererCategories, [3849, 2, 3, 4, 5, 6, 7]), hasAll(RefererCategories, [1, 9]), hasAny(RefererCategories, [1, 2326, 5496]))
```

or

```sql
SELECT 1 FROM datasets.hits_v1 WHERE RefererCategories != [] GROUP BY ClientIP, RequestNum HAVING sequenceMatch('(?1)(?2)(?1)(?2)(?1)')(EventTime, hasAny(RefererCategories, [3849, 2, 3, 4, 5, 6, 7]), hasAny(RefererCategories, [1, 2]))
```

were all more than 10% faster with the modified approach. The numbers in the `RefererCategories` conditions were chosen ad hoc for these examples.

We haven't found queries of the form
`SELECT 1 FROM datasets.hits_v1 WHERE ... GROUP BY ... HAVING sequenceMatch ...` which would be noticeably slower with the modified approach. For example, the modified approach is less than 1% slower for a query with only positive answers, such as:

```sql
SELECT 1 FROM datasets.hits_v1 WHERE RefererCategories != [] GROUP BY ClientIP, RequestNum HAVING sequenceMatch('(?1)')(EventTime, Age >= 0, Age = -1)
```

The obtained performance gains are in line with our flame graph inspection: we heavily reduce the cost of `backtrackingMatch` while introducing only a small overhead with our checks.

While Check 1 and Check 2 play a similar role in allowing an early exit from `sequenceMatch`, and most of the performance gain can be obtained with Check 1 alone, in our experiments the code with both checks is usually a little faster than the code with Check 1 only, as Check 2 prevents some potentially costly pessimistic cases that Check 1 does not detect. This is why we decided to propose the introduction of both checks.
Thanks for your attention, let us know what you think!
Jakub Kuklis
Contentsquare