Skip to content

Optimize boolean conditions using CNF / DNF #11749

@filimonov

Description

@filimonov

The problem

The table a more than hundred cols but only 3 of them are indexed, let call them PARTITION A ORDER BY B, C, D

The query use A B and D and another cols which is not indexed E.
And it's look like

select * from T where
A = x and (
    (B = 1 and D = 2 and E = 3) or
    (B = 1 and D = 3 and E = 4) /* .... repeated a lot of times */
)

While the numbers of condition is small the query is quite fast (150ms), with 50 conditions the query start to take 10s and more.

Strangely if the wheres are re ordered like so :

select * from T where
A = x and (
    B = 1 and (
       D = 2 and (E = 3 or E=4 or E=5 /*.... repeated a lot of times */ ) OR
       D = 3 and (E = 3 or E=4 or E=5 /*.... repeated a lot of times */  ) OR
       /*.... repeated a lot of times */
   )
)

the query is sub seconds.
From the boolean point of view the filter is the same but it seems clickhouse like that a way more than the first one.

Proposed solution

Introduce optimizer(s) to convert logical expression to conjunctive normal form (CNF) or disjunctive normal form (DNF)

CNF is smth like

(c1 = 2 OR c1 = 5) AND (c2 < 5 OR c2 > 20) AND (c3 = 'abc' OR c3 = 'efg')

DNF is smth like

(c1 = 2 AND c2 = 5) OR (c2 > 5 AND c2 < 10) OR c3 = 'abc'

Any logical expression has CNF / DNF form.

In some cases, CNF or DNF may be more preferable and both have more chances to use index than non-normalized conditions.

Extra options

During those optimizations, it's may be possible to detect always true / always false sub-conditions and remove them.

Some references in context of DBs

Metadata

Metadata

Assignees

Labels

comp-query-optimizerQuery plan optimization: physical plan steps, plan-level rewrites and optimizations (QueryPlan pa...feature

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions