SQL: Large cartesian product causes very long query runtime #6074

@philrz

Description

Queries containing cross joins that produce a large cartesian product currently run so long that they appear to hang indefinitely, even when they contain filters that would reduce the result set.

Details

Repro is with super commit 230007c. This was found via a query from a sqllogictest.

Many of the sqllogictests in that particular suite perform a cross join that, judging from the SELECT and FROM clauses alone, would generate a massive cartesian product, but they also include filtering that causes them to ultimately produce a small result.

To see how Postgres handles it, we'll load the data using the attached ddl.txt and time the query.

$ psql --version
psql (PostgreSQL) 17.5 (Homebrew)

$ psql postgres -q -c "CREATE DATABASE foo;"

$ psql foo -q -f ddl.txt

$ time psql foo -c "
SELECT b2, d6, b9*398, c1, e3*353+b9
  FROM t1, t9, t6, t3, t2
 WHERE d6 in (885,924,457,578,786,664)
   AND 488=d2
   AND a3=b9
   AND c9=688
   AND a1=d9;"

 b2 | d6  | ?column? | c1  | ?column? 
----+-----+----------+-----+----------
 93 | 924 |     1990 | 585 |   286994
 93 | 457 |     1990 | 585 |   286994
 93 | 786 |     1990 | 585 |   286994
 93 | 578 |     1990 | 585 |   286994
 93 | 664 |     1990 | 585 |   286994
 93 | 885 |     1990 | 585 |   286994
(6 rows)

real	0m0.024s
user	0m0.010s
sys	0m0.006s

The attached foo.tgz contains Parquet files of the dumped contents of the same tables referenced in the SQL query. If I kick off super running the same query against that data, my Intel MacBook shows the process continuously consuming 100%+ CPU, and it runs for tens of minutes without ever producing a result.

$ super -version
Version: 230007cb7

$ super -c "
SELECT b2, d6, b9*398, c1, e3*353+b9
  FROM t1, t9, t6, t3, t2
 WHERE d6 in (885,924,457,578,786,664)
   AND 488=d2
   AND a3=b9
   AND c9=688
   AND a1=d9;"

[runs for tens of minutes without producing a result]

As @nwt confirmed, the initial cross join support added in #6064 does not yet include optimizations that would reduce the runtime the way Postgres evidently does. The high runtime we're observing therefore follows directly from the size of the cartesian product, i.e.:

$ for file in t*; do super -c "count()" $file; done | super -f text -
128
113
129
92
98

$ expr 128 \* 113 \* 129 \* 92 \* 98
16822557696

Cranking through 16.8 billion combinations would indeed take a while.
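For the record, the kind of optimization Postgres is presumably applying here is predicate pushdown: filters that mention only one table (like `c9=688` or `488=d2`) are applied before the join, so each input shrinks before any combinations are formed. A minimal sketch with made-up two-table data (the values and filters below are hypothetical, not from the repro):

```python
from itertools import product

# Hypothetical stand-ins for two of the tables; the real t1..t9 have 92-129 rows each.
t1 = list(range(100))
t9 = list(range(100))

# Naive cross join: materialize every combination, then filter (what super does today).
naive = [(a, b) for a, b in product(t1, t9) if a == 42 and b == 7]

# Predicate pushdown: apply single-table filters first, then join.
t1_filtered = [a for a in t1 if a == 42]  # 1 row instead of 100
t9_filtered = [b for b in t9 if b == 7]   # 1 row instead of 100
pushed = list(product(t1_filtered, t9_filtered))

assert naive == pushed  # same result, far fewer combinations examined
# Naive examines 100 * 100 = 10,000 pairs; pushdown examines 1 * 1 = 1.
```

Equality predicates that span two tables (e.g. `a3=b9`, `a1=d9`) can similarly become join keys for a hash or merge join rather than post-join filters, which is why Postgres never has to enumerate the full 16.8 billion combinations.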
