SQL: Large cartesian product causes very long query runtime #6074

@philrz

Description

Queries containing cross joins that produce a large cartesian product currently run so long that they appear to hang indefinitely, even when they contain filters that would reduce the result set.

Details

Repro is with super commit 230007c. This was found via a query from a sqllogictest.

Many of the sqllogictests in that particular suite perform a cross join that, judging from the SELECT and FROM clauses alone, would generate a massive cartesian product, but they also include filtering that causes them to ultimately produce a small result.

To see how Postgres handles it, we'll load the data using the attached ddl.txt and time the query.

$ psql --version
psql (PostgreSQL) 17.5 (Homebrew)

$ psql postgres -q -c "CREATE DATABASE foo;"

$ psql foo -q -f ddl.txt

$ time psql foo -c "
SELECT b2, d6, b9*398, c1, e3*353+b9
  FROM t1, t9, t6, t3, t2
 WHERE d6 in (885,924,457,578,786,664)
   AND 488=d2
   AND a3=b9
   AND c9=688
   AND a1=d9;"

 b2 | d6  | ?column? | c1  | ?column? 
----+-----+----------+-----+----------
 93 | 924 |     1990 | 585 |   286994
 93 | 457 |     1990 | 585 |   286994
 93 | 786 |     1990 | 585 |   286994
 93 | 578 |     1990 | 585 |   286994
 93 | 664 |     1990 | 585 |   286994
 93 | 885 |     1990 | 585 |   286994
(6 rows)

real	0m0.024s
user	0m0.010s
sys	0m0.006s

The attached foo.tgz contains Parquet files of the dumped contents of the same tables referenced in the SQL query. If I kick off super running the same query against that data, my Intel MacBook shows the process continuously consuming 100%+ CPU, and it runs for tens of minutes without ever producing a result.

$ super -version
Version: 230007cb7

$ super -c "
SELECT b2, d6, b9*398, c1, e3*353+b9
  FROM t1, t9, t6, t3, t2
 WHERE d6 in (885,924,457,578,786,664)
   AND 488=d2
   AND a3=b9
   AND c9=688
   AND a1=d9;"

[runs for tens of minutes without producing a result]

As @nwt confirmed, the initial cross join support added in #6064 does not yet include optimizations that would reduce the runtime the way Postgres evidently does. The high runtime we're observing therefore follows directly from the size of the cartesian product, i.e.:

$ for file in t*; do super -c "count()" $file; done | super -f text -
128
113
129
92
98

$ expr 128 \* 113 \* 129 \* 92 \* 98
16822557696

Cranking through 16.8 billion combinations would indeed take a while.
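For the record, the kind of optimization Postgres is presumably applying here is predicate pushdown: filters that mention only one table (like `c9=688` or `488=d2`) are applied before the join, so each input shrinks before any combinations are formed. A minimal sketch with made-up two-table data (the values and filters below are hypothetical, not from the repro):

```python
from itertools import product

# Hypothetical stand-ins for two of the tables; the real t1..t9 have 92-129 rows each.
t1 = list(range(100))
t9 = list(range(100))

# Naive cross join: materialize every combination, then filter (what super does today).
naive = [(a, b) for a, b in product(t1, t9) if a == 42 and b == 7]

# Predicate pushdown: apply single-table filters first, then join.
t1_filtered = [a for a in t1 if a == 42]  # 1 row instead of 100
t9_filtered = [b for b in t9 if b == 7]   # 1 row instead of 100
pushed = list(product(t1_filtered, t9_filtered))

assert naive == pushed  # same result, far fewer combinations examined
# Naive examines 100 * 100 = 10,000 pairs; pushdown examines 1 * 1 = 1.
```

Equality predicates that span two tables (e.g. `a3=b9`, `a1=d9`) can similarly become join keys for a hash or merge join rather than post-join filters, which is why Postgres never has to enumerate the full 16.8 billion combinations.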
