Description
SuperDB's CSV/TSV readers currently depend on the presence of a header row to explicitly specify field names, whereas other SQL systems can accept unnamed fields and then apply a schema or casting to them.
Details
Repro is with super commit cf318c7.
While there's plenty of header-free CSV data floating around that might bump into this limitation, a specific example that's top-of-mind right now is the ClickBench test data where the hits.csv and hits.tsv are both provided without headers. This doesn't pose a problem for other tested systems because they all have DDL-defined schemas that are applied at the same time they read the header-less files, resulting in properly-typed, stored data that has unique column names.
To show some examples with simplified data, let's start with this data.csv that does have a header.
foo,bar
1,2
3,4
SuperDB, DuckDB, and ClickHouse all do what seems to be the right thing with it.
$ super -version
Version: cf318c79c
$ super -c "SELECT * FROM 'data.csv'"
{foo:1.,bar:2.}
{foo:3.,bar:4.}
$ duckdb --version
v1.2.2 7c039464e4
$ duckdb -c "SELECT * FROM 'data.csv'"
┌───────┬───────┐
│  foo  │  bar  │
│ int64 │ int64 │
├───────┼───────┤
│     1 │     2 │
│     3 │     4 │
└───────┴───────┘
$ clickhouse --version
ClickHouse local version 25.4.1.2934 (official build).
$ clickhouse --output-format TabSeparatedWithNames --query "SELECT * FROM 'data.csv';"
foo bar
1 2
3 4
However, consider a different input file data-noheader.csv with the header dropped:
1,2
3,4
Now we start to see the difference in how the systems handle it. SuperDB treats the first data row as a header, so that row is lost and its values become the field names:
$ super -c "SELECT * FROM 'data-noheader.csv'"
{"1":3.,"2":4.}
$ duckdb -c "SELECT * FROM 'data-noheader.csv'"
┌─────────┬─────────┐
│ column0 │ column1 │
│  int64  │  int64  │
├─────────┼─────────┤
│       1 │       2 │
│       3 │       4 │
└─────────┴─────────┘
$ clickhouse --output-format TabSeparatedWithNames --query "SELECT * FROM 'data-noheader.csv';"
c1 c2
1 2
3 4
It's interesting how the other systems concluded on their own that the data was header-free and therefore cooked up their own unique headers. Using DuckDB as an example, it looks like this is the effect of their Header Detection feature. Of course, this heuristic becomes tricky with different data, e.g., it's more ambiguous with this all-string data in file data2-noheader.csv:
hello,world
bye,friends
By default, DuckDB happens to treat that first row as a header.
$ duckdb -c "SELECT * FROM 'data2-noheader.csv'"
┌─────────┬─────────┐
│  hello  │  world  │
│ varchar │ varchar │
├─────────┼─────────┤
│ bye     │ friends │
└─────────┴─────────┘
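DuckDB's actual header detection is more involved, but the core idea can be sketched roughly (an illustrative approximation, not DuckDB's real algorithm): if the first row doesn't match the types inferred from the remaining rows, it's likely a header; when every row looks alike (e.g., all strings), the guess is ambiguous, and this sketch defaults to assuming a header, mirroring DuckDB's behavior above:

```python
import csv
import io

def looks_numeric(value):
    """Crude type sniff: can the cell be parsed as a number?"""
    try:
        float(value)
        return True
    except ValueError:
        return False

def guess_has_header(text):
    """Guess whether the first CSV row is a header: a numeric column
    headed by a non-numeric cell suggests a header row. All-string
    data stays ambiguous and falls back to assuming a header."""
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) < 2:
        return False
    first_numeric = [looks_numeric(v) for v in rows[0]]
    rest_numeric = [all(looks_numeric(r[i]) for r in rows[1:])
                    for i in range(len(rows[0]))]
    if any(rest and not first for first, rest in zip(first_numeric, rest_numeric)):
        return True
    # Ambiguous all-string case: default to treating row 1 as a header.
    return all(not n for n in first_numeric)

print(guess_has_header("foo,bar\n1,2\n3,4\n"))        # True
print(guess_has_header("1,2\n3,4\n"))                  # False
print(guess_has_header("hello,world\nbye,friends\n"))  # True (ambiguous default)
```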
If this is not what the user wanted, though, DuckDB's header=false option can toggle the behavior and restore the auto-generated unique column names.
$ duckdb -c "SELECT * FROM read_csv('data2-noheader.csv',header=false)"
┌─────────┬─────────┐
│ column0 │ column1 │
│ varchar │ varchar │
├─────────┼─────────┤
│ hello   │ world   │
│ bye     │ friends │
└─────────┴─────────┘
Also note this issue has some overlap with existing issue #3758, which pre-dates our efforts to support SQL. In that case an explicit schema would have been provided to play the role of the missing header. While that would surely still be handy in some situations, being able to read the data in without needing any header or schema also seems useful.
Bottom Line
While users might appreciate the fancy "header detection" stuff, as a first step just having something like the header=false option seems like it would be a big improvement. Once the data is read successfully (e.g., perhaps to unique, auto-generated column names like those shown in the other systems above) into record values, it could then be subjected to shaping that assigns the intended field names.
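The shaping step might look something like the following sketch (the renaming helper is hypothetical; SuperDB's actual shaping would use its own operators):

```python
def rename_fields(records, mapping):
    """Shape records read under auto-generated names (column0, ...)
    into records with the user's intended field names."""
    return [{mapping.get(k, k): v for k, v in rec.items()} for rec in records]

# Records as a headerless reader might produce them.
records = [{"column0": 1, "column1": 2}, {"column0": 3, "column1": 4}]
shaped = rename_fields(records, {"column0": "foo", "column1": "bar"})
print(shaped)  # [{'foo': 1, 'bar': 2}, {'foo': 3, 'bar': 4}]
```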