Description
SuperDB's CSV/TSV readers currently depend on the presence of a header row to explicitly specify field names, whereas other SQL systems can accept unnamed fields and then apply a schema or casting to them.
Details
Repro is with super commit cf318c7.
While there's plenty of header-free CSV data floating around that might bump into this limitation, a specific example that's top-of-mind right now is the ClickBench test data where the hits.csv and hits.tsv are both provided without headers. This doesn't pose a problem for other tested systems because they all have DDL-defined schemas that are applied at the same time they read the header-less files, resulting in properly-typed, stored data that has unique column names.
To show some examples with simplified data, let's start with this data.csv that does have a header.
foo,bar
1,2
3,4
SuperDB, DuckDB, and ClickHouse all do what seems to be the right thing with it.
$ super -version
Version: cf318c79c
$ super -c "SELECT * FROM 'data.csv'"
{foo:1.,bar:2.}
{foo:3.,bar:4.}
$ duckdb --version
v1.2.2 7c039464e4
$ duckdb -c "SELECT * FROM 'data.csv'"
┌───────┬───────┐
│  foo  │  bar  │
│ int64 │ int64 │
├───────┼───────┤
│     1 │     2 │
│     3 │     4 │
└───────┴───────┘
$ clickhouse --version
ClickHouse local version 25.4.1.2934 (official build).
$ clickhouse --output-format TabSeparatedWithNames --query "SELECT * FROM 'data.csv';"
foo bar
1 2
3 4
However, consider a different input file data-noheader.csv with the header dropped:
1,2
3,4
Now we start to see the difference in how the systems handle it. SuperDB treats the first data row as a header, so that row is lost and its values become the field names:
$ super -c "SELECT * FROM 'data-noheader.csv'"
{"1":3.,"2":4.}
$ duckdb -c "SELECT * FROM 'data-noheader.csv'"
┌─────────┬─────────┐
│ column0 │ column1 │
│  int64  │  int64  │
├─────────┼─────────┤
│       1 │       2 │
│       3 │       4 │
└─────────┴─────────┘
$ clickhouse --output-format TabSeparatedWithNames --query "SELECT * FROM 'data-noheader.csv';"
c1 c2
1 2
3 4
It's interesting how the other systems concluded on their own that the data was header-free and therefore cooked up their own unique headers. Using DuckDB as an example, it looks like this is the effect of their Header Detection feature. Of course, this heuristic becomes tricky with different data, e.g., it's more ambiguous with this all-string data in file data2-noheader.csv:
hello,world
bye,friends
By default, DuckDB happens to treat that first row as a header.
$ duckdb -c "SELECT * FROM 'data2-noheader.csv'"
┌─────────┬─────────┐
│  hello  │  world  │
│ varchar │ varchar │
├─────────┼─────────┤
│ bye     │ friends │
└─────────┴─────────┘
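DuckDB's actual header detection is more involved, but the core idea can be sketched roughly (an illustrative approximation, not DuckDB's real algorithm): if the first row doesn't match the types inferred from the remaining rows, it's likely a header; when every row looks alike (e.g., all strings), the guess is ambiguous, and this sketch defaults to assuming a header, mirroring DuckDB's behavior above:

```python
import csv
import io

def looks_numeric(value):
    """Crude type sniff: can the cell be parsed as a number?"""
    try:
        float(value)
        return True
    except ValueError:
        return False

def guess_has_header(text):
    """Guess whether the first CSV row is a header: a numeric column
    headed by a non-numeric cell suggests a header row. All-string
    data stays ambiguous and falls back to assuming a header."""
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) < 2:
        return False
    first_numeric = [looks_numeric(v) for v in rows[0]]
    rest_numeric = [all(looks_numeric(r[i]) for r in rows[1:])
                    for i in range(len(rows[0]))]
    if any(rest and not first for first, rest in zip(first_numeric, rest_numeric)):
        return True
    # Ambiguous all-string case: default to treating row 1 as a header.
    return all(not n for n in first_numeric)

print(guess_has_header("foo,bar\n1,2\n3,4\n"))        # True
print(guess_has_header("1,2\n3,4\n"))                  # False
print(guess_has_header("hello,world\nbye,friends\n"))  # True (ambiguous default)
```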
If this is not what the user wanted, though, DuckDB's header=false option can toggle the behavior and restore the auto-generated unique column names.
$ duckdb -c "SELECT * FROM read_csv('data2-noheader.csv',header=false)"
┌─────────┬─────────┐
│ column0 │ column1 │
│ varchar │ varchar │
├─────────┼─────────┤
│ hello   │ world   │
│ bye     │ friends │
└─────────┴─────────┘
Also note this issue has some overlap with existing issue #3758, which pre-dates our efforts to support SQL. In that case an explicit schema would have been provided to play the role of the missing header. While that would surely still be handy in some situations, being able to read the data in without needing any header or schema also seems useful.
Bottom Line
While users might appreciate the fancy "header detection" stuff, as a first step just having something like the header=false option seems like it would be a big improvement. Once the data is read successfully (e.g., perhaps to unique, auto-generated column names like those shown in the other systems above) into record values, it could then be subjected to shaping that assigns the intended field names.
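The shaping step might look something like the following sketch (the renaming helper is hypothetical; SuperDB's actual shaping would use its own operators):

```python
def rename_fields(records, mapping):
    """Shape records read under auto-generated names (column0, ...)
    into records with the user's intended field names."""
    return [{mapping.get(k, k): v for k, v in rec.items()} for rec in records]

# Records as a headerless reader might produce them.
records = [{"column0": 1, "column1": 2}, {"column0": 3, "column1": 4}]
shaped = rename_fields(records, {"column0": "foo", "column1": "bar"})
print(shaped)  # [{'foo': 1, 'bar': 2}, {'foo': 3, 'bar': 4}]
```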