-
Notifications
You must be signed in to change notification settings - Fork 70
Description
In CSV output, right now SuperDB represents null values and empty strings the same. Other systems render these differently, and we might want to think about doing something similar.
Details
Repro is with super commit 2b862a5.
Given input data.json:
{"ThisISNull":null, "ThisIsEmpty":"", "ThisIString":"baz", "TheWordNullInString": "NULL"}
When rendering this as CSV, super currently shows nothing for both the null and empty string values.
$ super -version
Version: v1.18.0-449-g2b862a5e0
$ super -f csv data.json
ThisISNull,ThisIsEmpty,ThisIString,TheWordNullInString
,,baz,NULL
Whereas some SQL-based systems show:
$ duckdb --version
v1.2.2 7c039464e4
$ duckdb -csv -c "SELECT * FROM data.json"
ThisISNull,ThisIsEmpty,ThisIString,TheWordNullInString
NULL,"",baz,NULL
$ clickhouse --version
ClickHouse local version 25.4.2.31 (official build).
$ clickhouse --query "SELECT * FROM 'data.json'" --format csv
\N,"","baz","NULL"
Apparently in the ClickHouse case the "\N" is their way of trying to differentiate between what had been a NULL value in the table vs. a string that happens to contain the text "NULL". They have other settings that will vary this behavior, but that showed what they do by default.
CSV is notoriously difficult because it means distilling detailed data down to a very limited format, so I respect there's probably no one correct way to go here. However, the fact that other systems have chosen to offer different outputs in this situation suggests that users could benefit from SuperDB doing the same.
I happened to bump into this because running the ClickBench query 16 produces NULL values for some systems but the "nothing" for SuperDB, so my benchmark scripts that "diffs" CSV-based query outputs to make sure they're the same picked up on this as a material difference. I can put a workaround in my diff script to treat these as equivalent, but a change to SuperDB itself may also be justified.