Skip to content

Differentiating null value and empty string in CSV output #5868

@philrz

Description

@philrz

In CSV output, right now SuperDB represents null values and empty strings the same. Other systems render these differently, and we might want to think about doing something similar.

Details

Repro is with super commit 2b862a5.

Given input data.json:

{"ThisISNull":null, "ThisIsEmpty":"", "ThisIString":"baz", "TheWordNullInString": "NULL"}

When rendering this as CSV, super currently shows nothing for both the null and empty string values.

$ super -version
Version: v1.18.0-449-g2b862a5e0

$ super -f csv data.json
ThisISNull,ThisIsEmpty,ThisIString,TheWordNullInString
,,baz,NULL

Whereas some SQL-based systems show:

$ duckdb --version
v1.2.2 7c039464e4

$ duckdb -csv -c "SELECT * FROM data.json"
ThisISNull,ThisIsEmpty,ThisIString,TheWordNullInString
NULL,"",baz,NULL

$ clickhouse --version
ClickHouse local version 25.4.2.31 (official build).

$ clickhouse --query "SELECT * FROM 'data.json'" --format csv
\N,"","baz","NULL"

Apparently in the ClickHouse case the "\N" is their way of trying to differentiate between what had been a NULL value in the table vs. a string that happens to contain the text "NULL". They have other settings that will vary this behavior, but that showed what they do by default.

CSV is notoriously difficult because it means distilling detailed data down to a very limited format, so I respect there's probably no one correct way to go here. However, the fact that other systems have chosen to offer different outputs in this situation suggests that users could benefit from SuperDB doing the same.

I happened to bump into this because running the ClickBench query 16 produces NULL values for some systems but the "nothing" for SuperDB, so my benchmark scripts that "diffs" CSV-based query outputs to make sure they're the same picked up on this as a material difference. I can put a workaround in my diff script to treat these as equivalent, but a change to SuperDB itself may also be justified.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions