
Output of toJSONString is not recognized as valid JSON #75017

@EmeraldShift

Description

Company or project name

No response

Describe the unexpected behaviour

When a string contains characters whose code points are >= 128 (i.e. outside the ASCII range), toJSONString returns a string that doesn't pass isValidJSON, and which raises an exception when cast as ::JSON.

For example, the string unhex(hex(255)) is a single-character string, where that character has code point 255.

SELECT toJSONString(unhex(hex(255))) returns "�", which is a pair of quotes surrounding the raw byte 0xFF.

SELECT toJSONString(unhex(hex(255))) piped through jq returns the same string and exits successfully, so I want to believe this is valid JSON.

However, SELECT isValidJSON(toJSONString(unhex(hex(255)))) returns 0.

Also, SELECT toJSONString(unhex(hex(255)))::JSON raises an exception:

Code: 117. DB::Exception: Cannot parse JSON object here: "�": In scope SELECT CAST(toJSONString(unhex(hex(255))), 'JSON') SETTINGS enable_json_type = 1. (INCORRECT_DATA)
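The likely reason for the discrepancy: RFC 8259 requires JSON text to be encoded as UTF-8, and the raw byte 0xFF can never appear in valid UTF-8, so strict parsers reject it while lenient tools like jq pass it through. A minimal Python illustration (simulating the emitted bytes, not ClickHouse itself):

```python
import json

# Simulate the bytes toJSONString appears to emit: a quoted raw 0xFF byte.
raw = b'"\xff"'

# 0xFF is never valid in UTF-8, so a strict JSON parser rejects it...
try:
    json.loads(raw)
    strict_ok = True
except ValueError:  # UnicodeDecodeError is a subclass of ValueError
    strict_ok = False
print(strict_ok)  # False

# ...while the \u00ff escape of the same code point parses fine.
print(json.loads('"\\u00ff"') == "\xff")  # True
```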

How to reproduce

$ clickhouse local --version
ClickHouse local version 24.12.3.47 (official build).

$ clickhouse local -q 'SELECT toJSONString(unhex(hex(255)))::JSON SETTINGS enable_json_type=1'
<ERROR, pasted below>

See also https://fiddle.clickhouse.com/2acc3321-c402-49dd-b203-cfbf468a38c2

Expected behavior

Either:

  • toJSONString escapes such characters, so the output contains only ASCII characters (code points < 128)
  • isValidJSON and ::JSON accept strings containing characters with code points >= 128
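Either behavior would yield output that strict parsers accept. A small Python illustration of the two valid encodings of the same code point (standing in for the expected ClickHouse behavior, not its actual implementation):

```python
import json

ch = chr(255)  # a single character with code point 255

# Option 1: escape the code point, keeping the output pure ASCII.
escaped = json.dumps(ch)                                   # '"\u00ff"'
# Option 2: encode the code point as valid UTF-8 rather than a raw byte.
utf8 = json.dumps(ch, ensure_ascii=False).encode("utf-8")  # b'"\xc3\xbf"'

# Both forms parse back to the same one-character string.
assert json.loads(escaped) == ch
assert json.loads(utf8) == ch
```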

Error message and/or stacktrace

Code: 117. DB::Exception: Cannot parse JSON object here: "�": In scope SELECT CAST(toJSONString(unhex(hex(255))), 'JSON') SETTINGS enable_json_type = 1. (INCORRECT_DATA)

Additional context

I am working on a project where users provide key-value pairs, and the values can be arbitrary strings of up to 260 characters. I receive these as arrays, so I am using

mapFromArrays(keys, values)::JSON

to convert them to JSON. However, this fails when a string contains these special characters, so I am unable to import the dataset and I cannot make progress.

I'm happy to implement a workaround that sanitizes these input strings, either by manually escaping the special characters, by replacing them with nulls, or by some other means.
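As a sketch of the escaping approach, done client-side before the data ever reaches ClickHouse (`keys` and `values` here are placeholders for the real arrays):

```python
import json

# Hypothetical pre-processing step: build the JSON client-side with
# ensure_ascii=True so every code point >= 128 becomes a \uXXXX escape,
# keeping the document pure ASCII and parseable by strict consumers.
keys = ["a", "b"]
values = ["plain", "\xff"]  # second value contains code point 255

doc = json.dumps(dict(zip(keys, values)), ensure_ascii=True)
print(doc)  # {"a": "plain", "b": "\u00ff"}

# The escaped document round-trips cleanly.
assert json.loads(doc) == {"a": "plain", "b": "\xff"}
```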

Metadata

Assignees

No one assigned

    Labels

    not planned: Known issue, no plans to fix it currently
    unexpected behaviour: Result is unexpected, but not entirely wrong at the same time.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
