Add new features to schema inference for JSON formats by Avogar · Pull Request #54427 · ClickHouse/ClickHouse

Avogar · 2023-09-07T19:44:27Z

Changelog category (leave one):

New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

This PRs improves schema inference from JSON formats:

Now it's possible to infer named Tuples from JSON objects without experimantal JSON type under a setting input_format_json_try_infer_named_tuples_from_objects in JSON formats. Previously without experimantal type JSON we could only infer JSON objects as Strings or Maps, now we can infer named Tuple. Resulting Tuple type will conain all keys of objects that were read in data sample during schema inference. It can be useful for reading structured JSON data without sparse objects. The setting is enabled by default.
Allow parsing JSON array into a column with type String under setting input_format_json_read_arrays_as_strings. It can help reading arrays with values with different types.
Allow to use type String for JSON keys with unkown types (null/[]/{}) in sample data under setting input_format_json_infer_incomplete_types_as_strings. Now in JSON formats we can read any value into String column and we can avoid getting error Cannot determine type for column 'column_name' by first 25000 rows of data, most likely this column contains only Nulls or empty Arrays/Maps during schema inference by using type String for unknown types, so the data will be read successfully.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

… formats

robot-ch-test-poll2 · 2023-09-07T19:48:29Z

This is an automated comment for commit c68e008 with description of existing statuses. It's updated for the latest CI running

✅ Click here to open a full report in a separate page

Successful checks

Check name	Description	Status
AST fuzzer	Runs randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help	✅ success
CI running	A meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR	✅ success
ClickHouse build check	Builds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often has enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process	✅ success
Compatibility check	Checks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help	✅ success
Docker image for servers	The check to build and optionally push the mentioned image to docker hub	✅ success
Docs Check	Builds and tests the documentation	✅ success
Fast test	Normally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here	✅ success
Flaky tests	Checks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integrational tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc	✅ success
Install packages	Checks that the built packages are installable in a clear environment	✅ success
Integration tests	The integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests	✅ success
Mergeable Check	Checks if all other necessary checks are successful	✅ success
Performance Comparison	Measure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests	✅ success
Push to Dockerhub	The check for building and pushing the CI related docker images to docker hub	✅ success
SQLTest	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
SQLancer	Fuzzing tests that detect logical bugs with SQLancer tool	✅ success
Sqllogic	Run clickhouse on the sqllogic test set against sqlite and checks that all statements are passed	✅ success
Stateful tests	Runs stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc	✅ success
Stateless tests	Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc	✅ success
Stress test	Runs stateless functional tests concurrently from several clients to detect concurrency-related errors	✅ success
Style Check	Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report	✅ success
Unit tests	Runs the unit tests for different release types	✅ success
Upgrade check	Runs stress tests on server version from last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully startup without any errors, crashes or sanitizer asserts	✅ success

Avogar · 2023-09-07T19:48:44Z

@alexey-milovidov WDYT about enabling this new behaviour by default? I think it can cover most of the cases with json datasets.
Pros: with more or less structured json data we will infer correct Tuples with all elements in the data, so we won't need JSON type
Cons: with sparse json data resulting Tuples will contain not all elements (only elements we could read during schema inference), so we could read not all object keys and values

Right now by default we infer JSON objects as Strings (or Maps if all values in object have the same type) so it can be processed futher by the userwith functions JSONExtract*.

alexey-milovidov · 2023-09-07T19:57:35Z

Yes, let's enable it by default. Looks good for me.

…ce from objects

alexey-milovidov · 2023-09-20T18:19:54Z

Only one test remains...

alexey-milovidov · 2023-09-20T21:34:31Z

I've tested it on this file:
inference_test_1000.jsonl.zst.txt
and found that it works very slow (it didn't finish, still running).

Upd: finished in 128 seconds.

Avogar · 2023-09-25T15:59:20Z

Example from Alexey with new optimized implementation:

avogar-dev :) desc file(inference_test_1000.jsonl)

DESCRIBE TABLE file(inference_test_1000.jsonl)

Query id: 0519e04b-62af-464b-a876-b8b0468a0ab5

┌─name─┬─type───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ ip   │ Nullable(String)                                                                                                                                                                                                                                           │              │                    │         │                  │                │
│ data │ Tuple(http Tuple(error Nullable(String), protocol Nullable(String), result Tuple(redirect_response_chain Array(Tuple(body Nullable(String), body_sha256 Nullable(String), content_length Nullable(Int64), headers Tuple(access_control_allow_origin Array(Nullable(String)), cache_control Array(Nullable(String)), connection Array(Nullable(String)), content_language Array(Nullable(String)), content_length Array(Nullable(Int64)), content_security_policy Array(Nullable(String)), content_type Array(Nullable(String)), date Array(Nullable(String)), expires Array(Nullable(String)), last_modified Array(Nullable(String)), location Array(Nullable(String)), pragma Array(Nullable(String)), server Array(Nullable(String)), set_cookie Array(Nullable(String)), unknown Array(Tuple(key Nullable(String), value Array(Nullable(String)))), upgrade Array(Nullable(String)), vary Array(Nullable(String)), via Array(Nullable(String)), x_frame_options Array(Nullable(String)), x_powered_by Array(Nullable(String)), x_xss_protection Array(Nullable(Int64))), protocol Tuple(major Nullable(Int64), minor Nullable(Int64), name Nullable(String)), request Tuple(headers Tuple(accept Array(Nullable(String)), referer Array(Nullable(String)), user_agent Array(Nullable(String))), host Nullable(String), method Nullable(String), url Tuple(host Nullable(String), path Nullable(String), raw_query Nullable(String), scheme Nullable(String))), status_code Nullable(Int64), status_line Nullable(String), transfer_encoding Array(Nullable(String)))), response Tuple(body Nullable(String), body_sha256 Nullable(String), content_length Nullable(Int64), headers Tuple(accept_ranges Array(Nullable(String)), access_control_allow_origin Array(Nullable(String)), age Array(Nullable(Int64)), alt_svc Array(Nullable(String)), cache_control Array(Nullable(String)), connection Array(Nullable(String)), content_language Array(Nullable(String)), content_length Array(Nullable(Int64)), content_location Array(Nullable(String)), content_security_policy Array(Nullable(String)), content_type Array(Nullable(String)), date Array(Nullable(String)), etag Array(Nullable(String)), expires Array(Nullable(String)), last_modified Array(Nullable(String)), link Array(Nullable(String)), p3p Array(Nullable(String)), pragma Array(Nullable(String)), public_key_pins Array(Nullable(String)), retry_after Array(Nullable(Int64)), server Array(Nullable(String)), set_cookie Array(Nullable(String)), status Array(Nullable(String)), strict_transport_security Array(Nullable(String)), unknown Array(Tuple(key Nullable(String), value Array(Nullable(String)))), upgrade Array(Nullable(String)), vary Array(Nullable(String)), via Array(Nullable(String)), www_authenticate Array(Nullable(String)), x_content_type_options Array(Nullable(String)), x_frame_options Array(Nullable(String)), x_powered_by Array(Nullable(String)), x_ua_compatible Array(Nullable(String)), x_xss_protection Array(Nullable(String))), protocol Tuple(major Nullable(Int64), minor Nullable(Int64), name Nullable(String)), request Tuple(headers Tuple(accept Array(Nullable(String)), referer Array(Nullable(String)), user_agent Array(Nullable(String))), host Nullable(String), method Nullable(String), tls_log Tuple(handshake_log Tuple(client_finished Tuple(verify_data Nullable(String)), client_key_exchange Tuple(ecdh_params Tuple(client_private Tuple(length Nullable(Int64), value Nullable(String)), client_public Tuple(x Tuple(length Nullable(Int64), value Nullable(String)), y Tuple(length Nullable(Int64), value Nullable(String))), curve_id Tuple(id Nullable(Int64), name Nullable(String))), rsa_params Tuple(encrypted_pre_master_secret Nullable(String), length Nullable(Int64))), key_material Tuple(master_secret Tuple(length Nullable(Int64), value Nullable(String)), pre_master_secret Tuple(length Nullable(Int64), value Nullable(String))), server_certificates Tuple(certificate Tuple(parsed Tuple(extensions Tuple(authority_info_access Tuple(issuer_urls Array(Nullable(String)), ocsp_urls Array(Nullable(String))), authority_key_id Nullable(String), basic_constraints Tuple(is_ca Nullable(Bool)), certificate_policies Array(Tuple(cps Array(Nullable(String)), id Nullable(String))), crl_distribution_points Array(Nullable(String)), extended_key_usage Tuple(client_auth Nullable(Bool), server_auth Nullable(Bool)), key_usage Tuple(digital_signature Nullable(Bool), key_encipherment Nullable(Bool), value Nullable(Int64)), signed_certificate_timestamps Array(Tuple(log_id Nullable(String), signature Nullable(String), timestamp Nullable(Int64), version Nullable(Int64))), subject_alt_name Tuple(dns_names Array(Nullable(String))), subject_key_id Nullable(String)), fingerprint_md5 Nullable(String), fingerprint_sha1 Nullable(String), fingerprint_sha256 Nullable(String), issuer Tuple(common_name Array(Nullable(String)), country Array(Nullable(String)), locality Array(Nullable(String)), organization Array(Nullable(String)), organizational_unit Array(Nullable(String)), province Array(Nullable(String))), issuer_dn Nullable(String), names Array(Nullable(String)), redacted Nullable(Bool), serial_number Nullable(Float64), signature Tuple(self_signed Nullable(Bool), signature_algorithm Tuple(name Nullable(String), oid Nullable(String)), valid Nullable(Bool), value Nullable(String)), signature_algorithm Tuple(name Nullable(String), oid Nullable(String)), spki_subject_fingerprint Nullable(String), subject Tuple(common_name Array(Nullable(String)), country Array(Nullable(String)), jurisdiction_country Array(Nullable(String)), locality Array(Nullable(String)), organization Array(Nullable(String)), organizational_unit Array(Nullable(String)), province Array(Nullable(String)), serial_number Array(Nullable(String))), subject_dn Nullable(String), subject_key_info Tuple(ecdsa_public_key Tuple(b Nullable(String), curve Nullable(String), gx Nullable(String), gy Nullable(String), length Nullable(Int64), n Nullable(String), p Nullable(String), pub Nullable(String), x Nullable(String), y Nullable(String)), fingerprint_sha256 Nullable(String), key_algorithm Tuple(name Nullable(String)), rsa_public_key Tuple(exponent Nullable(Int64), length Nullable(Int64), modulus Nullable(String))), tbs_fingerprint Nullable(String), tbs_noct_fingerprint Nullable(String), validation_level Nullable(String), validity Tuple(end Nullable(String), length Nullable(Int64), start Nullable(String)), version Nullable(Int64)), raw Nullable(String)), chain Array(Tuple(parsed Tuple(extensions Tuple(authority_info_access Tuple(issuer_urls Array(Nullable(String)), ocsp_urls Array(Nullable(String))), authority_key_id Nullable(String), basic_constraints Tuple(is_ca Nullable(Bool), max_path_len Nullable(Int64)), certificate_policies Array(Tuple(cps Array(Nullable(String)), id Nullable(String))), crl_distribution_points Array(Nullable(String)), extended_key_usage Tuple(client_auth Nullable(Bool), server_auth Nullable(Bool)), key_usage Tuple(certificate_sign Nullable(Bool), crl_sign Nullable(Bool), digital_signature Nullable(Bool), value Nullable(Int64)), subject_key_id Nullable(String)), fingerprint_md5 Nullable(String), fingerprint_sha1 Nullable(String), fingerprint_sha256 Nullable(String), issuer Tuple(common_name Array(Nullable(String)), country Array(Nullable(String)), locality Array(Nullable(String)), organization Array(Nullable(String)), organizational_unit Array(Nullable(String)), province Array(Nullable(String))), issuer_dn Nullable(String), redacted Nullable(Bool), serial_number Nullable(Float64), signature Tuple(self_signed Nullable(Bool), signature_algorithm Tuple(name Nullable(String), oid Nullable(String)), valid Nullable(Bool), value Nullable(String)), signature_algorithm Tuple(name Nullable(String), oid Nullable(String)), spki_subject_fingerprint Nullable(String), subject Tuple(common_name Array(Nullable(String)), country Array(Nullable(String)), locality Array(Nullable(String)), organization Array(Nullable(String)), organizational_unit Array(Nullable(String)), province Array(Nullable(String))), subject_dn Nullable(String), subject_key_info Tuple(fingerprint_sha256 Nullable(String), key_algorithm Tuple(name Nullable(String)), rsa_public_key Tuple(exponent Nullable(Int64), length Nullable(Int64), modulus Nullable(String))), tbs_fingerprint Nullable(String), tbs_noct_fingerprint Nullable(String), validation_level Nullable(String), validity Tuple(end Nullable(String), length Nullable(Int64), start Nullable(String)), version Nullable(Int64)), raw Nullable(String))), validation Tuple(browser_error Nullable(String), browser_trusted Nullable(Bool))), server_finished Tuple(verify_data Nullable(String)), server_hello Tuple(cipher_suite Tuple(hex Nullable(String), name Nullable(String), value Nullable(Int64)), compression_method Nullable(Int64), extended_master_secret Nullable(Bool), heartbeat Nullable(Bool), ocsp_stapling Nullable(Bool), random Nullable(String), secure_renegotiation Nullable(Bool), session_id Nullable(String), ticket Nullable(Bool), version Tuple(name Nullable(String), value Nullable(Int64))), server_key_exchange Tuple(digest Nullable(String), ecdh_params Tuple(curve_id Tuple(id Nullable(Int64), name Nullable(String)), server_public Tuple(x Tuple(length Nullable(Int64), value Nullable(String)), y Tuple(length Nullable(Int64), value Nullable(String)))), signature Tuple(raw Nullable(String), signature_and_hash_type Tuple(hash_algorithm Nullable(String), signature_algorithm Nullable(String)), tls_version Tuple(name Nullable(String), value Nullable(Int64)), type Nullable(String), valid Nullable(Bool))))), url Tuple(host Nullable(String), path Nullable(String), raw_query Nullable(String), scheme Nullable(String))), status_code Nullable(Int64), status_line Nullable(String), transfer_encoding Array(Nullable(String)))), status Nullable(String), timestamp Nullable(String))) │              │                    │         │                  │                │
└──────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

2 rows in set. Elapsed: 0.114 sec.

…bject-as-tuple-inference

Avogar · 2023-09-26T11:18:40Z

AST fuzzer (ubsan) - #55016

docs/en/interfaces/formats.md

src/Formats/SchemaInferenceUtils.cpp

Avogar · 2023-09-26T11:52:56Z

src/Formats/SchemaInferenceUtils.cpp


-    /// If we have Map and Object(JSON) types, convert all Map types to Object(JSON).
-    /// If we have Map types with different value types, convert all Map types to Object(JSON)
-    void transformMapsAndObjectsToObjects(DataTypes & data_types, TypeIndexesSet & type_indexes)


I changed logic of experimental JSON type inference. Previously if setting allow_experimantal_object_type was enabled, we used JSON type for object only if we couldn't infer it as Map(String, ValueType). And I think this logic was not good, no need to use Map type at all in this case, now we always use JSON type for objects during inference if allow_experimantal_object_type is enabled.

src/DataTypes/Serializations/SerializationString.cpp

src/DataTypes/transformTypesRecursively.cpp

src/Formats/SchemaInferenceUtils.cpp

Avogar · 2023-09-26T18:27:49Z

@antonio2368 I would appreciate if you could continue reviewing this PR today/tomorrow morning, so I can fix everything and we can merge it before release

antonio2368

Comments are mostly cosmetics, I'll approve after they are applied

src/Formats/SchemaInferenceUtils.cpp

Allow to infer named Tuples from JSON objects under a setting in JSON…

ba307c7

… formats

robot-ch-test-poll2 added the pr-feature Pull request with new product feature label Sep 7, 2023

Avogar and others added 2 commits September 11, 2023 18:15

Enable setting by default, change logic for object/string/map inferen…

18a8e58

…ce from objects

Merge branch 'master' into json-object-as-tuple-inference

2c407ab

robot-clickhouse added the pr-status-⏳ label Sep 15, 2023

robot-clickhouse-ci-2 added pr-status-✅ and removed pr-status-⏳ labels Sep 15, 2023

Fix tests

2bd747d

robot-ch-test-poll3 added pr-status-⏳ and removed pr-status-✅ labels Sep 15, 2023

robot-ch-test-poll2 added pr-status-✅ and removed pr-status-⏳ labels Sep 15, 2023

robot-clickhouse-ci-2 added pr-status-⏳ and removed pr-status-✅ labels Sep 15, 2023

robot-ch-test-poll4 added pr-status-❌ and removed pr-status-⏳ labels Sep 15, 2023

Avogar and others added 2 commits September 18, 2023 17:15

Update tests

ed77893

Merge branch 'master' into json-object-as-tuple-inference

acadda1

robot-clickhouse-ci-1 added pr-status-⏳ and removed pr-status-❌ labels Sep 19, 2023

robot-ch-test-poll added pr-status-✅ and removed pr-status-⏳ labels Sep 19, 2023

Avogar marked this pull request as draft September 21, 2023 10:50

Avogar added 2 commits September 25, 2023 15:42

Better schema inference for JSON formats

42ca897

Small fixes

33a896e

Avogar changed the title ~~Allow to infer named Tuples from JSON objects under a setting in JSON formats~~ Add new features to schema inference for JSON formats Sep 25, 2023

Avogar marked this pull request as ready for review September 25, 2023 15:59

Avogar added 6 commits September 25, 2023 17:24

Merge branch 'master' of github.com:ClickHouse/ClickHouse into json-o…

9e75825

…bject-as-tuple-inference

Fix style and build

4d4e3db

Fix tests

95d50fd

Fix tests

cabb3dd

Fix style

6dcaa38

Fix test

c5f2d2a

antonio2368 self-assigned this Sep 26, 2023

Avogar commented Sep 26, 2023

View reviewed changes

docs/en/interfaces/formats.md Outdated Show resolved Hide resolved

Avogar commented Sep 26, 2023

View reviewed changes

src/Formats/SchemaInferenceUtils.cpp Outdated Show resolved Hide resolved

Avogar commented Sep 26, 2023

View reviewed changes

Avogar added 2 commits September 26, 2023 14:13

Fix comments

b6863a9

Merge branch 'master' into json-object-as-tuple-inference

bea80ab

antonio2368 reviewed Sep 26, 2023

View reviewed changes

Apply suggestions

6b3dbc4

antonio2368 reviewed Sep 27, 2023

View reviewed changes

Apply suggestions

c68e008

antonio2368 approved these changes Sep 27, 2023

View reviewed changes

robot-clickhouse-ci-1 merged commit 7a825c1 into ClickHouse:master Sep 27, 2023

yokofly mentioned this pull request Dec 18, 2025

Enable parsing of numbers as strings in JSON input formats and add regression tests timeplus-io/proton#1071

Merged

1 task

Conversation

Avogar commented Sep 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Documentation entry for user-facing changes

Uh oh!

robot-ch-test-poll2 commented Sep 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Avogar commented Sep 7, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexey-milovidov commented Sep 7, 2023

Uh oh!

alexey-milovidov commented Sep 20, 2023

Uh oh!

alexey-milovidov commented Sep 20, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Avogar commented Sep 25, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Avogar commented Sep 26, 2023

Uh oh!

Uh oh!

Uh oh!

Avogar Sep 26, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Avogar commented Sep 26, 2023

Uh oh!

antonio2368 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

10 participants

Avogar commented Sep 7, 2023 •

edited

Loading

robot-ch-test-poll2 commented Sep 7, 2023 •

edited

Loading

Avogar commented Sep 7, 2023 •

edited

Loading

alexey-milovidov commented Sep 20, 2023 •

edited

Loading

Avogar commented Sep 25, 2023 •

edited

Loading

Avogar Sep 26, 2023 •

edited

Loading