Skip to content

Add new features to schema inference for JSON formats#54427

Merged
robot-clickhouse-ci-1 merged 18 commits intoClickHouse:masterfrom
Avogar:json-object-as-tuple-inference
Sep 27, 2023
Merged

Add new features to schema inference for JSON formats#54427
robot-clickhouse-ci-1 merged 18 commits intoClickHouse:masterfrom
Avogar:json-object-as-tuple-inference

Conversation

@Avogar
Copy link
Copy Markdown
Member

@Avogar Avogar commented Sep 7, 2023

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

This PRs improves schema inference from JSON formats:

  1. Now it's possible to infer named Tuples from JSON objects without experimantal JSON type under a setting input_format_json_try_infer_named_tuples_from_objects in JSON formats. Previously without experimantal type JSON we could only infer JSON objects as Strings or Maps, now we can infer named Tuple. Resulting Tuple type will conain all keys of objects that were read in data sample during schema inference. It can be useful for reading structured JSON data without sparse objects. The setting is enabled by default.
  2. Allow parsing JSON array into a column with type String under setting input_format_json_read_arrays_as_strings. It can help reading arrays with values with different types.
  3. Allow to use type String for JSON keys with unkown types (null/[]/{}) in sample data under setting input_format_json_infer_incomplete_types_as_strings. Now in JSON formats we can read any value into String column and we can avoid getting error Cannot determine type for column 'column_name' by first 25000 rows of data, most likely this column contains only Nulls or empty Arrays/Maps during schema inference by using type String for unknown types, so the data will be read successfully.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-feature Pull request with new product feature label Sep 7, 2023
@robot-ch-test-poll2
Copy link
Copy Markdown
Contributor

robot-ch-test-poll2 commented Sep 7, 2023

This is an automated comment for commit c68e008 with description of existing statuses. It's updated for the latest CI running

✅ Click here to open a full report in a separate page

Successful checks
Check nameDescriptionStatus
AST fuzzerRuns randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help✅ success
CI runningA meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR✅ success
ClickHouse build checkBuilds ClickHouse in various configurations for use in further steps. You have to fix the builds that fail. Build logs often has enough information to fix the error, but you might have to reproduce the failure locally. The cmake options can be found in the build log, grepping for cmake. Use these options and follow the general build process✅ success
Compatibility checkChecks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help✅ success
Docker image for serversThe check to build and optionally push the mentioned image to docker hub✅ success
Docs CheckBuilds and tests the documentation✅ success
Fast testNormally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here✅ success
Flaky testsChecks if new added or modified tests are flaky by running them repeatedly, in parallel, with more randomization. Functional tests are run 100 times with address sanitizer, and additional randomization of thread scheduling. Integrational tests are run up to 10 times. If at least once a new test has failed, or was too long, this check will be red. We don't allow flaky tests, read the doc✅ success
Install packagesChecks that the built packages are installable in a clear environment✅ success
Integration testsThe integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests✅ success
Mergeable CheckChecks if all other necessary checks are successful✅ success
Performance ComparisonMeasure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests✅ success
Push to DockerhubThe check for building and pushing the CI related docker images to docker hub✅ success
SQLTestThere's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS✅ success
SQLancerFuzzing tests that detect logical bugs with SQLancer tool✅ success
SqllogicRun clickhouse on the sqllogic test set against sqlite and checks that all statements are passed✅ success
Stateful testsRuns stateful functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Stateless testsRuns stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc✅ success
Stress testRuns stateless functional tests concurrently from several clients to detect concurrency-related errors✅ success
Style CheckRuns a set of checks to keep the code style clean. If some of tests failed, see the related log from the report✅ success
Unit testsRuns the unit tests for different release types✅ success
Upgrade checkRuns stress tests on server version from last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully startup without any errors, crashes or sanitizer asserts✅ success

@Avogar
Copy link
Copy Markdown
Member Author

Avogar commented Sep 7, 2023

@alexey-milovidov WDYT about enabling this new behaviour by default? I think it can cover most of the cases with json datasets.
Pros: with more or less structured json data we will infer correct Tuples with all elements in the data, so we won't need JSON type
Cons: with sparse json data resulting Tuples will contain not all elements (only elements we could read during schema inference), so we could read not all object keys and values

Right now by default we infer JSON objects as Strings (or Maps if all values in object have the same type) so it can be processed futher by the userwith functions JSONExtract*.

@alexey-milovidov
Copy link
Copy Markdown
Member

Yes, let's enable it by default. Looks good for me.

@alexey-milovidov
Copy link
Copy Markdown
Member

Only one test remains...

@alexey-milovidov
Copy link
Copy Markdown
Member

alexey-milovidov commented Sep 20, 2023

I've tested it on this file:
inference_test_1000.jsonl.zst.txt
and found that it works very slow (it didn't finish, still running).

Upd: finished in 128 seconds.

@Avogar Avogar marked this pull request as draft September 21, 2023 10:50
@Avogar Avogar changed the title Allow to infer named Tuples from JSON objects under a setting in JSON formats Add new features to schema inference for JSON formats Sep 25, 2023
@Avogar
Copy link
Copy Markdown
Member Author

Avogar commented Sep 25, 2023

Example from Alexey with new optimized implementation:

avogar-dev :) desc file(inference_test_1000.jsonl)

DESCRIBE TABLE file(inference_test_1000.jsonl)

Query id: 0519e04b-62af-464b-a876-b8b0468a0ab5

┌─name─┬─type───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
│ ip   │ Nullable(String)                                                                                                                                                                                                                                           │              │                    │         │                  │                │
│ data │ Tuple(http Tuple(error Nullable(String), protocol Nullable(String), result Tuple(redirect_response_chain Array(Tuple(body Nullable(String), body_sha256 Nullable(String), content_length Nullable(Int64), headers Tuple(access_control_allow_origin Array(Nullable(String)), cache_control Array(Nullable(String)), connection Array(Nullable(String)), content_language Array(Nullable(String)), content_length Array(Nullable(Int64)), content_security_policy Array(Nullable(String)), content_type Array(Nullable(String)), date Array(Nullable(String)), expires Array(Nullable(String)), last_modified Array(Nullable(String)), location Array(Nullable(String)), pragma Array(Nullable(String)), server Array(Nullable(String)), set_cookie Array(Nullable(String)), unknown Array(Tuple(key Nullable(String), value Array(Nullable(String)))), upgrade Array(Nullable(String)), vary Array(Nullable(String)), via Array(Nullable(String)), x_frame_options Array(Nullable(String)), x_powered_by Array(Nullable(String)), x_xss_protection Array(Nullable(Int64))), protocol Tuple(major Nullable(Int64), minor Nullable(Int64), name Nullable(String)), request Tuple(headers Tuple(accept Array(Nullable(String)), referer Array(Nullable(String)), user_agent Array(Nullable(String))), host Nullable(String), method Nullable(String), url Tuple(host Nullable(String), path Nullable(String), raw_query Nullable(String), scheme Nullable(String))), status_code Nullable(Int64), status_line Nullable(String), transfer_encoding Array(Nullable(String)))), response Tuple(body Nullable(String), body_sha256 Nullable(String), content_length Nullable(Int64), headers Tuple(accept_ranges Array(Nullable(String)), access_control_allow_origin Array(Nullable(String)), age Array(Nullable(Int64)), alt_svc Array(Nullable(String)), cache_control Array(Nullable(String)), connection Array(Nullable(String)), content_language Array(Nullable(String)), content_length Array(Nullable(Int64)), content_location Array(Nullable(String)), content_security_policy Array(Nullable(String)), content_type Array(Nullable(String)), date Array(Nullable(String)), etag Array(Nullable(String)), expires Array(Nullable(String)), last_modified Array(Nullable(String)), link Array(Nullable(String)), p3p Array(Nullable(String)), pragma Array(Nullable(String)), public_key_pins Array(Nullable(String)), retry_after Array(Nullable(Int64)), server Array(Nullable(String)), set_cookie Array(Nullable(String)), status Array(Nullable(String)), strict_transport_security Array(Nullable(String)), unknown Array(Tuple(key Nullable(String), value Array(Nullable(String)))), upgrade Array(Nullable(String)), vary Array(Nullable(String)), via Array(Nullable(String)), www_authenticate Array(Nullable(String)), x_content_type_options Array(Nullable(String)), x_frame_options Array(Nullable(String)), x_powered_by Array(Nullable(String)), x_ua_compatible Array(Nullable(String)), x_xss_protection Array(Nullable(String))), protocol Tuple(major Nullable(Int64), minor Nullable(Int64), name Nullable(String)), request Tuple(headers Tuple(accept Array(Nullable(String)), referer Array(Nullable(String)), user_agent Array(Nullable(String))), host Nullable(String), method Nullable(String), tls_log Tuple(handshake_log Tuple(client_finished Tuple(verify_data Nullable(String)), client_key_exchange Tuple(ecdh_params Tuple(client_private Tuple(length Nullable(Int64), value Nullable(String)), client_public Tuple(x Tuple(length Nullable(Int64), value Nullable(String)), y Tuple(length Nullable(Int64), value Nullable(String))), curve_id Tuple(id Nullable(Int64), name Nullable(String))), rsa_params Tuple(encrypted_pre_master_secret Nullable(String), length Nullable(Int64))), key_material Tuple(master_secret Tuple(length Nullable(Int64), value Nullable(String)), pre_master_secret Tuple(length Nullable(Int64), value Nullable(String))), server_certificates Tuple(certificate Tuple(parsed Tuple(extensions Tuple(authority_info_access Tuple(issuer_urls Array(Nullable(String)), ocsp_urls Array(Nullable(String))), authority_key_id Nullable(String), basic_constraints Tuple(is_ca Nullable(Bool)), certificate_policies Array(Tuple(cps Array(Nullable(String)), id Nullable(String))), crl_distribution_points Array(Nullable(String)), extended_key_usage Tuple(client_auth Nullable(Bool), server_auth Nullable(Bool)), key_usage Tuple(digital_signature Nullable(Bool), key_encipherment Nullable(Bool), value Nullable(Int64)), signed_certificate_timestamps Array(Tuple(log_id Nullable(String), signature Nullable(String), timestamp Nullable(Int64), version Nullable(Int64))), subject_alt_name Tuple(dns_names Array(Nullable(String))), subject_key_id Nullable(String)), fingerprint_md5 Nullable(String), fingerprint_sha1 Nullable(String), fingerprint_sha256 Nullable(String), issuer Tuple(common_name Array(Nullable(String)), country Array(Nullable(String)), locality Array(Nullable(String)), organization Array(Nullable(String)), organizational_unit Array(Nullable(String)), province Array(Nullable(String))), issuer_dn Nullable(String), names Array(Nullable(String)), redacted Nullable(Bool), serial_number Nullable(Float64), signature Tuple(self_signed Nullable(Bool), signature_algorithm Tuple(name Nullable(String), oid Nullable(String)), valid Nullable(Bool), value Nullable(String)), signature_algorithm Tuple(name Nullable(String), oid Nullable(String)), spki_subject_fingerprint Nullable(String), subject Tuple(common_name Array(Nullable(String)), country Array(Nullable(String)), jurisdiction_country Array(Nullable(String)), locality Array(Nullable(String)), organization Array(Nullable(String)), organizational_unit Array(Nullable(String)), province Array(Nullable(String)), serial_number Array(Nullable(String))), subject_dn Nullable(String), subject_key_info Tuple(ecdsa_public_key Tuple(b Nullable(String), curve Nullable(String), gx Nullable(String), gy Nullable(String), length Nullable(Int64), n Nullable(String), p Nullable(String), pub Nullable(String), x Nullable(String), y Nullable(String)), fingerprint_sha256 Nullable(String), key_algorithm Tuple(name Nullable(String)), rsa_public_key Tuple(exponent Nullable(Int64), length Nullable(Int64), modulus Nullable(String))), tbs_fingerprint Nullable(String), tbs_noct_fingerprint Nullable(String), validation_level Nullable(String), validity Tuple(end Nullable(String), length Nullable(Int64), start Nullable(String)), version Nullable(Int64)), raw Nullable(String)), chain Array(Tuple(parsed Tuple(extensions Tuple(authority_info_access Tuple(issuer_urls Array(Nullable(String)), ocsp_urls Array(Nullable(String))), authority_key_id Nullable(String), basic_constraints Tuple(is_ca Nullable(Bool), max_path_len Nullable(Int64)), certificate_policies Array(Tuple(cps Array(Nullable(String)), id Nullable(String))), crl_distribution_points Array(Nullable(String)), extended_key_usage Tuple(client_auth Nullable(Bool), server_auth Nullable(Bool)), key_usage Tuple(certificate_sign Nullable(Bool), crl_sign Nullable(Bool), digital_signature Nullable(Bool), value Nullable(Int64)), subject_key_id Nullable(String)), fingerprint_md5 Nullable(String), fingerprint_sha1 Nullable(String), fingerprint_sha256 Nullable(String), issuer Tuple(common_name Array(Nullable(String)), country Array(Nullable(String)), locality Array(Nullable(String)), organization Array(Nullable(String)), organizational_unit Array(Nullable(String)), province Array(Nullable(String))), issuer_dn Nullable(String), redacted Nullable(Bool), serial_number Nullable(Float64), signature Tuple(self_signed Nullable(Bool), signature_algorithm Tuple(name Nullable(String), oid Nullable(String)), valid Nullable(Bool), value Nullable(String)), signature_algorithm Tuple(name Nullable(String), oid Nullable(String)), spki_subject_fingerprint Nullable(String), subject Tuple(common_name Array(Nullable(String)), country Array(Nullable(String)), locality Array(Nullable(String)), organization Array(Nullable(String)), organizational_unit Array(Nullable(String)), province Array(Nullable(String))), subject_dn Nullable(String), subject_key_info Tuple(fingerprint_sha256 Nullable(String), key_algorithm Tuple(name Nullable(String)), rsa_public_key Tuple(exponent Nullable(Int64), length Nullable(Int64), modulus Nullable(String))), tbs_fingerprint Nullable(String), tbs_noct_fingerprint Nullable(String), validation_level Nullable(String), validity Tuple(end Nullable(String), length Nullable(Int64), start Nullable(String)), version Nullable(Int64)), raw Nullable(String))), validation Tuple(browser_error Nullable(String), browser_trusted Nullable(Bool))), server_finished Tuple(verify_data Nullable(String)), server_hello Tuple(cipher_suite Tuple(hex Nullable(String), name Nullable(String), value Nullable(Int64)), compression_method Nullable(Int64), extended_master_secret Nullable(Bool), heartbeat Nullable(Bool), ocsp_stapling Nullable(Bool), random Nullable(String), secure_renegotiation Nullable(Bool), session_id Nullable(String), ticket Nullable(Bool), version Tuple(name Nullable(String), value Nullable(Int64))), server_key_exchange Tuple(digest Nullable(String), ecdh_params Tuple(curve_id Tuple(id Nullable(Int64), name Nullable(String)), server_public Tuple(x Tuple(length Nullable(Int64), value Nullable(String)), y Tuple(length Nullable(Int64), value Nullable(String)))), signature Tuple(raw Nullable(String), signature_and_hash_type Tuple(hash_algorithm Nullable(String), signature_algorithm Nullable(String)), tls_version Tuple(name Nullable(String), value Nullable(Int64)), type Nullable(String), valid Nullable(Bool))))), url Tuple(host Nullable(String), path Nullable(String), raw_query Nullable(String), scheme Nullable(String))), status_code Nullable(Int64), status_line Nullable(String), transfer_encoding Array(Nullable(String)))), status Nullable(String), timestamp Nullable(String))) │              │                    │         │                  │                │
└──────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

2 rows in set. Elapsed: 0.114 sec.

@Avogar Avogar marked this pull request as ready for review September 25, 2023 15:59
@Avogar
Copy link
Copy Markdown
Member Author

Avogar commented Sep 26, 2023

AST fuzzer (ubsan) - #55016

@antonio2368 antonio2368 self-assigned this Sep 26, 2023

/// If we have Map and Object(JSON) types, convert all Map types to Object(JSON).
/// If we have Map types with different value types, convert all Map types to Object(JSON)
void transformMapsAndObjectsToObjects(DataTypes & data_types, TypeIndexesSet & type_indexes)
Copy link
Copy Markdown
Member Author

@Avogar Avogar Sep 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed logic of experimental JSON type inference. Previously if setting allow_experimantal_object_type was enabled, we used JSON type for object only if we couldn't infer it as Map(String, ValueType). And I think this logic was not good, no need to use Map type at all in this case, now we always use JSON type for objects during inference if allow_experimantal_object_type is enabled.

@Avogar
Copy link
Copy Markdown
Member Author

Avogar commented Sep 26, 2023

@antonio2368 I would appreciate if you could continue reviewing this PR today/tomorrow morning, so I can fix everything and we can merge it before release

Copy link
Copy Markdown
Member

@antonio2368 antonio2368 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments are mostly cosmetics, I'll approve after they are applied

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants