Add new features to schema inference for JSON formats #54427
robot-clickhouse-ci-1 merged 18 commits into ClickHouse:master
@alexey-milovidov WDYT about enabling this new behaviour by default? I think it can cover most cases with JSON datasets. Right now, by default, we infer JSON objects as Strings (or as Maps if all values in an object have the same type), so they can be processed further by the user with the JSONExtract* functions.
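As a rough illustration of the default rule described in this comment (objects inferred as String, or as Map when all values share one type), here is a minimal C++ sketch. The function name and the string-based type representation are hypothetical simplifications, not ClickHouse's actual implementation; treating an empty object as String is also an assumption of the sketch.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: value types are plain type names, not real ClickHouse data types.
// Returns Map(String, T) when every value has the same type T, otherwise String.
std::string inferObjectTypeLegacySketch(const std::vector<std::string> & value_types)
{
    // Assumption of this sketch: an empty object has no value type to use, so fall back to String.
    if (value_types.empty())
        return "String";

    for (const auto & type : value_types)
        if (type != value_types.front())
            return "String";

    return "Map(String, " + value_types.front() + ")";
}
```

With this rule, an object such as {"a": 1, "b": 2} maps to Map(String, Int64), while {"a": 1, "b": "x"} falls back to String and has to be unpacked later with JSONExtract* functions.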
Yes, let's enable it by default. Looks good to me.
Only one test remains...
I've tested it on this file:
Upd: finished in 128 seconds.
Example from Alexey with new optimized implementation:
/// If we have Map and Object(JSON) types, convert all Map types to Object(JSON).
/// If we have Map types with different value types, convert all Map types to Object(JSON)
void transformMapsAndObjectsToObjects(DataTypes & data_types, TypeIndexesSet & type_indexes)
I changed the logic of experimental JSON type inference. Previously, if the setting allow_experimental_object_type was enabled, we used the JSON type for an object only if we couldn't infer it as Map(String, ValueType). I think this logic was not good: there is no need to use the Map type at all in this case. Now we always use the JSON type for objects during inference if allow_experimental_object_type is enabled.
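The two rules in the doc comment on transformMapsAndObjectsToObjects can be sketched in isolation as follows. This is only an illustration under stated assumptions: SimpleType is a hypothetical stand-in for ClickHouse's data type objects, the sketch omits the type_indexes argument of the real function, and it is not the code from this PR.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for ClickHouse data types.
struct SimpleType
{
    enum class Kind { Map, Object, Other };
    Kind kind;
    std::string map_value_type; // only meaningful when kind == Kind::Map
};

// Sketch of the two rules from the doc comment:
// 1. If both Map and Object(JSON) are present, convert all Maps to Object(JSON).
// 2. If Maps have different value types, convert all Maps to Object(JSON).
void transformMapsAndObjectsToObjectsSketch(std::vector<SimpleType> & types)
{
    bool have_object = false;
    bool map_value_types_differ = false;
    std::optional<std::string> first_map_value_type;

    for (const auto & type : types)
    {
        if (type.kind == SimpleType::Kind::Object)
        {
            have_object = true;
        }
        else if (type.kind == SimpleType::Kind::Map)
        {
            if (!first_map_value_type)
                first_map_value_type = type.map_value_type;
            else if (*first_map_value_type != type.map_value_type)
                map_value_types_differ = true;
        }
    }

    if (!have_object && !map_value_types_differ)
        return; // nothing to unify

    // Replace every Map with Object(JSON); other types are left untouched.
    for (auto & type : types)
        if (type.kind == SimpleType::Kind::Map)
            type = SimpleType{SimpleType::Kind::Object, ""};
}
```

Maps that all share one value type and appear without any Object(JSON) are left as they are; any mix of Map and Object(JSON), or of Maps with different value types, collapses to Object(JSON).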
@antonio2368 I would appreciate it if you could continue reviewing this PR today or tomorrow morning, so I can fix everything and we can merge it before the release.
antonio2368 left a comment
Comments are mostly cosmetic; I'll approve after they are applied.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
This PR improves schema inference from JSON formats:
- Infer named Tuples from JSON objects under the setting input_format_json_try_infer_named_tuples_from_objects in JSON formats. Previously, without the experimental JSON type, we could only infer JSON objects as Strings or Maps; now we can infer a named Tuple. The resulting Tuple type will contain all keys of the objects that were read in the data sample during schema inference. It can be useful for reading structured JSON data without sparse objects. The setting is enabled by default.
- Allow reading JSON arrays into String columns under the setting input_format_json_read_arrays_as_strings. It can help with reading arrays whose values have different types.
- Allow using type String for JSON keys with unknown types (null/[]/{}) in sample data under the setting input_format_json_infer_incomplete_types_as_strings. Now in JSON formats we can read any value into a String column, and we can avoid the error "Cannot determine type for column 'column_name' by first 25000 rows of data, most likely this column contains only Nulls or empty Arrays/Maps" during schema inference by using type String for unknown types, so the data will be read successfully.
Documentation entry for user-facing changes
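To make the named Tuple and incomplete-type behaviour in the changelog entry more concrete, here is a small C++ sketch of how keys from sampled objects could be merged into one named Tuple. Everything in it is an assumption made for illustration: the Fields alias, the "Unknown" marker, and both function names are hypothetical, and conflicts between two different known types for the same key are simply ignored (the first known type wins), which the real inference code handles differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical representation: one sampled JSON object as (key, type name) pairs.
// "Unknown" marks a value whose type cannot be determined (null, [] or {}).
using Fields = std::vector<std::pair<std::string, std::string>>;

Fields mergeSampledObjectsSketch(const std::vector<Fields> & sampled_objects, bool infer_incomplete_types_as_strings)
{
    Fields merged;
    for (const auto & object : sampled_objects)
    {
        for (const auto & field : object)
        {
            auto it = std::find_if(merged.begin(), merged.end(),
                [&field](const auto & existing) { return existing.first == field.first; });

            if (it == merged.end())
                merged.push_back(field);   // new key: the Tuple keeps every key seen in the sample
            else if (it->second == "Unknown")
                it->second = field.second; // a later row may resolve a previously unknown type
        }
    }

    for (auto & field : merged)
    {
        if (field.second != "Unknown")
            continue;
        if (!infer_incomplete_types_as_strings)
            throw std::runtime_error("Cannot determine type for key '" + field.first + "'");
        field.second = "String"; // fall back to String for incomplete types
    }
    return merged;
}

// Render the merged fields as a named Tuple type string.
std::string formatNamedTupleSketch(const Fields & fields)
{
    std::string out = "Tuple(";
    for (std::size_t i = 0; i < fields.size(); ++i)
    {
        if (i != 0)
            out += ", ";
        out += fields[i].first + " " + fields[i].second;
    }
    return out + ")";
}
```

For two sampled objects with fields (a Int64, b Unknown) and (b Unknown, c String), the sketch yields Tuple(a Int64, b String, c String) when the String fallback is enabled, and throws when it is disabled, mirroring the "Cannot determine type" error described above.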