RFC: Semistructured Columns #54864

@alexey-milovidov

(This is an alternative to the experimental JSON data type, aiming to replace it and address its drawbacks).

Implementation proposal

1. Dynamic columns.

A table (IStorage) can tell that it supports dynamic columns (bool canHaveDynamicColumns()).
If this is the case, a query can reference columns or subcolumns that are not necessarily contained in the table's definition.
The table should provide a method (e.g., getColumnType) to dynamically obtain the type of a column or subcolumn (or expression?).

As a bonus and a demo, we can allow a variant of the Merge table that dynamically extends as new tables appear.
This Merge table could even have an empty table structure (no fixed columns).
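Such a dynamically extending Merge table might look like this (illustrative sketch; today's Merge engine requires a fixed column list, so the empty structure below is hypothetical syntax for this proposal):

```sql
-- Hypothetical: a Merge table with no fixed columns that picks up
-- new tables matching the pattern together with their columns.
CREATE TABLE logs_all ENGINE = Merge(currentDatabase(), '^logs_');

-- Query a column that exists only in some of the underlying tables;
-- getColumnType() would resolve its type dynamically at analysis time.
SELECT client_ip FROM logs_all;
```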

2. Variant data type and column.

A new column type works as a discriminated union of nested columns. For example, in Variant(Int8, Array(String)), every value is either Int8 or Array(String). It is serialized into multiple streams: one stream per option, plus a stream of option numbers (discriminators).

This column can be accessed in the following ways:

  • with type conversion: c::Int8: we will read every subcolumn and convert if needed.
  • with tuple-like access: c.Int8: we will read only the requested subcolumn, as Nullable(Int8) (arguably, the Nullable wrapping is unnecessary).
  • with implicit conversion: c: if the column is requested as is, we will find the least common type of the variant types (and throw an exception if it does not exist) and convert it.
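The three access modes above could look like this (a sketch of the proposed syntax, not current ClickHouse behavior):

```sql
CREATE TABLE t (c Variant(Int8, Array(String)))
ENGINE = MergeTree ORDER BY tuple();

SELECT c::Int8 FROM t;  -- convert every value to Int8, or throw
SELECT c.Int8  FROM t;  -- only the Int8 subcolumn, as Nullable(Int8);
                        -- rows holding the other option become NULL
SELECT c       FROM t;  -- there is no least common type of Int8 and
                        -- Array(String), so this query would throw
```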

3. JSON data type.

A column with JSON data type works as an opaque placeholder for dynamic subcolumns.

It is serialized into multiple streams: the first one is an index enumerating every unique path in the JSON along with its types, plus a stream for every variant type of every path.

A user can insert JSON as a String; we then convert it to JSON by deriving the data types of all paths as variants.
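For example (a sketch of the proposed behavior; the exact inferred types are illustrative):

```sql
CREATE TABLE events (data JSON)
ENGINE = MergeTree ORDER BY tuple();

-- The string is parsed, and each path accumulates a variant
-- of the types observed across inserts:
INSERT INTO events VALUES ('{"id": 1, "tags": ["a", "b"]}');
INSERT INTO events VALUES ('{"id": "abc"}');
-- data.id would now be, e.g., Variant(Int64, String);
-- data.tags would be Array(String).
```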

Different types inside variants are not converted on merges. Merges can only extend the variant data types by collecting more types inside them. The types are only converted on INSERT and SELECT.

If a table contains a JSON data type, the table should report that it supports dynamic columns. Querying any column not present in the static schema will trigger reading the "index" stream of every JSON column during query analysis. This is done lazily (on demand), but tables can subsequently cache this information in the in-memory metadata of data parts.

Distributed and Merge tables, as well as views, can trigger additional calls when asked about dynamic columns.

Column-level RBAC does not apply to subcolumns.

4. Explicit conversions.

When a user references a subcolumn from JSON with an explicit type conversion, e.g., SELECT json.CounterID::UInt32, the table will report that the column json.CounterID::UInt32 exists and has type UInt32 without looking inside the data. If a user references it at table creation, e.g., CREATE TABLE ... ORDER BY json.CounterID::UInt32, the expression should be successfully resolved.

If the column json.CounterID::UInt32 is requested, it should be propagated to reading, and the table will return the data already converted to UInt32, or throw an exception.
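Putting both cases together, the proposal implies something like the following (a sketch of the proposed behavior):

```sql
-- The sorting key references a dynamic subcolumn with an explicit type;
-- the expression resolves during DDL without reading any data:
CREATE TABLE hits (json JSON)
ENGINE = MergeTree
ORDER BY json.CounterID::UInt32;

-- On read, the table returns the subcolumn already converted to UInt32,
-- or throws if some value cannot be converted:
SELECT json.CounterID::UInt32 FROM hits;
```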

5. Limitations on the number of subcolumns.

We will not introduce additional part formats and will use the wide or compact format as usual.

But if the number of paths in JSON exceeds the limit specified in a table setting (e.g., 1000), the remaining path prefixes will be collected back into a JSON object and written as String. The paths that get dedicated streams are selected in first-come order.
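As a sketch (the setting name below is hypothetical; the proposal only says "a table setting, e.g., 1000"):

```sql
-- max_dynamic_subcolumns is an illustrative, made-up setting name:
CREATE TABLE events (data JSON)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS max_dynamic_subcolumns = 1000;
-- The first 1000 distinct paths get dedicated subcolumn streams;
-- everything beyond that is folded back into a JSON object
-- stored as String.
```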

We should implement an optimization to support vertical merge for subcolumns; most likely, it will be a special case in the code.

6. Hints.

The JSON data type can have parameters specifying hints for the data types of particular subpaths, or indicating that a particular subpath should not be written, e.g., JSON(http.version.major UInt8, SKIP body.raw, SKIP tls.handshake.random). We could also implement support for ALTERs.
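In DDL, this could look as follows (the CREATE uses the parameter syntax given above; the ALTER is an illustrative sketch of what "support for ALTERs" might mean):

```sql
-- Type hints and skipped subpaths, as given in the proposal:
CREATE TABLE flows
(
    data JSON(http.version.major UInt8, SKIP body.raw, SKIP tls.handshake.random)
)
ENGINE = MergeTree ORDER BY tuple();

-- Hypothetical ALTER changing the hints afterwards:
ALTER TABLE flows
    MODIFY COLUMN data JSON(http.version.major UInt16, SKIP body.raw);
```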

Motivation

If we implement only №1 and №2 and fail to implement the JSON data type, we will still gain a lot.

The current implementation of the JSON data type is experimental, and it is not too late to remove it. It has the following drawbacks:

  • eager creation of subcolumns: all subcolumns have to be present during query analysis, not only the queried ones, which slows down queries; dynamic subcolumns should solve this problem;
  • eager type unification: for example, if a field was an integer in JSON and you insert a single object where it is a string, the table is poisoned forever; the Variant data type and support for explicit type specification should solve this problem;
  • eager type unification leads to corner cases with empty objects, nulls, unification between arrays of different dimensions, etc.; these cases are either not supported or have bugs;
  • it is not possible to use subcolumns from JSON in table indices or to reference them in other columns;
  • if the number of unique paths in JSON grows too large, it becomes unusable;
  • it duplicates the logic of schema inference, which we could reuse instead.

See also

Amazon ION data format and PartiQL query language.
Sneller query engine and its zION data format.
