Schema inference (RFC) #14450

@alexey-milovidov

Level 1:

If the format already contains a schema (Parquet, Avro, Protobuf, Native, TSVWithNamesAndTypes), allow using it to derive:

  • the schema argument of the table functions file, hdfs, url, s3...;
  • the structure argument of the clickhouse-local tool;
  • the column definitions in a CREATE TABLE statement.

Use case: don't make the user repeat the schema when the format already carries it.
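
As a rough illustration of Level 1, here is a minimal Python sketch that pulls the column list out of a TSVWithNamesAndTypes stream (first row is names, second row is types) and renders it as a "name type, ..." string. The function names `read_names_and_types` and `to_structure` are made up for this example and are not ClickHouse APIs.

```python
import io

def read_names_and_types(stream):
    """Return [(name, type), ...] from the two header rows of a TSVWithNamesAndTypes stream."""
    names = stream.readline().rstrip("\n").split("\t")
    types = stream.readline().rstrip("\n").split("\t")
    if len(names) != len(types):
        raise ValueError("header rows have different column counts")
    return list(zip(names, types))

def to_structure(columns):
    """Render the columns as a 'name type, name type' string (hypothetical helper)."""
    return ", ".join(f"{name} {type_}" for name, type_ in columns)

sample = io.StringIO("id\tname\tscore\nUInt64\tString\tFloat64\n1\talice\t9.5\n")
columns = read_names_and_types(sample)
structure = to_structure(columns)
print(structure)  # id UInt64, name String, score Float64
```

Only the two header rows are consumed, so the data rows are still available to whoever reads the stream next.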

Level 2 (troublesome):

If the format does not contain the schema, derive a "best effort" schema from the data in the buffer.
If the format contains the names of the columns (e.g. CSVWithNames), use these names. Otherwise use _1, _2... as column names.
Various tweaks can be used to adjust the derived schema. For example: always treat values as Nullable, even if every value is present in the first chunk of data in the buffer. Or: use Float64 for numbers instead of Int64 / UInt64. Or: treat everything as String. Or: only extract data from a subpath in JSON.
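
A minimal Python sketch of what Level 2 could look like for CSV, with three of the tweaks above as flags. Everything here is an assumption made for illustration: the function and parameter names are invented, an empty field is treated as a missing value, rows are assumed to have equal length, and Python's `int` / `float` parsers are more permissive than a real implementation should be.

```python
import csv
import io

def infer_type(values, numbers_as_float=False):
    """Pick the narrowest type that fits every non-empty value of one column."""
    present = [v for v in values if v != ""]

    def all_parse(parse):
        try:
            for v in present:
                parse(v)
            return True
        except ValueError:
            return False

    if not present:
        return "String"
    if all_parse(int):
        if numbers_as_float:
            return "Float64"
        return "UInt64" if all(int(v) >= 0 for v in present) else "Int64"
    if all_parse(float):
        return "Float64"
    return "String"

def infer_schema(text, has_names=False, always_nullable=False,
                 numbers_as_float=False, all_strings=False):
    """Derive [(name, type), ...] from a chunk of CSV text (hypothetical helper)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    if has_names:
        names, rows = rows[0], rows[1:]          # e.g. CSVWithNames
    else:
        names = [f"_{i + 1}" for i in range(len(rows[0]))]
    schema = []
    for i, name in enumerate(names):
        column = [row[i] for row in rows]
        type_ = "String" if all_strings else infer_type(column, numbers_as_float)
        # Assumption: an empty field means NULL, so the column becomes Nullable.
        if always_nullable or "" in column:
            type_ = f"Nullable({type_})"
        schema.append((name, type_))
    return schema

chunk = "1,alice,9.5\n2,bob,\n-3,carol,7\n"
print(infer_schema(chunk))
# [('_1', 'Int64'), ('_2', 'String'), ('_3', 'Nullable(Float64)')]
```

The result only reflects the chunk that was inspected: a later row can still contain a value that does not fit the derived type, which is why tweaks like "always Nullable" or "everything as String" are useful.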

Use case:
See https://github.com/dinedal/textql
Provide a similar tool for those who like ClickHouse SQL capabilities.
Also look at the discussions of the huge list of similar tools:
https://news.ycombinator.com/item?id=16781294
https://news.ycombinator.com/item?id=7175830
Look at the subthread: https://news.ycombinator.com/item?id=16782355

How to implement:

I'm not sure that this will work.

If inference is requested:

  • add a constructor of the format that takes no header argument;
  • start reading data in the constructor of the format (fetch the first chunk of data into the buffer) and derive the header from that chunk.
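
The two steps above can be sketched in Python as follows. This is only an illustration of the idea, not how ClickHouse formats are written: the class `InferringTSVReader`, its `chunk_rows` parameter and the Int64-or-String rule are all hypothetical, and rows are assumed to have equal length. The point is that the constructor reads ahead, derives the header, and keeps the buffered rows so the caller still sees them.

```python
import io
import itertools

def _is_int(value):
    try:
        int(value)
        return True
    except ValueError:
        return False

class InferringTSVReader:
    """Hypothetical reader whose constructor takes no header and derives one itself."""

    def __init__(self, stream, chunk_rows=100):
        self._stream = stream
        # Fetch the first chunk eagerly and keep it, so these rows are not lost.
        self._buffered = [line.rstrip("\n").split("\t")
                          for line in itertools.islice(stream, chunk_rows)]
        self.header = self._derive_header(self._buffered)

    @staticmethod
    def _derive_header(rows):
        width = len(rows[0]) if rows else 0
        header = []
        for i in range(width):
            column = [row[i] for row in rows]
            type_ = "Int64" if all(_is_int(v) for v in column) else "String"
            header.append((f"_{i + 1}", type_))
        return header

    def __iter__(self):
        # Replay the buffered chunk first, then continue with the rest of the stream.
        yield from self._buffered
        for line in self._stream:
            yield line.rstrip("\n").split("\t")

reader = InferringTSVReader(io.StringIO("1\ta\n2\tb\nx\tc\n"), chunk_rows=2)
header = reader.header
rows = list(reader)
```

With `chunk_rows=2` the header is derived from the first two rows only, so `_1` comes out as Int64 even though the third row holds `x` in that column. Any scheme that looks at just the first chunk has this weakness.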

Allow executing CREATE TABLE t ENGINE = ... without specifying the list of columns. This will mean that the list of columns is constructed automatically in the Storage constructor. Provide a Storage constructor without the list of columns and register these constructors in StorageFactory for the storages that allow schema inference.
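
A rough Python sketch of the registration idea: a factory keeps, per engine, a constructor that takes columns and optionally a second one that does not. The names below (`register`, `create`, `FileStorage.infer`, `MemoryStorage`) are invented for the example and do not mirror the real StorageFactory interface; the "inference" is a stand-in that just reads a TSVWithNamesAndTypes header.

```python
class StorageFactory:
    """Hypothetical registry of storage constructors, with and without a column list."""

    def __init__(self):
        self._with_columns = {}
        self._without_columns = {}

    def register(self, engine, creator, inferring_creator=None):
        self._with_columns[engine] = creator
        if inferring_creator is not None:
            self._without_columns[engine] = inferring_creator

    def create(self, engine, columns=None, **args):
        if columns is not None:
            return self._with_columns[engine](columns, **args)
        if engine not in self._without_columns:
            raise ValueError(f"engine {engine} requires an explicit list of columns")
        return self._without_columns[engine](**args)

class FileStorage:
    def __init__(self, columns, data):
        self.columns = columns
        self.data = data

    @classmethod
    def infer(cls, data):
        # Stand-in for inference: take names and types from the two header rows.
        names, types = (line.split("\t") for line in data.splitlines()[:2])
        return cls(list(zip(names, types)), data)

class MemoryStorage:
    def __init__(self, columns):
        self.columns = columns

factory = StorageFactory()
factory.register("File", FileStorage, inferring_creator=FileStorage.infer)
factory.register("Memory", MemoryStorage)  # no inference available for this engine

inferred = factory.create("File", data="a\tb\nUInt8\tString\n1\tx\n")
explicit = factory.create("Memory", columns=[("x", "UInt8")])
```

An engine that registers no column-less constructor keeps today's behaviour: creating it without columns is an error.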

This will also be used to construct ReplicatedMergeTree tables when we add new replicas of an existing table.
Merge, Distributed and Buffer will also benefit from that, although the CREATE TABLE ... AS syntax is available.

Labels: comp-formats (Input/output formats: CSV/JSON/Parquet/ORC/Arrow/Protobuf/etc.), feature