Schema inference (RFC) #14450

### Level 1:

If the format already contains a schema (Parquet, Avro, Protobuf, Native, TSVWithNamesAndTypes), allow using it to derive:
- the schema argument of the table functions `file`, `hdfs`, `url`, `s3`...;
- the structure argument of the `clickhouse-local` tool;
- the column definitions in a `CREATE TABLE` statement.
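
To make Level 1 concrete, here is a minimal Python sketch (not ClickHouse code) of pulling the embedded schema out of a TSVWithNamesAndTypes stream, whose first row holds the column names and whose second row holds the types. The helper names `read_embedded_schema` and `to_structure` are hypothetical, and escaping inside header fields is ignored for brevity.

```python
import io


def read_embedded_schema(stream):
    """Return [(name, type), ...] from the two header rows of the stream."""
    names = stream.readline().rstrip("\n").split("\t")
    types = stream.readline().rstrip("\n").split("\t")
    if len(names) != len(types):
        raise ValueError("header rows have different column counts")
    return list(zip(names, types))


def to_structure(schema):
    """Render the schema as a 'name Type, ...' structure string."""
    return ", ".join(f"{name} {type_}" for name, type_ in schema)


# Two header rows followed by one data row.
sample = io.StringIO("id\tname\nUInt64\tString\n1\talice\n")
schema = read_embedded_schema(sample)
structure = to_structure(schema)
```

The resulting string has the shape that the structure argument expects, so the user would not need to type it by hand.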

Use case: avoid repeating the schema when the format already carries it.


### Level 2 (troublesome):

If the format does not contain a schema, derive a "best effort" schema from the data in the buffer.
If the format contains the column names (e.g. CSVWithNames), use these names. Otherwise use `_1`, `_2`... as column names.
Various tweaks can be used to adjust the derived schema. For example:
- always treat values as Nullable, even if every value is present in the first chunk of data in the buffer;
- use Float64 for numbers instead of Int64 / UInt64;
- treat everything as String;
- only extract data from a subpath in JSON.
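
A rough Python sketch of what "best effort" could mean for a CSV chunk, including three of the tweaks above. Everything here is an assumption made for illustration: the function and parameter names (`infer_schema`, `always_nullable`, `numbers_as_float`, `all_strings`) are hypothetical, an empty field is treated as a missing value, and integer range checks are skipped.

```python
import csv
import io
import re

INT_RE = re.compile(r"[+-]?\d+")


def _is_float(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def infer_type(values, numbers_as_float=False, all_strings=False):
    """Pick a type that fits every non-empty value of one column."""
    present = [v for v in values if v != ""]
    if all_strings or not present:
        return "String"
    if all(INT_RE.fullmatch(v) for v in present):
        if numbers_as_float:
            return "Float64"
        return "Int64" if any(v.startswith("-") for v in present) else "UInt64"
    if all(_is_float(v) for v in present):
        return "Float64"
    return "String"


def infer_schema(chunk, has_names=False, always_nullable=False, **tweaks):
    """Derive [(name, type), ...] from the first chunk of CSV text."""
    rows = list(csv.reader(io.StringIO(chunk)))
    if not rows:
        return []
    if has_names:
        names, rows = rows[0], rows[1:]
    else:
        names = [f"_{i}" for i in range(1, len(rows[0]) + 1)]
    schema = []
    for i, name in enumerate(names):
        # Short rows are padded with empty (missing) values.
        column = [row[i] if i < len(row) else "" for row in rows]
        type_ = infer_type(column, **tweaks)
        if always_nullable or "" in column:
            type_ = f"Nullable({type_})"
        schema.append((name, type_))
    return schema


plain = infer_schema("1,foo,2.5\n2,bar,\n")
named = infer_schema("id,delta\n1,-3\n", has_names=True)
as_float = infer_schema("id,delta\n1,-3\n", has_names=True, numbers_as_float=True)
```

The sketch also shows why this level is troublesome: a column that looks like UInt64 in the first chunk may contain a negative number, a float or a missing value later, which is exactly what the tweaks are meant to guard against.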

Use case:
See https://github.com/dinedal/textql
Provide a similar tool for those who like ClickHouse SQL capabilities.
Also see the discussions of the huge list of similar tools:
https://news.ycombinator.com/item?id=16781294
https://news.ycombinator.com/item?id=7175830
Look at the subthread: https://news.ycombinator.com/item?id=16782355


### How to implement:

I'm not sure that this will work.

If inference is requested:
- make a constructor of the format without the `header` argument;
- start reading data in the constructor of the format (fetch the first chunk of data into the buffer) and derive the header from that chunk.
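
The two steps above can be sketched in Python as follows. `RowInputFormat` is a hypothetical stand-in, not ClickHouse's actual format class: when no `header` is passed, the constructor fetches the first chunk, derives column names from it, and keeps the chunk buffered so that the rows consumed during inference are still returned to the reader. Type inference is left out to keep the sketch short.

```python
import io


class RowInputFormat:
    """Hypothetical tab-separated reader with an optional header argument."""

    def __init__(self, stream, header=None, chunk_rows=2):
        self._stream = stream
        self._buffer = []  # rows fetched while deriving the header
        if header is None:
            # Inference requested: read the first chunk and derive the header.
            self._buffer = self._read_chunk(chunk_rows)
            header = self._derive_header(self._buffer)
        self.header = header

    def _read_chunk(self, max_rows):
        rows = []
        for _ in range(max_rows):
            line = self._stream.readline()
            if not line:
                break
            rows.append(line.rstrip("\n").split("\t"))
        return rows

    @staticmethod
    def _derive_header(rows):
        width = max((len(row) for row in rows), default=0)
        return [f"_{i}" for i in range(1, width + 1)]

    def rows(self):
        # Replay the buffered chunk first so no data is lost to inference.
        yield from self._buffer
        self._buffer = []
        while True:
            chunk = self._read_chunk(1)
            if not chunk:
                return
            yield chunk[0]


inferred = RowInputFormat(io.StringIO("1\ta\n2\tb\n3\tc\n"))
explicit = RowInputFormat(io.StringIO("1\ta\n"), header=["x", "y"])
```

The open question from the text remains visible here: doing I/O inside a constructor is the part that may not work well in practice.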

Allow executing `CREATE TABLE t ENGINE = ...` without specifying the list of columns. This means that the list of columns will be constructed automatically in the Storage constructor. Provide a Storage constructor without the list of columns, and register these constructors in StorageFactory for the storages that allow schema inference.
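
One possible shape for the registration, sketched in Python rather than in ClickHouse's C++. The class below only borrows the name StorageFactory from the text; its methods, the engine names used in the example and the returned dictionaries are all hypothetical.

```python
class StorageFactory:
    """Hypothetical registry of storage constructors, keyed by engine name."""

    def __init__(self):
        self._with_columns = {}
        self._without_columns = {}

    def register(self, engine, creator, inferring_creator=None):
        self._with_columns[engine] = creator
        # Only storages that allow schema inference register the second form.
        if inferring_creator is not None:
            self._without_columns[engine] = inferring_creator

    def create(self, engine, columns=None):
        if columns is not None:
            return self._with_columns[engine](columns)
        if engine not in self._without_columns:
            raise ValueError(f"engine {engine} requires an explicit column list")
        return self._without_columns[engine]()


factory = StorageFactory()
factory.register(
    "File",
    creator=lambda columns: {"engine": "File", "columns": columns},
    # Stand-in for a constructor that would derive columns from the data.
    inferring_creator=lambda: {"engine": "File", "columns": [("_1", "String")]},
)
factory.register("Memory", creator=lambda columns: {"engine": "Memory", "columns": columns})

inferred_table = factory.create("File")
explicit_table = factory.create("Memory", columns=[("id", "UInt64")])
```

With this split, a `CREATE TABLE` without columns fails early for engines that never registered an inferring constructor, instead of failing somewhere inside the storage.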

This will also be used to construct ReplicatedMergeTree tables when we add new replicas of an existing table.
Merge, Distributed and Buffer tables will also benefit from this, although the `CREATE TABLE ... AS` syntax is available.
