Roadmap 2021 (discussion)

This is the ClickHouse roadmap for 2021.
Descriptions and links are still to be filled in.
It will be published in the documentation in December.

# Main tasks

### ✔️ Provide alternative for ZooKeeper

Implementation of a server with the ZooKeeper interface inside ClickHouse.

Done, @alesapin 

#15090 #16877 #19580 #20585 #21425 #21677 #21593 #21690 #22274 #26150 #28981 #31150 #30880 #30678 #30372 #30170 #29417 #29367 #29268 #29223 #29071 #29030 #28526 #28519 #28360 #28152 #28190 #28197 #28143 #28080 #27818 #27125 #26874 #25428 #25421 #24533 #24499 #24448 #24412 #24059 #24017 #23077 #23038 #22992 #22743 #22707 #22470 #22373

### ✔️ Nested and semistructured data

In progress, @CurtizJ

Reading of subcolumns from tables. Nested type with arbitrary nesting level. Unify Nested and named tuples. Better support for nested and named tuples in syntax. Naturally map Nested to JSON format. Map datatype. Move ser/de methods from DataType to Column. Allow different column representations to store the same DataType. ColumnSparse. Codec inference from data. Dynamic columns in tables.

#21562
#21157
#21699
#14196
#17310
#14963
#15806
#1841

### Limited support for transactions

Atomic inserts into table and all dependent materialized views. Atomic inserts of more than one block.
Acquire a snapshot to use in multiple SELECT queries.
In progress, @tavplubix

#22086

### ✔️ Backups

#13953
@vitlibar
#21945

### ✔️ Hedged requests

#19291
Done, @Avogar
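
A minimal sketch of the general hedged-request technique, not the ClickHouse implementation: the query goes to one replica, and if no answer arrives within a delay, it is also sent to the next replica; the first answer wins. The names `hedged_request`, `replicas` and `hedge_delay` are hypothetical, and Python is used only for illustration.

```python
import asyncio


async def hedged_request(replicas, query, hedge_delay):
    """Return the first answer from `replicas` (hypothetical async callables).

    A new replica is tried each time `hedge_delay` seconds pass without
    any answer; requests that are already running keep running.
    """
    if not replicas:
        raise ValueError("at least one replica is required")
    remaining = list(replicas)
    pending = set()
    try:
        while True:
            if remaining:
                # Start the query on the next replica.
                pending.add(asyncio.create_task(remaining.pop(0)(query)))
            # Hedge only while there are replicas left to try.
            timeout = hedge_delay if remaining else None
            done, pending = await asyncio.wait(
                pending, timeout=timeout, return_when=asyncio.FIRST_COMPLETED
            )
            if done:
                # The first finished request wins; its error, if any, propagates.
                return done.pop().result()
    finally:
        # Cancel the requests that are still in flight.
        for task in pending:
            task.cancel()
```

The sketch trades extra load on replicas for lower tail latency; a real implementation would also need error handling and limits on the number of parallel requests.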

### ✔️ Window functions

Experimental support, @akuzm
#18222
#18455
#19022
#19299
#19921
#19951
#20041
#20060
#20111
#20284
#20293
#20337
#21895

### ✔️ Separation of storage and compute

✔️ Object storage for Replicated tables: #16240
✔️ Support for partitions in file-like engines
✔️ Distributed INSERT and SELECT over file-like engines, @nikitamikhaylov #22012
✔️ Remove ugliness and general inefficiencies from reading from remote storage.
Remote filesystem over ClickHouse server
✔️ Distributed SELECT over MergeTree on shared filesystem, @nikitamikhaylov #29279

### ✔️ Short-circuit evaluation

Done, @Avogar 
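
To illustrate the idea on a column of values, here is a rough sketch (hypothetical helper names, Python for illustration only, not the ClickHouse code). Eager evaluation computes both branches of a conditional for every row; short-circuit evaluation computes a branch only for the rows that select it, so a branch that would fail on some rows is never run there.

```python
def if_eager(rows, cond, then_fn, else_fn):
    """Evaluate both branches on every row, then pick one value per row."""
    then_col = [then_fn(x) for x in rows]
    else_col = [else_fn(x) for x in rows]
    return [t if cond(x) else e for x, t, e in zip(rows, then_col, else_col)]


def if_short_circuit(rows, cond, then_fn, else_fn):
    """Evaluate the condition first, then each branch only where it is needed."""
    mask = [cond(x) for x in rows]
    return [then_fn(x) if m else else_fn(x) for x, m in zip(rows, mask)]
```

With `rows = [0, 1, 2, 4]`, the condition `x != 0` and the branch `8 // x`, the eager variant raises `ZeroDivisionError` on the first row, while the short-circuit variant does not evaluate the division there.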

### ✔️ Projections

Experimental stage.
#20202
@amosbird

### ALTER PRIMARY KEY

In progress, @amosbird 

### ✔️ Lightweight DELETE/UPDATE

#24755 

### ✔️ Workload management

Add async method for processors. Shared thread pool for all queries.
@kochetovnicolai

### ✔️ User-Defined Functions

Done, @kitaisreal 
SQL UDFs - done!
Executable UDFs - done!

### Simplify replication

### JOIN improvements

#18672
@vdimir 

### Embedded documentation

In progress, @FArthur-cmd 

### Pluggable auth with tokens


# Experimental and intern tasks

### :wastebasket: Calculation of test coverage on a per-query basis

Dropped.
@myrrc
#20539

### Limited support for correlated subqueries

Postponed.

### ✔️ PostgreSQL table engine.

Done.
@kssenii
#18554

### ✔️ Streaming replication from PostgreSQL.

#20470, Done.
@kssenii

### ✔️ Implement SQL/JSON standard.

Done, #24148

### ✔️ Table constraints and hypothesis on data for query optimization

Done, #18787, #31476
@nikvas0 

### ✔️ Schema inference for text formats

Done, @Avogar 

### :wastebasket: Advanced compression methods

Cancelled.

### :wastebasket: Integration of ClickHouse with Tensorflow

Cancelled.

### ✔️ Integration of more streaming data sketches in ClickHouse

Two new sketches are added.

### ✔️ Data processing with external tools in streaming fashion aka ClickHouse MR

Done @kitaisreal 

### :wastebasket: Caching of deserialized data in memory on MergeTree part level

Cancelled.

### ✔️ Subquery operators: INTERSECT/EXCEPT, ANY/ALL/EXISTS.

Done.

### ✔️ Implementation of GROUPING SETS.

In progress.

### ✔️ Refreshable materialized views and cron jobs.

In progress.

### User-defined data types

In progress.

### Limited support for unique key constraints.

### ✔️ YAML configuration

Done, scheduled for release in 21.7.
#21858
@BoloniniD

### Incremental data aggregation in memory

In progress.

### ✔️ Natural language processing functions

Done, @evillique.

### ✔️ Implementation of a table engine to consume application log files

Done, @ucasfl, @kssenii

### ✔️ Collection of common system metrics

Done, @alexey-milovidov

### ✔️ Integration of S2 geometry library

Done.

### SQL functions for compatibility with MySQL

A few functions were added. Review stage.

### Data formats for fast import of nested JSON and XML

In progress.

### ✔️ Text classification

Done, @evillique

### ✔️ Data encryption at rest

Done, @alexelex, @vitlibar

### :wastebasket: NEAR modifier for GROUP BY

Cancelled.

### :wastebasket: Specialized precompression codecs

Moved to 2022.

### ✔️ Integration of SQLite as database engine and data format

Done.

### ✔️ Query cache for result datasets

Postponed.

### ✔️ Support for INFORMATION SCHEMA

Done by @tavplubix 

### :wastebasket: Arrow Flight interface

Cancelled.

### ✔️ Functions and data types for geospatial data

Experimental stage.

### ✔️ User-Agent parsing functions

#21694

### Integrate novel optimization for GROUP BY

### ✔️ Descriptive analysis of datasets

Done.

### :wastebasket: Learning of vector embeddings for table rows

Cancelled.

### :wastebasket: Userspace RAID

Postponed.

### ✔️ VFS over HDFS

#11058

### :wastebasket: Etcd instead of ZooKeeper

#17495 Cancelled.

### :wastebasket: GPU accelerated aggregate functions

nVidia
Cancelled.

### ✔️ Rewrite type inference and identifiers analysis

E.g. a way to analyze this query
```
WITH b + 1 AS c
SELECT a AS b, *, t.*, n.b, a -> a = b + 1 AS func, arrayMap(func, n.c)
FROM mysql(...) RIGHT JOIN (SELECT ...) t ARRAY JOIN nest AS n
```
in a generic, not ad-hoc fashion.

In progress, @kitaisreal 


# Tech debt and small tasks

### ✔️ Fix low performance of encrypt/decrypt functions

Done. @alexey-milovidov 

### ✔️ Fix the remaining issues with in-memory parts and WAL

@CurtizJ 
We removed in-memory parts and WAL.

### ✔️ Continue to support play.clickhouse.com

There is no source code, the version of ClickHouse is too old, and there are multiple bugs.
The alternative is to remove it completely. @qoega

### ✔️ Fix issues with Postgres via ODBC

Done @kssenii 

### ✔️ User roles from LDAP

Done. @traceon, @vitlibar

### ✔️ Remove DataStreams

Done, @kochetovnicolai

### :wastebasket: Incremental data clustering

Cancelled, @kochetovnicolai

### ✔️ Min-hash, Sim-hash support

Done. @kochetovnicolai, @alexey-milovidov 
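
For context, a minimal sketch of the general min-hash technique (not the ClickHouse functions; all names below are hypothetical, Python for illustration only). Each set gets a signature made of the minimum value of several seeded hash functions, and the share of matching positions between two signatures estimates the Jaccard similarity of the sets.

```python
import hashlib


def _seeded_hash(seed, item):
    """64-bit hash of `item` under the hash function number `seed`."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=64):
    """One minimum per seeded hash function; `items` must be non-empty."""
    unique = set(items)
    return [
        min(_seeded_hash(seed, item) for item in unique)
        for seed in range(num_hashes)
    ]


def estimate_jaccard(sig_a, sig_b):
    """Share of positions where two equally long signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

More hash functions give a less noisy estimate at the cost of a longer signature.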

### ✔️ Enable compile_expressions by default

Done. @kitaisreal 

### ✔️ Z-order indexing

In creeping progress.
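
As background, a Z-order (Morton) curve maps a multi-dimensional point to a single integer by interleaving the bits of its coordinates, so points that are close in every dimension tend to get close codes. A minimal two-dimensional sketch, with the hypothetical name `z_order` and Python used only for illustration:

```python
def z_order(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Morton code.

    Bits of x go to even positions, bits of y to odd positions.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code
```

Sorting rows by such a code keeps ranges over both `x` and `y` reasonably clustered, which is what makes it usable as a sort key for multi-column filters.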

### ✔️ Low performance of ser/de functions of DataType

Caused by the introduction of "Data type domains".

### ✔️ Library dictionary bridge

Done, @kssenii 
#21509

### ✔️ Versioning of aggregate function states

Done.
@kssenii 

### ✔️ Type conversions for IN, JOIN

Done. @vdimir 
#16724
#18672

### ✔️ Support for all types in CASE operator with values

### ✔️ Extended range for DateTime64

Done, @Enmk, @alexey-milovidov 
#9404

### ✔️ Improve logic of priorities of background merges

@nikitamikhaylov #22381
Done.

### ✔️ Better criteria for Too Many Parts

### ✔️ Speed-up ODBC table engine

Done, @kssenii 

### ✔️ Replace OpenSSL with BoringSSL

Done. @alexey-milovidov 
#16043 
#18129 

### Enable pk-aware GROUP BY by default

#19401

### ✔️ Deduplication for non-replicated MergeTree on block level

Done, @yuzhichang, @alesapin: #8467
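
The general idea of block-level deduplication can be sketched as follows (an assumption-laden illustration, not the MergeTree code; `BlockDedupLog` and its `window` parameter are hypothetical names). The table remembers hashes of the most recently inserted blocks and skips an insert whose block hash is already in that window, so a retried insert of the same block is applied once.

```python
import hashlib
from collections import OrderedDict


class BlockDedupLog:
    """Remembers hashes of the last `window` inserted blocks."""

    def __init__(self, window):
        self.window = window
        self.seen = OrderedDict()  # block hash -> None, oldest first

    def insert(self, block):
        """Return True if the block is new, False if it is a duplicate."""
        digest = hashlib.sha256(repr(block).encode()).hexdigest()
        if digest in self.seen:
            return False  # same block was inserted recently: skip it
        self.seen[digest] = None
        if len(self.seen) > self.window:
            self.seen.popitem(last=False)  # forget the oldest hash
        return True
```

Because only a bounded window of hashes is kept, a block that repeats after it has left the window is inserted again.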

### ✔️ Pre-configured named connections in config

To avoid specifying user/password for external storages.
Done, @kssenii 


# Testing improvements

### ✔️ Automated tests for AArch64 builds

#15174
#22534
#22580
#22582
#22590
#22595
#22596
#22632

### ✔️ Add Query Fuzzer for Stress Tests

Done.

### ✔️ Add Thread Fuzzer for flaky tests checking

Done, #18299

### Import obfuscated queries from Yandex.Metrica production

#29672

### Fuzzing of cluster configurations

### Fuzzing of ClickHouse versions for tests with distributed queries for compatibility

### ✔️ Integrate SQLancer

Done, @qoega 
#19006
#19077

But it is abandoned and does not work anymore.

### Integrate SQLLogicTest
#15112
#18706
#18701
#18707

### ✔️ More intense fuzzing of newly added tests

Done, @alexey-milovidov 
#18916

### :wastebasket: Network replay server

Moved to next year.

### ✔️ Add PowerPC cross-builds
#25486
#30010

### ✔️ Add Darwin/AArch64 cross-builds

### ✔️ Ensure that no source files from OS are used during build
#18915
#29974
#30011

Roadmap 2021 (discussion) #17623
