Intern tasks 2020/2021 #15065
This is the list of proposed tasks. It is to be extended. You can propose more tasks.
You can also find the previous list here: https://gist.github.com/alexey-milovidov/4251f71275f169d8fd0867e2051715e9
The tasks should be:
- not too hard (doable within about a month) but usually not less than a week;
- not altering core components of the system;
- mostly isolated, not requiring full knowledge of the system;
- somewhat interesting to implement, or containing some element of research;
- not on the critical path of our roadmap (ok to be thrown away after a year);
- most of them are for C++ developers, but there should also be tasks for frontend developers or tools/research that only require Go/Python/whatever;
- some tasks should allow teamwork;
- covering various skills, e.g. system programming, algorithm knowledge, etc.
🗑️ Advanced methods of test coverage calculation with LLVM compiler infrastructure
This topic is booked by Michael Kot @myrrc.
Cancelled.
We want to calculate test coverage for each individual test (we have about 2500 functional tests). It will let us answer questions like: which tests cover this file / part of code / function; which tests are the most relevant for the code (something like a TF*IDF metric); what code is covered by this test; what is the most relevant/specific code for this test...
The task is challenging because the default format for generated test coverage data is too large (sparse), and flushing and analyzing it for every test is too expensive. But the LLVM compiler infrastructure has tools to implement our own custom coverage (-fsanitize=coverage, -fxray-instrument).
As an extension of this task, we can also implement lightweight runtime tracing (or tracing profiler).
🚀 PostgreSQL table engine
Done
This topic is booked by Ksenia Sumarokova @kssenii.
ClickHouse can interact with and process data from various data sources via table functions and table engines. We already have a multitude of them: ODBC, JDBC, MySQL, MongoDB, URL... With the ODBC table function it's possible to talk to any ODBC-compatible DBMS, including PostgreSQL. But this is less efficient and less convenient than using the native PostgreSQL driver.
The task is to add native support for PostgreSQL. An interesting detail is adding proper support for the Array data types that PostgreSQL has. We should also implement support for PostgreSQL as a dictionary source. As an extension to this task, we can also investigate performance issues with the Poco::ODBC client library and replace it with nanodbc. If everything goes well, we can also consider implementing replication from PostgreSQL to ClickHouse, as pg2ch does.
⌛ Efficient reading of subcolumns from tables; flexible serialization formats for a single data type
Review stage, almost done.
This topic is booked by Anton Popov @CurtizJ
See #14196
🚀 Implementation of SQL/JSON in ClickHouse
Done.
Booked by @l1tsolaiki, @ltybc-coder
The modern SQL:2016 standard describes support for querying and managing JSON data with SQL. It's quite sophisticated: it includes a mini-language called JSON Path.
ClickHouse already has support for querying JSON with the simdjson library. This library supports the JSON Pointer API, but that does not match SQL/JSON. We have to parse and interpret the SQL/JSON language and map it to the simdjson API.
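To make the shape of the problem concrete, here is a toy evaluator for a tiny subset of path expressions (`$.key.key[index]`). The real SQL/JSON path language is far richer (filters, wildcards, methods); this sketch only shows the parse-then-navigate structure:

```python
import json
import re

def json_path(doc, path):
    """Evaluate a tiny subset of SQL/JSON-style paths: $.key.key[index].
    An illustrative toy, not the full SQL:2016 mini-language."""
    if not path.startswith("$"):
        raise ValueError("path must start with $")
    current = doc
    # Tokens are either ".name" or "[number]".
    for name, index in re.findall(r"\.(\w+)|\[(\d+)\]", path):
        current = current[name] if name else current[int(index)]
    return current

doc = json.loads('{"store": {"books": [{"title": "A"}, {"title": "B"}]}}')
```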
🚀 Table constraints and hypothesis on data for query optimization
Booked by Nikita Vasilev @nikvas0
Reviewed by @CurtizJ
ClickHouse has support for table constraints, e.g. URLDomain = domain(URL) or isValidUTF8(Title). Constraints are expressions that are checked on data insertion. We can also use constraints for query optimization. Example: if there is a constraint URLDomain = domain(URL) and the expression domain(URL) appears in a query, we can assume the constraint holds and replace domain(URL) with URLDomain if that is cheaper to read and calculate. Another example: simply replace isValidUTF8(Title) with 1.
We can implement support for two other notions similar to constraints: "assumptions" and "hypotheses". An "assumption" is similar to a constraint: if the user writes ASSUMPTION URLDomain = domain(URL) in the table definition, we don't check it on insert but still use it for query optimization (like a constraint). A "hypothesis" is an expression that is checked on insertion but is permitted to be false. Instead we store the result - whether the hypothesis held or not - as a very lightweight index. This index can be used for query optimization wherever the hypothesis held.
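The core of the optimization is a subtree substitution on the query AST. A minimal sketch with invented, simplified node types (the real optimizer works on ClickHouse's own AST classes):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Func:
    name: str
    args: tuple

@dataclass(frozen=True)
class Column:
    name: str

def rewrite(node, pattern, replacement):
    """Replace every subtree equal to `pattern` with `replacement`."""
    if node == pattern:
        return replacement
    if isinstance(node, Func):
        return Func(node.name, tuple(rewrite(a, pattern, replacement)
                                     for a in node.args))
    return node

# Constraint: URLDomain = domain(URL); reading the stored column is cheaper
# than computing the function.
constraint_rhs = Func("domain", (Column("URL"),))
constraint_lhs = Column("URLDomain")
query = Func("count", (constraint_rhs,))
optimized = rewrite(query, constraint_rhs, constraint_lhs)
```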
🚀 Schema inference for text formats
Done.
Booked by Igor Baliuk, @lodthe
Continued by @Avogar
Review stage.
Given the first chunk of data in TSV, CSV or JSON format, figure out which table structure (data types) is most appropriate for the data. Various tweaks and heuristics will be involved.
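A minimal sketch of the heuristic for CSV, trying progressively wider types per column (the type names and the "narrowest type that parses" strategy are just one possible approach):

```python
import csv
import io
from datetime import date

def infer_type(values):
    """Pick the narrowest type that parses all sample values (toy heuristic)."""
    def fits(parser):
        try:
            for v in values:
                parser(v)
            return True
        except ValueError:
            return False
    if fits(int):
        return "Int64"
    if fits(float):
        return "Float64"
    if fits(date.fromisoformat):
        return "Date"
    return "String"

def infer_schema(chunk):
    reader = csv.reader(io.StringIO(chunk))
    header = next(reader)
    columns = list(zip(*reader))  # transpose rows into columns
    return {name: infer_type(col) for name, col in zip(header, columns)}

schema = infer_schema("id,price,day,name\n1,2.5,2020-09-01,foo\n2,3,2020-09-02,bar\n")
```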
🚀 Advanced compression methods in ClickHouse
The task is about 50% finished.
Booked by Abi Palagashvili (MSU)
ClickHouse has support for LZ4 and ZSTD as generic compression methods. The choice of these particular methods is justified: they are pareto-optimal for compression ratio and speed among well-known libraries. Nevertheless, there exist less well-known compression libraries that can be somewhat better in certain cases. Among the potentially faster ones: Lizard, LZSSE, density. Among the stronger ones: bsc, csc, lzham. The task is to explore these libraries, integrate them into ClickHouse, and compare them on various datasets.
Extensions to this task: research zlib-compatible libraries (we have zlib-ng but its quality is unsatisfactory); add support for Content-Encoding: zstd in the HTTP interface and #8828
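The comparison harness itself is straightforward. A sketch using Python's standard-library codecs as stand-ins (Lizard, bsc, etc. are not in the stdlib; the same measurement shape would apply once they are wired in):

```python
import bz2
import lzma
import time
import zlib

# Stand-in codecs; in the real task these slots hold Lizard, LZSSE, bsc, ...
CODECS = {
    "zlib": zlib.compress,
    "lzma": lzma.compress,
    "bz2": bz2.compress,
}

def benchmark(data):
    """Measure compression ratio and wall time for each codec on one dataset."""
    results = {}
    for name, compress in CODECS.items():
        start = time.perf_counter()
        compressed = compress(data)
        results[name] = {
            "ratio": len(data) / len(compressed),
            "seconds": time.perf_counter() - start,
        }
    return results

sample = b"ClickHouse " * 10000  # highly repetitive toy dataset
report = benchmark(sample)
```

In the real comparison, the datasets should be representative column files, since codec rankings change drastically between text, numeric and time series data.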
⌛ Integration of ClickHouse with Tensorflow
Intermediate stage, the status is questionable.
Booked by Albert @Provet.
Pre-trained models can be plugged into ClickHouse and made available as functions. We have a similar feature for CatBoost.
🚀 Integration of streaming data sketches in ClickHouse
The task is about 25% finished.
Booked by Ivan Novitskiy, @RedClusive
Data sketches (also known as probabilistic data structures) are data structures that can give an approximate answer while using less memory or computation than a precise answer would require. We have already implemented the most in-demand data sketches in ClickHouse: four variants of approximate count-distinct and several variants of approximate quantiles. But there are many more unexplored, interesting data structures worth trying.
🚀 Data processing with external tools in streaming fashion
Intermediate stage, will continue.
Booked by Kilill Shyrma, @ruct
Done by @kitaisreal
The user may write a program that accepts streams of serialized data on stdin (or multiple streams on several file descriptors), processes the data and writes the serialized result to stdout. We can allow using these programs as a table function. The table function may accept several SQL queries as arguments, prepare file descriptors, connect them to the program and pipe serialized data into them. This is similar to "map" in the "mapreduce" model. It is intended for complex calculations that cannot be expressed in SQL.
There are various options for how these programs can be run: preinstalled programs available on the server (the easy part); third-party programs on blob storage (S3, HDFS) that must be run in a constrained environment (Linux namespaces, seccomp...)
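A minimal sketch of the piping mechanism, with a tiny inline Python one-liner standing in for the user's external program (the row format here is plain lines; the real feature would use a proper serialization format):

```python
import subprocess
import sys

def pipe_through(rows, argv):
    """Feed serialized rows to an external program's stdin and read the
    processed rows back from its stdout, one line per row."""
    proc = subprocess.run(
        argv,
        input="\n".join(rows).encode(),
        capture_output=True,
        check=True,
    )
    return proc.stdout.decode().splitlines()

# Stand-in "user program": upper-case every row.
out = pipe_through(
    ["hello", "world"],
    [sys.executable, "-c",
     "import sys; sys.stdout.write(sys.stdin.read().upper())"],
)
```

The streaming aspect (feeding blocks as they arrive rather than all at once) would use `subprocess.Popen` with incremental writes, but the data flow is the same.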
Caching of deserialized data in memory on MergeTree part level
Available
Implement a new caching layer. If a data part is read as a whole (all rows, but possibly a subset of columns), cache the deserialized blocks in memory. This will make the performance of MergeTree tables match that of Memory tables.
An extension to this task is to research various cache eviction algorithms.
Limited support for correlated subqueries in ClickHouse
Available
Figure out the subset of correlated subqueries that can be rewritten to JOINs and implement support for them via query rewriting at the AST level.
🚀 Implementation of subquery operators in ClickHouse
Done.
Booked by Kirill Ershov.
Continued by @kssenii
Implement INTERSECT, EXCEPT and UNION DISTINCT operators (easy part). Then implement comparison with ANY/ALL of subquery and EXISTS subquery.
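The semantics of the three "easy part" operators (with DISTINCT semantics) map directly onto set algebra, which is why they are the easy part:

```python
# Toy result sets from two subqueries.
left = [1, 2, 2, 3]
right = [2, 3, 4]

intersect = sorted(set(left) & set(right))       # INTERSECT
except_ = sorted(set(left) - set(right))         # EXCEPT
union_distinct = sorted(set(left) | set(right))  # UNION DISTINCT
```

The harder part of the task - `x > ALL (subquery)`, `x = ANY (subquery)`, `EXISTS (subquery)` - requires planning the subquery and either materializing its result or rewriting it into a join.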
🚀 Implementation of GROUPING SETS in ClickHouse
Review stage.
Booked by Maksim Sipliviy, @MaxTheHuman.
Continued by @KochetovNicolai
GROUPING SETS is the way to perform multiple different aggregations in a single pass within a single query.
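A minimal sketch of the single-pass idea: each input row contributes to every grouping set's accumulator during one scan, instead of scanning the data once per grouping (node shapes and the sum aggregate are illustrative simplifications):

```python
from collections import defaultdict

def grouping_sets(rows, sets, value):
    """Aggregate `value` over several key combinations in a single pass,
    mirroring GROUP BY GROUPING SETS ((country), (year), ())."""
    totals = defaultdict(int)
    for row in rows:              # single scan of the data
        for keys in sets:
            group = tuple((k, row[k]) for k in keys)
            totals[group] += row[value]
    return dict(totals)

rows = [
    {"country": "DE", "year": 2020, "hits": 3},
    {"country": "DE", "year": 2021, "hits": 5},
    {"country": "FR", "year": 2020, "hits": 2},
]
result = grouping_sets(rows, [("country",), ("year",), ()], "hits")
```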
🗑️ Refreshable materialized views and cron jobs in ClickHouse
Cancelled.
Booked by Artem Starantsov.
⌛ User defined data types in ClickHouse
Intermediate stage.
+ User defined functions with SQL expressions.
Booked by Andrei Staroverov @Realist007
Continued by @kitaisreal
Limited support for unique key constraint
Available
A unique key constraint ensures that there is only one row for a given user-defined unique key. A BTree + in-memory hash table + Bloom filter can be used as the data structure for deduplication. It is very difficult to implement proper support for a unique key constraint on replicated tables. But it can be implemented for non-replicated MergeTree, and for ReplicatedMergeTree in a local fashion (data is deduplicated only if inserted on the same replica) - it would have some limited use.
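A minimal sketch of the deduplication layering: a Bloom filter screens out definite misses cheaply, and an exact structure (a plain set here standing in for the BTree) confirms hits:

```python
import hashlib

class Dedup:
    """Toy key deduplicator: Bloom filter for fast negative answers,
    exact set as a stand-in for the on-disk BTree."""

    def __init__(self, bits=1 << 16, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.bloom = 0            # bit array packed into one big int
        self.seen = set()

    def _positions(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.bits

    def insert(self, key):
        """Return True if the row is new, False if it is a duplicate."""
        maybe_present = all(self.bloom >> p & 1 for p in self._positions(key))
        if maybe_present and key in self.seen:
            return False          # confirmed duplicate
        for p in self._positions(key):
            self.bloom |= 1 << p
        self.seen.add(key)
        return True
```

Most inserts of fresh keys never touch the exact structure, because the Bloom filter already answers "definitely not seen".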
🚀 YAML configuration for ClickHouse
Done.
Booked by Denis Bolonin, @BoloniniD.
Some people hate XML. Let's support YAML for configuration files, so that XML and YAML can be used interchangeably (for example, the main config can remain in XML while config.d files are provided in YAML). There should be a mapping from YAML to XML features such as attributes.
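A sketch of one possible attribute-mapping convention (keys starting with `@` become XML attributes). Since a YAML parser is not in the Python standard library, the input is assumed to be already parsed into plain dicts; the convention itself is an assumption, not ClickHouse's final one:

```python
import xml.etree.ElementTree as ET

def dict_to_xml(tag, data):
    """Map a parsed-YAML mapping to an XML element; keys starting with '@'
    become attributes (one possible convention)."""
    elem = ET.Element(tag)
    for key, value in data.items():
        if key.startswith("@"):
            elem.set(key[1:], str(value))
        elif isinstance(value, dict):
            elem.append(dict_to_xml(key, value))
        else:
            child = ET.SubElement(elem, key)
            child.text = str(value)
    return elem

# Equivalent of a small YAML snippet, already parsed into Python objects:
config = {"logger": {"@level": "trace", "log": "/var/log/clickhouse.log"}}
xml = ET.tostring(dict_to_xml("clickhouse", config), encoding="unicode")
```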
🚀 Improvements for data formats and the clickhouse-local tool
Low completion percentage, will continue.
Booked by Egor Savin @Amesaru
Done by @kssenii
Output in CapNProto format. Proper support for Arrays in the Parquet format. Allow multiple reads from stdin in clickhouse-local if stdin is seekable (#11124). Interactive mode in clickhouse-local.
⌛ Incremental data aggregation in memory
Review stage.
Booked by Arthur Petukhovsky.
Continued by @KochetovNicolai
ClickHouse already has support for incremental aggregation (see AggregatingMergeTree). We can provide an alternative that sustains a higher query rate and can be used efficiently for JOINs and dictionaries, at the price of losing persistence.
When ClickHouse executes GROUP BY, it creates a data structure in memory to hold intermediate aggregation data. This data structure lives only for the duration of the query and is destroyed when the query finishes. But we could keep the aggregation data in memory, allow incrementally feeding more data into it, and also allow querying it as a key-value table / JOINing with it / using it as a dictionary. A typical use case is an antifraud filter that needs to accumulate statistics to filter data.
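A minimal sketch of such a long-lived aggregation state, using a running average as the example aggregate (class and method names are invented for illustration):

```python
class IncrementalAvg:
    """Long-lived in-memory aggregation state: feed blocks of rows over time,
    query the running aggregate by key at any moment (no persistence)."""

    def __init__(self):
        self.state = {}  # key -> (sum, count)

    def feed(self, rows):
        """Merge a new block of (key, value) rows into the existing state."""
        for key, value in rows:
            s, c = self.state.get(key, (0, 0))
            self.state[key] = (s + value, c + 1)

    def get(self, key):
        """Key-value lookup into the current aggregate, like a dictionary."""
        s, c = self.state[key]
        return s / c

agg = IncrementalAvg()
agg.feed([("user1", 10), ("user2", 4)])
agg.feed([("user1", 20)])  # a later block updates the state, not a rebuild
```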
🚀 Natural language processing functions in ClickHouse
Done.
Booked by Nikolai Degterinskiy.
Add functions for text processing: lemmatization, stop word filtering, normalization, synonym expansion. Look at Elasticsearch and Sphinxsearch for examples.
Embedded log viewer in ClickHouse
Available
A task for a frontend developer. Create a single-page application that allows quickly navigating, searching and filtering through the ClickHouse system.text_log and system.query_log tables. The main goal is to make the interface lightweight, beautiful and neat.
🚀 Implementation of a table engine to consume application log files in ClickHouse
Done.
Booked by Flynn @ucasfl
ClickHouse has support for subscription and streaming data consumption from message queues: Kafka, RabbitMQ, and recently also from the MySQL replication log. But the simplest example of streaming data is an append-only log on the local filesystem. We don't have support for subscribing to and consuming logs from a simple append-only file (generated by some third-party application), and it's possible to implement. With this feature, ClickHouse could be used as a replacement for Logstash.
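The core consumption loop is simple: remember a byte offset into the file and read only what was appended since, keeping incomplete trailing lines for the next poll. A minimal sketch (the offset-tracking design is an assumption about how such an engine could work):

```python
def read_new_lines(path, offset):
    """Consume lines appended to a file since `offset`; return the complete
    new lines and the next offset. A partial trailing line (no newline yet)
    is left in place to be read on a later call."""
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read()
    complete = chunk[:chunk.rfind(b"\n") + 1] if b"\n" in chunk else b""
    return complete.decode().splitlines(), offset + len(complete)
```

A real table engine would additionally watch the file with inotify instead of polling, and persist the offset so consumption survives restarts.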
🚀 Collection of common system metrics in ClickHouse
Done.
Booked by Egor Levankov, Vyacheslav Lukomski.
Continued by @alexey-milovidov
ClickHouse has deep introspection capabilities: per-query and per-thread metrics, a sampling profiler, etc. But these are mostly metrics about ClickHouse itself. There's a lack of metrics about the server as a whole (e.g. total CPU load, amount of free memory, load average, network traffic...).
Usually there is no need to have these metrics in ClickHouse, because they are collected by separate monitoring agents. But there are several reasons why it's better for ClickHouse to collect metrics by itself:
- sometimes people use ClickHouse but forget to install any system monitoring at all;
- sometimes ClickHouse runs in a managed cloud service where it's not possible to install our own agents and the capabilities of the default monitoring option are insufficient;
- ClickHouse is good as a time series database: it can store metrics with superior precision, over a longer time range and with better efficiency, which opens more possibilities for in-depth analysis of metrics data.
There are some excellent examples of metrics collection software (e.g. Netdata: https://github.com/netdata/netdata). Unfortunately, most of them are GPL-licensed, which means we have to write our own metrics collection code.
🚀 Integration of S2 geometry library in ClickHouse
Done.
Booked by Andrey Che, @Andr0901
Continued by @alexey-milovidov
S2 is a library for geospatial data processing with space-filling curves. ClickHouse already has support for another library with similar concept (H3 from Uber - hierarchical coordinate system with hexagons).
The choice between these libraries is usually motivated by which one is already used inside a company. That means we don't get to make that choice in ClickHouse, and it's better to support both.
SQL functions for compatibility with MySQL dialect
The task is about 50% finished.
Booked by Daniil Kondratyev, @dankondr
ClickHouse has a very rich set of functions available out of the box, mostly superior to what you will find in other DBMSs. Our functions are more performant, more consistent in behaviour, and usually have better naming and usability.
We also have a practice of adding compatibility aliases for functions from other DBMSs, so that the functions are available under their foreign names. It is possible to have compatibility aliases for almost every MySQL function.
🗑️ Data formats for fast import of nested JSON and XML
Initial stage.
Booked by Sergey Polukhin @SDPolukhin
We already have support for importing data in the JSONEachRow format (flat JSON, a separate object for every row, a.k.a. JSON Lines). But when the JSON contains deeply nested fields and we want to map a subset of them to a table, the import becomes cumbersome. Also, we don't have any means for XML import.
Example of complex nested JSON: https://www.gharchive.org/
Example of complex nested XML: https://meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia
The proposal is to add a format that will allow the user to specify:
- which object (or path in JSON/XML) we should treat as a record;
- which paths within this object map to which fields of the table.
When multiple elements are matched, we can map them to Array in ClickHouse.
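The proposed format description boils down to two path specifications: one selecting record objects, one mapping paths inside each record to columns. A minimal JSON-only sketch (path syntax here is just a list of keys; the real format would need a proper path language and the Array mapping):

```python
import json

def extract_rows(doc, record_path, field_paths):
    """Treat each object at `record_path` as a row; map paths inside it
    to named columns."""
    def walk(node, parts):
        for part in parts:
            node = node[part]
        return node
    records = walk(doc, record_path)
    return [
        {column: walk(rec, path) for column, path in field_paths.items()}
        for rec in records
    ]

# Shape loosely inspired by the GitHub archive events mentioned above.
doc = json.loads("""
{"payload": {"commits": [
    {"author": {"name": "alice"}, "sha": "abc"},
    {"author": {"name": "bob"},   "sha": "def"}
]}}
""")
rows = extract_rows(doc, ["payload", "commits"],
                    {"author": ["author", "name"], "sha": ["sha"]})
```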
🚀 Efficient text classification in ClickHouse
Review stage.
Booked by Sergey Katkovskiy @s-kat
Add functions for text classification. They can use bag-of-words / character n-gram / word shingle models. Simple Bayes models can be used. The data for the models can be provided as static data files.
Example applications:
- detect charset;
- detect language;
- roughly categorize topic.
The main challenge is to make the classification functions as efficient as possible, so they are applicable to massive datasets for on-the-fly processing (ClickHouse style).
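A minimal sketch of the character n-gram approach for language detection, with tiny invented profiles standing in for models loaded from static data files (a real implementation would use precomputed frequency tables and vectorized scoring):

```python
from collections import Counter

def ngrams(text, n=3):
    """Character n-gram counts of a text (lower-cased)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def similarity(profile, text):
    """Share of the text's character n-grams that appear in the profile."""
    grams = ngrams(text)
    hits = sum(count for gram, count in grams.items() if gram in profile)
    return hits / max(sum(grams.values()), 1)

def classify(profiles, text):
    return max(profiles, key=lambda lang: similarity(profiles[lang], text))

# Hypothetical per-language profiles, built offline from static data files:
profiles = {
    "en": ngrams("the quick brown fox jumps over the lazy dog the end"),
    "de": ngrams("der schnelle braune fuchs springt ueber den faulen hund"),
}
lang = classify(profiles, "the fox and the dog")
```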
🚀 Reducing tail latency for distributed queries in ClickHouse
Done.
Booked by @Avogar
⌛ Minimal support for transactions in MergeTree tables
Intermediate stage.
Booked by @tavplubix
🚀 Data encryption on VFS level
Review stage.
Booked by Alexandra Latysheva @alexelex
NEAR modifier for GROUP BY
Unknown status.
Booked by Philipp Malkovsky, @malkfilipp (MIPT)
It allows aggregating data not by exact key values but by clusters of values near each other. Clusters are formed dynamically during aggregation.
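One way the dynamic clustering could work, sketched for a single numeric key with a greedy rule (this rule is an assumption; the actual NEAR semantics would need to be designed as part of the task):

```python
def near_group(values, eps):
    """Greedy 1-D clustering: a value joins the current cluster while it is
    within `eps` of the cluster's last member; otherwise a new cluster starts."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= eps:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

groups = near_group([1, 2, 10, 11, 30], eps=3)
```

Note that the greedy rule makes results order-independent only after sorting, which is itself a design decision a real GROUP BY NEAR would have to settle.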
Improvements of aggregate functions and array functions in ClickHouse
Available
Booked by Alexander Mustafin.
🚀 Porting ClickHouse SIMD optimizations to ARM NEON
Done in a later year.
🚀 Specialized precompression codecs in ClickHouse
Moved to autumn.
Booked by Aspandiyar.
Implemented after a year by another person.
🚀 Integration of SQLite as database engine and data format
Booked by Arslan Gumerov.
Done by @kssenii
⌛ Implementation of query cache for result datasets
In progress.
🚀 Support for INFORMATION SCHEMA in ClickHouse
Booked by Damir Petrov.
Done by @tavplubix
Application for GitHub with messenger interface
Available
Make a desktop application similar to Telegram Desktop that represents all issues and pull requests from GitHub repositories as chats, sorts them by update time, maintains unread counts and highlights when the user is mentioned. This application is intended to allow answering questions very quickly without opening web pages in a browser (which often takes multiple seconds).
The app should work on Linux. C++ and Qt can be used for the implementation; alternatively, any other technology can be used (Flutter, Electron, ...). The main requirements are: low resource consumption, low input latency, quick startup, surprise-free behaviour.