
Intern Tasks 2024/2025 #71175

@alexey-milovidov


This is the list of proposed tasks. It is to be extended. You can propose more tasks.
You can also find the previous lists here:

2023/2024: #58394
2022/2023: #42194
2021/2022: #29601
2020/2021: #15065
2019/2020: https://gist.github.com/alexey-milovidov/4251f71275f169d8fd0867e2051715e9
2018/2019: https://gist.github.com/alexey-milovidov/6735d193762cab1ad3b6e6af643e3a43
2017/2018: https://gist.github.com/alexey-milovidov/26cc3862eb87e52869b9dac64ab99156

The tasks should be:

  • not too hard (doable within about a month), but usually not less than a week of work;
  • not altering core components of the system;
  • mostly isolated, not requiring full knowledge of the system;
  • somewhat interesting to implement, or having some research angle;
  • not on the critical path of our roadmap (ok to be thrown away after a year);
  • mostly for C++ developers, but there should also be tasks for frontend developers, or tools/research tasks that only require Go/Python/whatever;
  • some should allow teamwork;
  • covering various skills, e.g., system programming, algorithm knowledge, etc.

Instrumentation with LLVM XRay

When the code is compiled with LLVM XRay, the compiler adds space (filled with nop instructions) at the entry and exit of each large function, which can be patched with callbacks at runtime. This has low runtime overhead and can be used for production builds, so they can be traced when needed. LLVM already provides various callbacks for global application tracing, but we want to integrate these capabilities into ClickHouse, e.g., to enable and disable tracing with SYSTEM queries, to dynamically add logs or failpoints (e.g., sleep for a certain time in a certain function), to write tracing information into system tables, to support per-query tracing and the augmentation of traces with information about the query, to implement code coverage in production, etc.

This topic will be interesting for people who like low-level programming and compilers.

Issue with more details: #74249

Taken by: @htuall

Probabilistic data structures for filtering

The idea of this task is to provide a probabilistic alternative for the IN (subquery) operator using bloom filters, counting bloom filters (to check for elements that likely appeared multiple times), cuckoo filters, quotient filters, and vacuum filters, and to compare all these algorithms.

The applications are cohort analysis and antifraud.
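
As a reference point, the simplest of these structures can be sketched in a few lines. A minimal Bloom filter in Python (illustrative only, not the ClickHouse implementation; parameters and names are made up for the sketch):

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: k hash probes into an m-bit array.
    Membership tests may yield false positives but never false negatives."""
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8 + 1)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

For an approximate IN (subquery), the right-hand side would be inserted into the filter and the left-hand side probed against it; the other filters in the list trade space, precision, and deletion support differently.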

Taken by: @afigor2701

PR: #79648

Drivers for User-Defined Functions

Description: #71172

PR: #77128

Taken by: @dahbka-lis

Vortex

Implement support for Vortex as an input and output format.

Contact: @rschu1ze

Assigned to @NicholasTerek

WebAssembly and JS isolates for UDFs

We will try to integrate a JavaScript interpreter to use v8 isolates directly in ClickHouse.

PR: #88747

Taken by: @vdimir and @lioshik

Git as a database engine

ClickHouse has a notion of database engines - they are responsible for enumerating the tables and managing their creation, deletion and metadata changes.

Let's implement a database engine that connects to a Git repository and interprets a list of .sql files as the list of tables. Changes to the database will be derived from the git history.

Taken by: @Ilukatsk

Integration with AWS SQS and AWS Kinesis

Subscription to streaming data sources similar to the existing engines for Kafka and RabbitMQ.

Taken by: @kazalika

Implementation of Arrow Flight SQL and ADBC

#7554

Taken by: @zakr600

Fuzzing ClickHouse

Explore new and creative ways for fuzz testing, such as:

  • custom data mutators in coverage-guided fuzzers
  • structure-aware fuzzing
  • LLM-assisted mutators
  • exploring unusual entrypoints for fuzzing (such as file and network related system calls)

Keeper on shared storage

Allow using high-latency storage with Keeper by batching requests for hundreds of milliseconds. This allows running it on stateless machines, which can be created and destroyed on demand.

#55329

Remote database and table engine

#59304

Improvements for the URL engine

#45994

Taken by: @yistarostin

Functions for quantizations

Implement functions for packing and unpacking vectors (arrays of floating-point numbers) into/from FixedStrings with quantized (coarsened) elements, such as 16-, 8-, and 4-bit floats, switched float representations (where the number of exponent and mantissa bits is determined by another bit), integers, and bits. Implement functions to calculate (approximate) vector distances (e.g., L2/cosine distance) on top of quantized FixedStrings.

These functions are needed for search in vector embeddings.
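
To illustrate the shape of the task, here is a sketch of symmetric linear 8-bit quantization: pack a float vector as a per-vector scale plus one signed byte per element, and compute an approximate L2 distance on the packed form. All names are illustrative, not the proposed ClickHouse API:

```python
import struct

def quantize_int8(values):
    """Pack a float vector into bytes: a float32 scale followed by one
    signed byte per element (symmetric linear quantization sketch)."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    return struct.pack("<f", scale) + bytes(round(v / scale) & 0xFF for v in values)

def dequantize_int8(packed):
    scale = struct.unpack_from("<f", packed)[0]
    return [scale * (b - 256 if b > 127 else b) for b in packed[4:]]

def l2_quantized(a, b):
    """Approximate L2 distance computed on the dequantized representations."""
    va, vb = dequantize_int8(a), dequantize_int8(b)
    return sum((x - y) ** 2 for x, y in zip(va, vb)) ** 0.5
```

The real task would operate on FixedStrings and vectorized code paths, and would cover the other bit widths and float layouts listed above.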

Contact: @rschu1ze

Topic taken by: @nikita4109

PR:

Cuckoo-filter skip indexes

The skip indexes in ClickHouse are based on min-max statistics, bloom filters, or unique sets. Cuckoo filters provide an interesting trade-off: they are more compact and more precise than ordinary bloom filter indexes. The task is to implement a cuckoo-filter skip index based on this paper.
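
For orientation, the core of a cuckoo filter (partial-key cuckoo hashing with small fingerprints and two candidate buckets, as in the Fan et al. design) can be sketched as follows. This is illustrative only; parameters, hashing, and names are made up for the sketch:

```python
import hashlib, random

class CuckooFilter:
    """Minimal cuckoo filter: 1-byte fingerprints, two candidate buckets,
    eviction with a bounded number of kicks."""
    def __init__(self, n_buckets=256, bucket_size=4, max_kicks=64):
        self.n, self.b, self.max_kicks = n_buckets, bucket_size, max_kicks
        self.buckets = [[] for _ in range(n_buckets)]

    def _fp_and_index(self, item):
        h = hashlib.sha256(str(item).encode()).digest()
        fp = h[0] or 1                                  # non-zero fingerprint
        return fp, int.from_bytes(h[1:5], "big") % self.n

    def _alt_index(self, i, fp):
        # partial-key cuckoo hashing: i2 = i1 XOR hash(fp); involutive when
        # n_buckets is a power of two, so each fp has exactly two homes
        h = hashlib.sha256(bytes([fp])).digest()
        return (i ^ int.from_bytes(h[:4], "big")) % self.n

    def add(self, item):
        fp, i = self._fp_and_index(item)
        for idx in (i, self._alt_index(i, fp)):
            if len(self.buckets[idx]) < self.b:
                self.buckets[idx].append(fp)
                return True
        idx = i
        for _ in range(self.max_kicks):                 # evict a random resident
            j = random.randrange(self.b)
            fp, self.buckets[idx][j] = self.buckets[idx][j], fp
            idx = self._alt_index(idx, fp)
            if len(self.buckets[idx]) < self.b:
                self.buckets[idx].append(fp)
                return True
        return False                                    # table considered full

    def might_contain(self, item):
        fp, i = self._fp_and_index(item)
        return fp in self.buckets[i] or fp in self.buckets[self._alt_index(i, fp)]
```

Unlike bloom filters, this layout also supports deletion (remove one copy of the fingerprint from either bucket), which is part of what makes it attractive for skip indexes.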

Contact: @rschu1ze

Taken by @ksuleimanova

Support for graphics rendering as output formats

#57834

Assigned to: @m7kss1

Improvements for table functions and federated queries

Allow a table function, such as mysql/postgres to include an explicit query for the external data source.
Extend push-down capabilities to certain query clauses such as LIMIT.

#46758

Taken by: @the-devid

Decoupling memory buffers and columns

Description: #40222

This task will be interesting for those who like low-level programming.

Taken by: @georgyshamteev

Framing formats for query results

This task is to solve two problems:

  • how to send a query result intermixed with progress information, query metrics, and similar data;
  • how to send multiple result sets in a single response.

The idea is to introduce "framing formats" which encapsulate various chunks of responses.
One of them will be application/event-stream for the HTTP interface.

Taken by: @h0tmi @ilyaleshchik

Improvements for clickhouse client

#65388
#65252
#30274
#70323
#65549
#63093
#47800
and various other improvements.

Taken by: @aaaengel

Integration of random forest learning

Implement regression and classification models directly in ClickHouse using a table engine, so that when you insert into a table, it learns the model. It will require integration of a library like LightGBM, XGBoost, Catboost, or an algorithm behind these libraries, directly into ClickHouse.

Taken by: @HazyMrf

Proxy mode for query routing

Implement a tool named clickhouse-proxy, which connects to other ClickHouse servers and routes queries and responses according to various criteria, such as the user name, the database name, the host name from the Host field or the TLS SNI field, or even query patterns. It should be capable of routing all supported ClickHouse protocols in a resource-constrained fashion.

Taken by: @X-OrBit

ClickHouse in WebAssembly

Compile ClickHouse to run inside a web browser and make sure something meaningful works.

Streaming inserts and routed inserts

One part is #41439

Another part is to implement a handler capable of inserting data into multiple tables, where the table name is extracted from the data (e.g. from a certain JSON field). It can create tables automatically using inferred or pre-defined schema.

Low Code application platform with ClickHouse

This task is for JavaScript and full-stack engineers. Implement a lightweight single-page application builder consisting of UI components (such as text inputs, buttons, and selectors), data visualization widgets (such as tables and charts), pages with layouts, tabs, and menus, plus event triggers and timers, with reactive dependencies and simple scriptable actions. The platform should not use any JavaScript frameworks or any backend databases other than ClickHouse. Compare with Retool, Tooljet, Appsmith, Budibase, and Illa.

An attempt: #71175 (comment)

Dynamic format schemas

Small task

#21096

Completeness of the implementation of PostgreSQL wire protocol

Add support for TLS (<-- done), prepared statements, and COPY command (<-- done, #74344) into the PostgreSQL compatibility protocol in ClickHouse.

Taken by: @scanhex12

Improvements for aggregate functions

#70801
#59832

Taken by: @deymon-d

Optimization with hypotheses on functional dependencies between columns

At insertion time, let's automatically derive whether a certain column can be fully calculated from, or can partially reuse information from, other columns.
For example, suppose there are columns user_id UInt64 and user_url String, and it turns out that user_url is always equal to concat('https://upyachka.com/users/', user_id). We will introduce a way to quickly check this type of dependency and a way to encode such dependencies in the data storage.
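
The check itself is simple at insertion time; a sketch of validating one hypothesized dependency against a batch of rows (column names and the helper mirror the example above but are otherwise illustrative, not ClickHouse internals):

```python
def check_dependency(rows, source_col, target_col, transform):
    """Return True when target_col can be fully derived from source_col
    via `transform` for every row in the inserted batch."""
    return all(transform(row[source_col]) == row[target_col] for row in rows)

rows = [
    {"user_id": 1, "user_url": "https://upyachka.com/users/1"},
    {"user_id": 2, "user_url": "https://upyachka.com/users/2"},
]
derive = lambda uid: f"https://upyachka.com/users/{uid}"
holds = check_dependency(rows, "user_id", "user_url", derive)
# If the hypothesis holds for every part, user_url need not be stored at all:
# only the rule and the exceptions.
```

The hard parts of the task are elsewhere: generating candidate hypotheses cheaply, tracking per-part validity, and storing exceptions when the dependency is only partial.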

Taken by: @Azakarka

Improvements for projections

Support for projections with where, distinct, array join, support for compression, and secondary indices. #74234

Taken by: @tigran-edu

Partial function application and pipelining

Introduce a few syntax extensions to ClickHouse:

Function pipelining: f | g ...

(f | g)(x) - the same as g(f(x))

Function binding:

f(c, _) - the same as x -> f(c, x)

This allows more convenient syntax for higher-order functions, e.g.
arrayMap(extractTextFromHTML | decodeHTMLComponent | substring(_, 100), tags)
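
The intended semantics can be modeled outside ClickHouse. A Python sketch of pipelining and placeholder binding (the helper names are illustrative; in the sketch, Python's Ellipsis literal plays the role of the proposed `_`):

```python
from functools import reduce

def pipeline(*fns):
    """Sketch of the proposed (f | g) semantics: pipeline(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, f: f(acc), fns, x)

def bind(f, *args):
    """Sketch of f(c, _) binding: placeholder positions are filled from the
    call-site arguments, left to right."""
    def bound(*rest):
        it = iter(rest)
        return f(*(next(it) if a is ... else a for a in args))
    return bound

# Mirrors substring(_, 3) piped after two unary functions:
shorten = bind(lambda s, n: s[:n], ..., 3)
clean = pipeline(str.strip, shorten)
```

With this model, `pipeline(str.strip, str.upper)("  hi ")` yields `"HI"`, just as `(f | g)(x)` should evaluate `g(f(x))`.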

Taken by: @TheIvanYes

Embedded documentation

Implement a built-in handler in clickhouse-server, which returns an HTML page with auto-generated reference documentation for settings, functions, table engines, etc.

Functions for XML and HTML processing

Implement functions for CSS selectors or a subset of XPath or a way to apply JSON paths to XML or HTML data.

Taken by: @Asm0deusss

Unification of URL-like functions

#59617

Taken by: @ryulal01

Set and Join tables as indices

Allow a MergeTree table to be configured with an index for fast key-value lookups.

Taken by: @Jow1e

Text classification with ngram models

Implement loading language models based on ngrams in the server. A model is represented by a map (ngram, class) -> count. Add a function that, given the model name and an input string, returns the most likely class or the probability distribution over classes using a naive Bayes classifier.

This might sound archaic, but it will work decently for tasks such as charset and language detection, as well as style classification.
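
The whole pipeline fits in a few lines, which is part of the appeal. A sketch of character-trigram training and naive Bayes classification with add-alpha smoothing (illustrative only; the real task is the server-side model loading and the SQL function):

```python
import math
from collections import Counter

def ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(samples):
    """Build the (ngram, class) -> count map from (text, class) pairs."""
    model = Counter()
    for text, cls in samples:
        for g in ngrams(text):
            model[(g, cls)] += 1
    return model

def classify(model, text, classes, alpha=1.0):
    """Return the most likely class under naive Bayes with add-alpha smoothing."""
    totals = {c: sum(v for (g, cc), v in model.items() if cc == c) for c in classes}
    vocab = len({g for g, _ in model})
    def logp(c):
        return sum(math.log((model[(g, c)] + alpha) / (totals[c] + alpha * vocab))
                   for g in ngrams(text))
    return max(classes, key=logp)
```

Even on tiny training data this separates languages reliably, which is why the approach still works well for charset and language detection.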

See previous attempts: #50247 and #48984

Contact: @rschu1ze

Topic taken by: @nihalzp

ALP Compression for floating point data

Implement the ALP compression codec which achieves much higher compression rates than the existing floating point codecs in ClickHouse (FPC and Gorilla) while being more than an order of magnitude faster (compression + decompression) than ZSTD.

Please re-implement the codec from scratch and do not integrate the original implementation.
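
For orientation only, the core observation ALP exploits can be sketched like this: many real-world doubles are short decimals, so they round-trip exactly through a small decimal exponent and compress well as integers. This is a drastic simplification; the actual codec adds a per-vector exponent/factor search, exception handling for non-conforming values, and bit-packing:

```python
def alp_like_encode(values, max_exp=10):
    """Find a decimal exponent e such that every value round-trips exactly
    as the integer round(v * 10**e); store the integers plus e.
    Simplified sketch of ALP's core idea, not the real codec."""
    for e in range(max_exp + 1):
        ints = [round(v * 10**e) for v in values]
        if all(i / 10**e == v for i, v in zip(ints, values)):
            return e, ints
    return None  # in the real codec: fall back / store exceptions

def alp_like_decode(e, ints):
    return [i / 10**e for i in ints]
```

The integers produced this way have small ranges and compress far better than the raw IEEE bit patterns, which is where the headline compression rates come from.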

See: #60533

Contact: @rschu1ze

Taken by: @stevebuildboats

Graph and ontology analysis on ClickHouse

This task is quite hard.

Implement an interpreter for SPARQL to process RDF data in ClickHouse or an interpreter for Cypher for graph data.
Example dataset is here: https://www.wikidata.org/wiki/Wikidata:Database_download

Taken by: @SteveBalayanAKAMedian

Large scale Internet topology analysis

This is a task for data analysis: a research project exploring IPv4 routing and connectivity dynamics, finding patterns that emerge in a large dataset of ICMP responses across the IPv4 range from multiple vantage points. These patterns could include connectivity problems and instabilities, anycast addresses, suboptimal and indirect traffic routes, estimates of the global topological structure of the Internet, and relationships between various types of networks, such as infrastructure, residential, and corporate.

Note: we already collected around 10 trillion data points.

Large scale Internet technology survey

This is for a full-stack engineer or data analyst.

This task does not require C++ knowledge and is implemented mostly in SQL.

Given a dataset of 1 TB of HTML pages (ClickHouse's Minicrawl) or 200 TB of HTML pages (CommonCrawl) do research around technologies used on the web and prepare a dataset with extracted metadata. Similar examples: https://w3techs.com/, https://builtwith.com/, https://httparchive.org/

You can check the preparation and current results in this task: #18842

The goal is to build a mini web app. The app will allow answering questions like: which JavaScript frameworks are the most popular on the Internet (the correct answer is jQuery), and how their popularity changes depending on the website's popularity, country, SSL certificate vendor, etc. There is a boatload of information, but take care: if this topic is not interesting to you, it's unlikely you will make a good app.

Assigned to: @antontarasenko

Binary files visualization and statistical analysis

Examples:

https://codisec.com/binary-visualization-explained/
http://binvis.io/#/

In this project we will apply binary visualization to a large corpus of binaries, such as every Debian package, every Python package, and every popular Docker image.
It could be combined with statistical analysis of machine instructions and register usage.

Taken by: @Frunze49

Tracing and visualization of memory accesses

Build an instrumented version of ClickHouse (-fsanitize=coverage) which tracks every read and write memory access. These will be recorded into a table for visualization. We will use animations with space-filling curves to understand the memory locality and cache locality of the running application.
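
For the space-filling-curve part, a Z-order (Morton) curve is the simplest choice (a Hilbert curve gives better locality at the cost of more code). A sketch of mapping 2D coordinates to a curve index and back, which can serve as the address-to-pixel mapping in the visualization (illustrative, not part of the task description):

```python
def morton_2d(x, y, bits=16):
    """Interleave the bits of (x, y) into a Z-order index, so that nearby
    points in the plane map to nearby positions along the curve."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return z

def deinterleave(z, bits=16):
    """Inverse mapping: curve index back to (x, y) coordinates."""
    x = y = 0
    for i in range(bits):
        x |= ((z >> (2 * i)) & 1) << i
        y |= ((z >> (2 * i + 1)) & 1) << i
    return x, y
```

In the visualization, `deinterleave(address_offset)` places consecutive memory addresses into compact 2D blocks, making cache-line and page-level locality visible at a glance.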

Taken by: @Owltop

Fuzzing with Snapchange

https://github.com/awslabs/snapchange

This tool allows pausing and replaying a process, exploring multiple possible paths of its execution. Let's apply it to fuzzing ClickHouse, e.g., trying different results for certain syscalls.

Taken by: @psevdoinsaf

Index advisor

Users currently have to choose primary and skipping indexes manually, e.g., by looking at which predicates in their queries will be costly. It would be better if ClickHouse could suggest candidate index columns automatically and show them in a system view. Most algorithms for automatic index selection are based on what-if cost estimation in the query optimizer, given some workload. Since we don't have cost estimation in the optimizer yet, we can try to make a recommendation from the number of scanned rows (the simplest approach) or use some lightweight statistics about the output cardinalities of scans (during a collection phase). Further references: paper1 here

The task is suitable as a bachelor or master thesis and will be conducted in close collaboration with a developer in the ClickHouse database team.

Taken by: @FArthur-cmd

Code: FArthur-cmd#4

Contact: @rschu1ze

Pipelined SQL syntax

#69364

Contact: @rschu1ze

Topic taken by: @sinfillo

SMASH String Similarity

SMASH is a new string similarity metric (similar to the edit distance) which is robust towards acronyms: it recognizes that "ceo" equals "chief executive officer". Earlier techniques could also do that, but they required dictionaries, whereas SMASH uses dynamic programming to find matching terms. The task is to add SMASH as a string similarity function to ClickHouse.

Contact: @rschu1ze

Topic taken by: @vlad

PR: #79461

Import and Export of Substrait plans

Substrait is an emerging representation of relational algebra. Its purpose is to decouple and modularize complex data systems. For example, it enables ClickHouse to execute computation on data described by a Substrait plan generated by another system. Similarly, ClickHouse might push down calculations into dependent databases using Substrait plans (e.g. Sirius) or utilize external optimizers for query optimization (e.g. optd). The task is to produce Substrait plans (similar to today's EXPLAIN PLAN) and to consume Substrait plans (using a new function runSubstrait).

Assigned to: @Sreyas2401 (#71175 (comment))

Contact: @rschu1ze

Prefix-compressed dictionaries

Most column stores provide some kind of dictionary/domain encoding; oftentimes it is even the default. This compression scheme maps each unique column value to a value id and replaces the column values with value ids. ClickHouse implements this in the LowCardinality type. The approach works particularly well if a column has only a few unique values.

To use dictionary compression also when there is a moderate or high number of unique values, additional tricks are needed to prevent the dictionary from blowing up in size. One simple technique for string dictionaries is front coding: the dictionary values are sorted, and each string is encoded as a pair (length of the longest common prefix with its predecessor, suffix). We could add additional "checkpoints" with uncompressed strings to make random access faster.
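
Front coding itself is a few lines; a sketch of the encode/decode pair described above (illustrative only, without the checkpoint optimization):

```python
def front_encode(strings):
    """Sort, then store each string as (shared-prefix length, suffix)."""
    out, prev = [], ""
    for s in sorted(strings):
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        out.append((lcp, s[lcp:]))
        prev = s
    return out

def front_decode(encoded):
    """Rebuild the sorted dictionary by replaying prefixes sequentially."""
    out, prev = [], ""
    for lcp, suffix in encoded:
        prev = prev[:lcp] + suffix
        out.append(prev)
    return out
```

Decoding is inherently sequential from the start, which is exactly why the task suggests periodic checkpoints with uncompressed strings: random access then only has to replay from the nearest checkpoint.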

Contact: @rschu1ze

Topic taken by: @Krytog, @cidann

PR: #80356

Improvements for query caching

Contact: @rschu1ze

Topic taken by: @nbarannik
