
Intern Tasks 2024/2025 #71175

@alexey-milovidov


This is the list of proposed tasks. It is to be extended. You can propose more tasks.
You can also find the previous lists here:

2023/2024: #58394
2022/2023: #42194
2021/2022: #29601
2020/2021: #15065
2019/2020: https://gist.github.com/alexey-milovidov/4251f71275f169d8fd0867e2051715e9
2018/2019: https://gist.github.com/alexey-milovidov/6735d193762cab1ad3b6e6af643e3a43
2017/2018: https://gist.github.com/alexey-milovidov/26cc3862eb87e52869b9dac64ab99156

The tasks should be:

  • not too hard (doable within about a month), but usually not less than a week of work;
  • not altering core components of the system;
  • mostly isolated, not requiring full knowledge of the system;
  • somewhat interesting to implement, or having some research angle;
  • not on the critical path of our roadmap (ok to be thrown away after a year);
  • mostly for C++ developers, but there should also be tasks for frontend developers, or tools/research tasks that only require Go/Python/whatever;
  • some should allow teamwork;
  • covering various skills, e.g., system programming, algorithm knowledge, etc.

Instrumentation with LLVM XRay

When the code is compiled with LLVM XRay, the compiler adds space (filled with nop instructions) at the entry and exit of each large function, which can be patched with callbacks at runtime. This has low runtime overhead and can be used for production builds, so they can be traced when needed. LLVM already provides various callbacks for global application tracing, but we want to integrate these capabilities into ClickHouse, e.g., to enable and disable tracing with SYSTEM queries, to dynamically add logs or failpoints (e.g., sleep for a certain time in a certain function), to write tracing information into system tables, to support per-query tracing and the augmentation of traces with information about the query, to implement code coverage in production, etc.

This topic will be interesting for people who like low-level programming and compilers.

Issue with more details: #74249

Taken by: @htuall

Probabilistic data structures for filtering

The idea of this task is to provide a probabilistic alternative for the IN (subquery) operator using bloom filters, counting bloom filters (to check for elements that likely appeared multiple times), cuckoo filters, quotient filters, and vacuum filters, and to compare all these algorithms.

The applications are cohort analysis and antifraud.
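
As a reference point, the simplest of these structures can be sketched in a few lines. A minimal Bloom filter in Python (illustrative only, not the ClickHouse implementation; parameters and names are made up for the sketch):

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: k hash probes into an m-bit array.
    Membership tests may yield false positives but never false negatives."""
    def __init__(self, m=1024, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8 + 1)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

For an approximate IN (subquery), the right-hand side would be inserted into the filter and the left-hand side probed against it; the other filters in the list trade space, precision, and deletion support differently.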

Taken by: @afigor2701

PR: #79648

Drivers for User-Defined Functions

Description: #71172

PR: #77128

Taken by: @dahbka-lis

Vortex

Implement support for Vortex as an input and output format.

Contact: @rschu1ze

Assigned to @NicholasTerek

WebAssembly and JS isolates for UDFs

We will try to integrate a JavaScript interpreter to use v8 isolates directly in ClickHouse.

PR: #88747

Taken by: @vdimir and @lioshik

Git as a database engine

ClickHouse has a notion of database engines - they are responsible for enumerating the tables and managing their creation, deletion and metadata changes.

Let's implement a database engine that connects to a Git repository and interprets a list of .sql files as the list of tables. Changes to the database will be derived from the git history.

Taken by: @Ilukatsk

Integration with AWS SQS and AWS Kinesis

Subscription to streaming data sources similar to the existing engines for Kafka and RabbitMQ.

Taken by: @kazalika

Implementation of Arrow Flight SQL and ADBC

#7554

Taken by: @zakr600

Fuzzing ClickHouse

Explore new and creative ways for fuzz testing, such as:

  • custom data mutators in coverage-guided fuzzers
  • structure-aware fuzzing
  • LLM-assisted mutators
  • exploring unusual entrypoints for fuzzing (such as file and network related system calls)

Keeper on shared storage

Allow using high-latency storage with Keeper by batching requests for hundreds of milliseconds. This allows running it on stateless machines, which can be created and destroyed on demand.

#55329

Remote database and table engine

#59304

Improvements for the URL engine

#45994

Taken by: @yistarostin

Functions for quantizations

Implement functions for packing and unpacking vectors (arrays of floating-point numbers) into/from FixedStrings with quantized (coarsened) elements, such as 16-, 8-, and 4-bit floats, switched float representations (where the number of exponent and mantissa bits is determined by another bit), integers, and bits. Implement functions to calculate (approximate) vector distances (e.g., L2/cosine distance) on top of quantized FixedStrings.

These functions are needed for search in vector embeddings.
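
To illustrate the shape of the task, here is a sketch of symmetric linear 8-bit quantization: pack a float vector as a per-vector scale plus one signed byte per element, and compute an approximate L2 distance on the packed form. All names are illustrative, not the proposed ClickHouse API:

```python
import struct

def quantize_int8(values):
    """Pack a float vector into bytes: a float32 scale followed by one
    signed byte per element (symmetric linear quantization sketch)."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    return struct.pack("<f", scale) + bytes(round(v / scale) & 0xFF for v in values)

def dequantize_int8(packed):
    scale = struct.unpack_from("<f", packed)[0]
    return [scale * (b - 256 if b > 127 else b) for b in packed[4:]]

def l2_quantized(a, b):
    """Approximate L2 distance computed on the dequantized representations."""
    va, vb = dequantize_int8(a), dequantize_int8(b)
    return sum((x - y) ** 2 for x, y in zip(va, vb)) ** 0.5
```

The real task would operate on FixedStrings and vectorized code paths, and would cover the other bit widths and float layouts listed above.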

Contact: @rschu1ze

Topic taken by: @nikita4109

PR:

Cuckoo-filter skip indexes

The skip indexes in ClickHouse are based on min-max statistics, bloom filters, or unique sets. Cuckoo filters provide an interesting trade-off: they are more compact and more precise than ordinary bloom filter indexes. The task is to implement a cuckoo-filter skip index based on this paper.
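
For orientation, the core of a cuckoo filter (partial-key cuckoo hashing with small fingerprints and two candidate buckets, as in the Fan et al. design) can be sketched as follows. This is illustrative only; parameters, hashing, and names are made up for the sketch:

```python
import hashlib, random

class CuckooFilter:
    """Minimal cuckoo filter: 1-byte fingerprints, two candidate buckets,
    eviction with a bounded number of kicks."""
    def __init__(self, n_buckets=256, bucket_size=4, max_kicks=64):
        self.n, self.b, self.max_kicks = n_buckets, bucket_size, max_kicks
        self.buckets = [[] for _ in range(n_buckets)]

    def _fp_and_index(self, item):
        h = hashlib.sha256(str(item).encode()).digest()
        fp = h[0] or 1                                  # non-zero fingerprint
        return fp, int.from_bytes(h[1:5], "big") % self.n

    def _alt_index(self, i, fp):
        # partial-key cuckoo hashing: i2 = i1 XOR hash(fp); involutive when
        # n_buckets is a power of two, so each fp has exactly two homes
        h = hashlib.sha256(bytes([fp])).digest()
        return (i ^ int.from_bytes(h[:4], "big")) % self.n

    def add(self, item):
        fp, i = self._fp_and_index(item)
        for idx in (i, self._alt_index(i, fp)):
            if len(self.buckets[idx]) < self.b:
                self.buckets[idx].append(fp)
                return True
        idx = i
        for _ in range(self.max_kicks):                 # evict a random resident
            j = random.randrange(self.b)
            fp, self.buckets[idx][j] = self.buckets[idx][j], fp
            idx = self._alt_index(idx, fp)
            if len(self.buckets[idx]) < self.b:
                self.buckets[idx].append(fp)
                return True
        return False                                    # table considered full

    def might_contain(self, item):
        fp, i = self._fp_and_index(item)
        return fp in self.buckets[i] or fp in self.buckets[self._alt_index(i, fp)]
```

Unlike bloom filters, this layout also supports deletion (remove one copy of the fingerprint from either bucket), which is part of what makes it attractive for skip indexes.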

Contact: @rschu1ze

Taken by @ksuleimanova

Support for graphics rendering as output formats

#57834

Assigned to: @m7kss1

Improvements for table functions and federated queries

Allow a table function, such as mysql/postgres to include an explicit query for the external data source.
Extend push-down capabilities to certain query clauses such as LIMIT.

#46758

Taken by: @the-devid

Decoupling memory buffers and columns

Description: #40222

This task will be interesting for those who like low-level programming.

Taken by: @georgyshamteev

Framing formats for query results

This task is to solve two problems:

  • how to send a query result intermixed with progress information, query metrics, and similar data;
  • how to send multiple result sets in a single response.

The idea is to introduce "framing formats" which encapsulate various chunks of responses.
One of them will be application/event-stream for the HTTP interface.

Taken by: @h0tmi @ilyaleshchik

Improvements for clickhouse client

#65388
#65252
#30274
#70323
#65549
#63093
#47800
and various other improvements.

Taken by: @aaaengel

Integration of random forest learning

Implement regression and classification models directly in ClickHouse using a table engine, so that when you insert into a table, it learns the model. It will require integration of a library like LightGBM, XGBoost, Catboost, or an algorithm behind these libraries, directly into ClickHouse.

Taken by: @HazyMrf

Proxy mode for query routing

Implement a tool named clickhouse-proxy, which connects to other ClickHouse servers and routes queries and responses according to various criteria, such as the user name, the database name, the host name from the Host field or the TLS SNI field, or even query patterns. It should be capable of routing all supported ClickHouse protocols in a resource-constrained fashion.

Taken by: @X-OrBit

ClickHouse in WebAssembly

Compile ClickHouse to run inside a web browser and make sure something meaningful works.

Streaming inserts and routed inserts

One part is #41439

Another part is to implement a handler capable of inserting data into multiple tables, where the table name is extracted from the data (e.g. from a certain JSON field). It can create tables automatically using inferred or pre-defined schema.

Low Code application platform with ClickHouse

This task is for JavaScript and full-stack engineers. Implement a lightweight single-page application builder consisting of UI components (such as text inputs, buttons, and selectors), data visualization widgets (such as tables and charts), pages with layouts, tabs, and menus, plus event triggers and timers, with reactive dependencies and simple scriptable actions. The platform should not use any JavaScript frameworks or any backend databases other than ClickHouse. Compare with Retool, Tooljet, Appsmith, Budibase, and Illa.

An attempt: #71175 (comment)

Dynamic format schemas

Small task

#21096

Completeness of the implementation of PostgreSQL wire protocol

Add support for TLS (<-- done), prepared statements, and COPY command (<-- done, #74344) into the PostgreSQL compatibility protocol in ClickHouse.

Taken by: @scanhex12

Improvements for aggregate functions

#70801
#59832

Taken by: @deymon-d

Optimization with hypotheses on functional dependencies between columns

At insertion time, let's automatically derive whether a certain column can be fully calculated from, or can partially reuse information from, other columns.
For example, suppose there are columns user_id UInt64 and user_url String, and it turns out that user_url is always equal to concat('https://upyachka.com/users/', user_id). We will introduce a way to quickly check this type of dependency and a way to encode such dependencies in the data storage.
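
The check itself is simple at insertion time; a sketch of validating one hypothesized dependency against a batch of rows (column names and the helper mirror the example above but are otherwise illustrative, not ClickHouse internals):

```python
def check_dependency(rows, source_col, target_col, transform):
    """Return True when target_col can be fully derived from source_col
    via `transform` for every row in the inserted batch."""
    return all(transform(row[source_col]) == row[target_col] for row in rows)

rows = [
    {"user_id": 1, "user_url": "https://upyachka.com/users/1"},
    {"user_id": 2, "user_url": "https://upyachka.com/users/2"},
]
derive = lambda uid: f"https://upyachka.com/users/{uid}"
holds = check_dependency(rows, "user_id", "user_url", derive)
# If the hypothesis holds for every part, user_url need not be stored at all:
# only the rule and the exceptions.
```

The hard parts of the task are elsewhere: generating candidate hypotheses cheaply, tracking per-part validity, and storing exceptions when the dependency is only partial.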

Taken by: @Azakarka

Improvements for projections

Support for projections with where, distinct, array join, support for compression, and secondary indices. #74234

Taken by: @tigran-edu

Partial function application and pipelining

Introduce a few syntax extensions to ClickHouse:

Function pipelining: f | g ...

(f | g)(x) - the same as g(f(x))

Function binding:

f(c, _) - the same as x -> f(c, x)

This allows more convenient syntax for higher-order functions, e.g.
arrayMap(extractTextFromHTML | decodeHTMLComponent | substring(_, 100), tags)
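
The intended semantics can be modeled outside ClickHouse. A Python sketch of pipelining and placeholder binding (the helper names are illustrative; in the sketch, Python's Ellipsis literal plays the role of the proposed `_`):

```python
from functools import reduce

def pipeline(*fns):
    """Sketch of the proposed (f | g) semantics: pipeline(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, f: f(acc), fns, x)

def bind(f, *args):
    """Sketch of f(c, _) binding: placeholder positions are filled from the
    call-site arguments, left to right."""
    def bound(*rest):
        it = iter(rest)
        return f(*(next(it) if a is ... else a for a in args))
    return bound

# Mirrors substring(_, 3) piped after two unary functions:
shorten = bind(lambda s, n: s[:n], ..., 3)
clean = pipeline(str.strip, shorten)
```

With this model, `pipeline(str.strip, str.upper)("  hi ")` yields `"HI"`, just as `(f | g)(x)` should evaluate `g(f(x))`.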

Taken by: @TheIvanYes

Embedded documentation

Implement a built-in handler in clickhouse-server, which returns an HTML page with auto-generated reference documentation for settings, functions, table engines, etc.

Functions for XML and HTML processing

Implement functions for CSS selectors or a subset of XPath or a way to apply JSON paths to XML or HTML data.

Taken by: @Asm0deusss

Unification of URL-like functions

#59617

Taken by: @ryulal01

Set and Join tables as indices

Allow a MergeTree table to be configured with an index for fast key-value lookups.

Taken by: @Jow1e

Text classification with ngram models

Implement loading language models based on ngrams in the server. A model is represented by a map (ngram, class) -> count. Add a function that, given the model name and an input string, returns the most likely class or the probability distribution over classes using a naive Bayes classifier.

This might sound archaic, but it will work decently for tasks such as charset and language detection, as well as style classification.
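
The whole pipeline fits in a few lines, which is part of the appeal. A sketch of character-trigram training and naive Bayes classification with add-alpha smoothing (illustrative only; the real task is the server-side model loading and the SQL function):

```python
import math
from collections import Counter

def ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(samples):
    """Build the (ngram, class) -> count map from (text, class) pairs."""
    model = Counter()
    for text, cls in samples:
        for g in ngrams(text):
            model[(g, cls)] += 1
    return model

def classify(model, text, classes, alpha=1.0):
    """Return the most likely class under naive Bayes with add-alpha smoothing."""
    totals = {c: sum(v for (g, cc), v in model.items() if cc == c) for c in classes}
    vocab = len({g for g, _ in model})
    def logp(c):
        return sum(math.log((model[(g, c)] + alpha) / (totals[c] + alpha * vocab))
                   for g in ngrams(text))
    return max(classes, key=logp)
```

Even on tiny training data this separates languages reliably, which is why the approach still works well for charset and language detection.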

See previous attempts: #50247 and #48984

Contact: @rschu1ze

Topic taken by: @nihalzp

ALP Compression for floating point data

Implement the ALP compression codec which achieves much higher compression rates than the existing floating point codecs in ClickHouse (FPC and Gorilla) while being more than an order of magnitude faster (compression + decompression) than ZSTD.

Please re-implement the codec from scratch and do not integrate the original implementation.
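
For orientation only, the core observation ALP exploits can be sketched like this: many real-world doubles are short decimals, so they round-trip exactly through a small decimal exponent and compress well as integers. This is a drastic simplification; the actual codec adds a per-vector exponent/factor search, exception handling for non-conforming values, and bit-packing:

```python
def alp_like_encode(values, max_exp=10):
    """Find a decimal exponent e such that every value round-trips exactly
    as the integer round(v * 10**e); store the integers plus e.
    Simplified sketch of ALP's core idea, not the real codec."""
    for e in range(max_exp + 1):
        ints = [round(v * 10**e) for v in values]
        if all(i / 10**e == v for i, v in zip(ints, values)):
            return e, ints
    return None  # in the real codec: fall back / store exceptions

def alp_like_decode(e, ints):
    return [i / 10**e for i in ints]
```

The integers produced this way have small ranges and compress far better than the raw IEEE bit patterns, which is where the headline compression rates come from.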

See: #60533

Contact: @rschu1ze

Taken by: @stevebuildboats

Graph and ontology analysis on ClickHouse

This task is quite hard.

Implement an interpreter for SPARQL to process RDF data in ClickHouse or an interpreter for Cypher for graph data.
Example dataset is here: https://www.wikidata.org/wiki/Wikidata:Database_download

Taken by: @SteveBalayanAKAMedian

Large scale Internet topology analysis

This is a task for data analysis: a research project exploring IPv4 routing and connectivity dynamics, finding patterns that emerge in a large dataset of ICMP responses across the IPv4 range from multiple vantage points. These patterns could include connectivity problems and instabilities, anycast addresses, suboptimal and indirect traffic routes, estimates of the global topological structure of the Internet, and relationships between various types of networks, such as infrastructure, residential, and corporate.

Note: we already collected around 10 trillion data points.

Large scale Internet technology survey

This is for a full-stack engineer or data analyst.

This task does not require C++ knowledge and is implemented mostly in SQL.

Given a dataset of 1 TB of HTML pages (ClickHouse's Minicrawl) or 200 TB of HTML pages (CommonCrawl) do research around technologies used on the web and prepare a dataset with extracted metadata. Similar examples: https://w3techs.com/, https://builtwith.com/, https://httparchive.org/

You can check the preparation and current results in this task: #18842

The goal is to build a mini web app. The app will allow answering questions like: which JavaScript frameworks are the most popular on the Internet (the correct answer is jQuery), and how their popularity changes depending on the website's popularity, country, SSL certificate vendor, etc. There is a boatload of information, but take care: if this topic is not interesting to you, it's unlikely you will make a good app.

Assigned to: @antontarasenko

Binary files visualization and statistical analysis

Examples:

https://codisec.com/binary-visualization-explained/
http://binvis.io/#/

In this project we will apply binary visualization to a large corpus of binaries, such as every Debian package, every Python package, and every popular Docker image.
It could be combined with statistical analysis of machine instructions and register usage.

Taken by: @Frunze49

Tracing and visualization of memory accesses

Build an instrumented version of ClickHouse (-fsanitize=coverage) which tracks every read and write memory access. These will be recorded into a table for visualization. We will use animations with space-filling curves to understand the memory locality and cache locality of the running application.
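
For the space-filling-curve part, a Z-order (Morton) curve is the simplest choice (a Hilbert curve gives better locality at the cost of more code). A sketch of mapping 2D coordinates to a curve index and back, which can serve as the address-to-pixel mapping in the visualization (illustrative, not part of the task description):

```python
def morton_2d(x, y, bits=16):
    """Interleave the bits of (x, y) into a Z-order index, so that nearby
    points in the plane map to nearby positions along the curve."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return z

def deinterleave(z, bits=16):
    """Inverse mapping: curve index back to (x, y) coordinates."""
    x = y = 0
    for i in range(bits):
        x |= ((z >> (2 * i)) & 1) << i
        y |= ((z >> (2 * i + 1)) & 1) << i
    return x, y
```

In the visualization, `deinterleave(address_offset)` places consecutive memory addresses into compact 2D blocks, making cache-line and page-level locality visible at a glance.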

Taken by: @Owltop

Fuzzing with Snapchange

https://github.com/awslabs/snapchange

This tool allows pausing and replaying a process, exploring multiple possible paths of its execution. Let's apply it to fuzzing ClickHouse, e.g., trying different results for certain syscalls.

Taken by: @psevdoinsaf

Index advisor

Users currently have to choose primary and skipping indexes manually, e.g., by looking at which predicates in their queries will be costly. It would be better if ClickHouse could suggest candidate index columns automatically and show them in a system view. Most algorithms for automatic index selection are based on what-if cost estimation in the query optimizer, given some workload. Since we don't have cost estimation in the optimizer yet, we can try to make a recommendation from the number of scanned rows (the simplest approach) or use some lightweight statistics about the output cardinalities of scans (during a collection phase). Further references: paper1 here

The task is suitable as a bachelor or master thesis and will be conducted in close collaboration with a developer in the ClickHouse database team.

Taken by: @FArthur-cmd

Code: FArthur-cmd#4

Contact: @rschu1ze

Pipelined SQL syntax

#69364

Contact: @rschu1ze

Topic taken by: @sinfillo

SMASH String Similarity

SMASH is a new string similarity metric (similar to the edit distance) which is robust towards acronyms: it recognizes that "ceo" equals "chief executive officer". Earlier techniques could also do that, but they required dictionaries, whereas SMASH uses dynamic programming to find matching terms. The task is to add SMASH as a string similarity function to ClickHouse.

Contact: @rschu1ze

Topic taken by: @vlad

PR: #79461

Import and Export of Substrait plans

Substrait is an emerging representation of relational algebra. Its purpose is to decouple and modularize complex data systems. For example, it enables ClickHouse to execute computation on data described by a Substrait plan generated by another system. Similarly, ClickHouse might push down calculations into dependent databases using Substrait plans (e.g. Sirius) or utilize external optimizers for query optimization (e.g. optd). The task is to produce Substrait plans (similar to today's EXPLAIN PLAN) and to consume Substrait plans (using a new function runSubstrait).

Assigned to: @Sreyas2401 (#71175 (comment))

Contact: @rschu1ze

Prefix-compressed dictionaries

Most column stores provide some kind of dictionary/domain encoding; oftentimes it is even the default. This compression scheme maps each unique column value to a value id and replaces the column values with value ids. ClickHouse implements this in the LowCardinality type. The approach works particularly well if a column has only a few unique values.

To use dictionary compression also when there is a moderate or high number of unique values, additional tricks are needed to prevent the dictionary from blowing up in size. One simple technique for string dictionaries is front coding: the dictionary values are sorted, and each string is encoded as a pair (length of the longest common prefix with its predecessor, suffix). We could add additional "checkpoints" with uncompressed strings to make random access faster.
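
Front coding itself is a few lines; a sketch of the encode/decode pair described above (illustrative only, without the checkpoint optimization):

```python
def front_encode(strings):
    """Sort, then store each string as (shared-prefix length, suffix)."""
    out, prev = [], ""
    for s in sorted(strings):
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        out.append((lcp, s[lcp:]))
        prev = s
    return out

def front_decode(encoded):
    """Rebuild the sorted dictionary by replaying prefixes sequentially."""
    out, prev = [], ""
    for lcp, suffix in encoded:
        prev = prev[:lcp] + suffix
        out.append(prev)
    return out
```

Decoding is inherently sequential from the start, which is exactly why the task suggests periodic checkpoints with uncompressed strings: random access then only has to replay from the nearest checkpoint.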

Contact: @rschu1ze

Topic taken by: @Krytog, @cidann

PR: #80356

Improvements for query caching

Contact: @rschu1ze

Topic taken by: @nbarannik
