A Practitioner’s Guide to ClickHouse

Yan Zeng, version 1.0, last updated on 2/25/2023

Downloadable at https://leanpub.com/sqf6-clickhouse-guide

Contents
Goals
References
Row-Based DBMS vs. Column-Based DBMS
Sample Equity Tick Data
Basic Statistics of One Day’s Tick Data
Key Concepts in ClickHouse: Partition, Primary Key, Order By, Skip Index
Illustration of Data Storage in ClickHouse
Partition
  Basics
  Directory creation, merging, and deletion
  Partition improves query performance
  Key Takeaways for Partition
Primary Key, Order By
  Sparse index to locate granules
  Build and use the primary index
  Use mark files
  Generic exclusion search algorithm
  Data skipping index: a secondary index to group and skip granules
Test, Test, Test!
Additional Resources

Goals
• Provide a self-contained introduction to the inner workings of ClickHouse
➢ How are data stored?
➢ How are data queried?
• Design a suitable table schema for equity tick data
• Intended audience: engineers implementing ClickHouse; quants using ClickHouse

References
• 朱凯：《ClickHouse 原理解析与应用实践》，机械工业出版社，2020.
• Vijay Anand: Up and Running with ClickHouse, BPB Publications, India, 2020.
• ClickHouse Official Documentation: https://clickhouse.com/docs/en/intro

Row-Based DBMS vs. Column-Based DBMS

• Row-based DBMS

• Column-based DBMS

Sample Equity Tick Data
• Data Source: https://firstratedata.com/tick-data
• Sample data: trades of AAPL and MSFT on 2020-01-02

• Table schema for trade data:

• Table creation for trade data

Basic Statistics of One Day’s Tick Data

• Based on Bloomberg BPIPE equity tick data (trades & quotes combined) on 2022-05-27
➢ ~19K distinct tickers
➢ ~136 million distinct time stamps
➢ ~595 million rows
➢ ~7.4 GB of compressed data and ~55.5 GB uncompressed data
➢ These numbers allow back-of-the-envelope estimation of query efficiency (see later)

• Equity tick data are huge, so storage and queries need to be extremely efficient
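
The statistics above can be turned into a few derived figures. The following is a minimal Python sketch of that arithmetic, assuming decimal units (1 GB = 10^9 bytes); the variable names are illustrative and not part of the original guide.

```python
# Back-of-the-envelope figures for one day of tick data (2022-05-27).
# Assumption: decimal units, i.e. 1 GB = 1e9 bytes.
ROWS = 595e6              # ~595 million rows
TICKERS = 19e3            # ~19K distinct tickers
COMPRESSED_GB = 7.4
UNCOMPRESSED_GB = 55.5

compression_ratio = UNCOMPRESSED_GB / COMPRESSED_GB
bytes_per_row = UNCOMPRESSED_GB * 1e9 / ROWS
rows_per_ticker = ROWS / TICKERS

print(f"compression ratio          ~ {compression_ratio:.1f}x")
print(f"uncompressed bytes per row ~ {bytes_per_row:.0f}")
print(f"average rows per ticker    ~ {rows_per_ticker:,.0f}")
```

These are averages only; actively traded tickers will have far more rows than the mean.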

Key Concepts in ClickHouse: Partition, Primary Key, Order By, Skip Index
Conceptually,

• Partition: directory for physical storage of data

• Order By: sort rows by lexicographic order of sort keys
• Primary Key: for indexing data location
• Data Skipping Index: additional data indexing

Physically,

• Data are divided by “partitions” (directories)

• Within each partition, column data are stored separately in [Column].bin
• Rows are in lexicographic ascending order by the primary key columns (and the additional sort
key columns)
• Rows from different columns are matched via [Column].mrk

Illustration of Data Storage in ClickHouse
Using web browsing data as an illustration: UserID, URL, EventTime

• PRIMARY KEY (UserID, URL) ORDER BY (UserID, URL, EventTime)

• Web browsing data are sorted first by UserID, then by URL, and lastly by EventTime
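
Lexicographic ordering by (UserID, URL, EventTime) behaves like tuple comparison. A minimal Python sketch, using made-up rows (the values are hypothetical and only serve the illustration):

```python
from datetime import datetime

# Hypothetical web browsing rows: (UserID, URL, EventTime).
rows = [
    (2, "/home", datetime(2023, 1, 1, 9, 5)),
    (1, "/news", datetime(2023, 1, 1, 9, 1)),
    (1, "/home", datetime(2023, 1, 1, 9, 7)),
    (1, "/home", datetime(2023, 1, 1, 9, 2)),
]

# Python compares tuples element by element, which is exactly a
# lexicographic sort: UserID first, then URL, then EventTime.
sorted_rows = sorted(rows)

for user_id, url, event_time in sorted_rows:
    print(user_id, url, event_time.isoformat())
```

Rows with the same UserID end up adjacent, and within one UserID rows with the same URL are adjacent, which is what makes a sparse index over the leading columns useful.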

Partition
Basics
• Each insert writes its data to disk as a new directory, without modifying existing ones, so table insertion is fast
• As a result, multiple directories for the same partition are created, and then merged (10-15 min.
after insertion); old directories will then be deleted (~8 min. after merging)
• MinBlockNum, MaxBlockNum: global counter across partitions, which increases by 1 whenever a new
partition directory is created
• Level: the number of merges a particular partition directory has undergone; a local counter of “age”
• Example: directory “201905_1_1_0” is the first directory created for the partition “201905”
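
The directory name in the example can be read as PartitionID_MinBlockNum_MaxBlockNum_Level. The sketch below parses such names and derives the name of a merged directory. The merge rule used here (smallest MinBlockNum, largest MaxBlockNum, highest Level plus one) is an assumption based on the usual description of MergeTree merges; the helper names `parse_part_name` and `merged_name` are hypothetical.

```python
from typing import NamedTuple


class PartName(NamedTuple):
    partition_id: str
    min_block: int
    max_block: int
    level: int


def parse_part_name(name: str) -> PartName:
    """Split 'PartitionID_MinBlockNum_MaxBlockNum_Level' into its fields."""
    partition_id, min_block, max_block, level = name.rsplit("_", 3)
    return PartName(partition_id, int(min_block), int(max_block), int(level))


def merged_name(names: list[str]) -> str:
    """Name of the directory produced by merging directories of ONE partition.

    Assumed rule: min of MinBlockNum, max of MaxBlockNum, max Level + 1.
    """
    parts = [parse_part_name(n) for n in names]
    partition_ids = {p.partition_id for p in parts}
    if len(partition_ids) != 1:
        raise ValueError("only directories of the same partition are merged")
    return "{}_{}_{}_{}".format(
        partition_ids.pop(),
        min(p.min_block for p in parts),
        max(p.max_block for p in parts),
        max(p.level for p in parts) + 1,
    )


print(parse_part_name("201905_1_1_0"))
print(merged_name(["201905_1_1_0", "201905_3_3_0"]))
```

Under these assumptions, two fresh directories of partition 201905 with block numbers 1 and 3 merge into a single directory at level 1.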

Directory creation, merging, and deletion

In the example below, month is used as the partition key for table “partition_v5”, e.g. 201905,
201906, etc.

Partition improves query performance
Back to Bloomberg equity tick data. Assume we have 3 years of daily data and we use date as the
partition key

• This will lead to about 250 * 3 = 750 partitions

• Partition indexing (minmax.idx) is triggered when the partition column “timestamp” is
used in the WHERE condition, allowing ClickHouse to skip many irrelevant partitions.
➢ SELECT * FROM equity_tickdata LIMIT 10 ⇒ full table scan, 750 partitions
will be scanned
➢ SELECT * FROM equity_tickdata WHERE timestamp >=
toDateTime64('2020-01-02 00:00:00.000000', 6) AND timestamp
<= toDateTime64('2020-01-02 23:59:59.000000', 6) LIMIT 10 ⇒
scan data for 01/02/2020, 1 partition will be scanned
• The number of partitions affects query efficiency by up to 10x (source: Altinity), e.g. month vs. date as
partition key
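
The pruning idea can be sketched in a few lines of Python. This is a simplified model, not ClickHouse's implementation: each partition is represented only by the minimum and maximum timestamp it contains, and, as a simplifying assumption, the 750 partitions are consecutive calendar days rather than trading days. The function names are hypothetical.

```python
from datetime import date, datetime, timedelta


def build_daily_partitions(start: date, days: int) -> dict:
    """Map each daily partition ID to the (min, max) timestamp it covers."""
    partitions = {}
    for i in range(days):
        d = start + timedelta(days=i)
        lo = datetime(d.year, d.month, d.day)
        hi = lo + timedelta(days=1) - timedelta(microseconds=1)
        partitions[d.strftime("%Y%m%d")] = (lo, hi)
    return partitions


def prune(partitions: dict, query_lo=None, query_hi=None) -> list:
    """Keep only partitions whose [min, max] range overlaps the query range.

    With no bound on the partition column, nothing can be skipped.
    """
    keep = []
    for name, (lo, hi) in partitions.items():
        if query_lo is not None and hi < query_lo:
            continue  # partition ends before the queried range starts
        if query_hi is not None and lo > query_hi:
            continue  # partition starts after the queried range ends
        keep.append(name)
    return keep


partitions = build_daily_partitions(date(2020, 1, 1), 750)
no_filter = prune(partitions)
one_day = prune(
    partitions,
    datetime(2020, 1, 2, 0, 0, 0),
    datetime(2020, 1, 2, 23, 59, 59),
)
print(len(no_filter), one_day)
```

Without a condition on the partition column every partition remains a candidate; with the one-day range only a single partition survives.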

Key Takeaways for Partition

• A partition is a directory for physical storage of data
• Partition allows fast table insertion: multiple directories are created for the same partition, and
then merged and deleted
• Partition allows fast data query: when the partitioning column(s) appear in the WHERE clause,
partition indexing is triggered and only the relevant partitions are scanned for the query result
• The number of partitions should not be too large: building and reading partition index files takes time
and memory

Primary Key, Order By
Recall that web browsing data are sorted first by UserID, then by URL, and lastly by EventTime:
PRIMARY KEY (UserID, URL) ORDER BY (UserID, URL, EventTime)

• Ordered data storage allows for efficient search algorithms, e.g. binary search
• Web browsing data are sorted first by UserID, then by URL, and lastly by EventTime

Sparse index to locate granules

• Primary key columns are used to build a sparse index, which, when combined with column level
offset files (“mark”), can quickly locate matching data
➢ First element of the primary key columns is used for binary search algorithm
➢ Other elements of the primary key columns are used for generic exclusion search
algorithm (more on this later)
• Data are logically grouped into “granules”
➢ typically, 8192 rows, set by index_granularity
➢ for Bloomberg equity tick data on 5/27/2022, 1 granule ≈ 55.5 GB / 595 mil. * 8192 ≈
0.76 MB, 1 ticker ≈ 55.5 GB / 19K ≈ 3 MB ≈ 4 granules (on average, uncompressed)
• After being located by the sparse index, relevant granules are loaded into memory for parallel
data processing
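
The granule estimates above follow from the one-day statistics. A short Python sketch of the calculation, assuming decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes) and the default index_granularity of 8192:

```python
# Inputs from the one-day statistics (2022-05-27); decimal units assumed.
INDEX_GRANULARITY = 8192          # rows per granule
UNCOMPRESSED_BYTES = 55.5e9
ROWS = 595e6
TICKERS = 19e3

granule_mb = UNCOMPRESSED_BYTES / ROWS * INDEX_GRANULARITY / 1e6
ticker_mb = UNCOMPRESSED_BYTES / TICKERS / 1e6
granules_per_ticker = ticker_mb / granule_mb
total_granules = ROWS / INDEX_GRANULARITY

print(f"one granule         ~ {granule_mb:.2f} MB")
print(f"one ticker          ~ {ticker_mb:.1f} MB")
print(f"granules per ticker ~ {granules_per_ticker:.1f}")
print(f"granules per day    ~ {total_granules:,.0f}")
```

The last figure is also the approximate number of primary index entries for one day, since the index holds one entry per granule.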

Build and use the primary index
• The primary index has one entry per granule. The orange-marked column values are the
minimum values of each primary key column in each granule; they are the entries in the
table’s primary index. The primary index file is loaded completely into main memory (~6 MB
for the equity_tickdata table on 5/27, ~120 MB if partitioning by month)
• The primary index is used for selecting granules: SELECT * FROM equity_tickdata
WHERE id = 'AAPL' AND timestamp >= toDateTime64('2022-05-27
00:00:00.000000', 6) AND timestamp <= toDateTime64('2022-05-27
23:59:59.000000', 6)
➢ “id” is used for binary search algorithm
➢ “timestamp” is used for generic exclusion search algorithm to locate the relevant
granules
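
The binary search on the first key column can be illustrated with a toy sparse index. This is a sketch under simplifying assumptions, not ClickHouse's code: the index stores the first-row value of each granule, the granule size is 2 instead of 8192, and the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_sparse_index(sorted_keys: list, granularity: int) -> list:
    """One index entry per granule: the key of the granule's first row."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granularity)]


def candidate_granules(marks: list, target) -> list:
    """Granules that MAY contain `target`, found by binary search on marks.

    Granule i holds keys between marks[i] and marks[i + 1], so the last
    granule whose mark is strictly below the target must be kept as well.
    """
    first = max(0, bisect_left(marks, target) - 1)
    last = bisect_right(marks, target) - 1
    return list(range(first, last + 1))


# Toy data: sorted first key column ("id"), 2 rows per granule.
ids = ["AAPL", "AAPL", "AAPL", "GOOG", "MSFT", "MSFT", "MSFT", "TSLA"]
marks = build_sparse_index(ids, granularity=2)

print(marks)
print(candidate_granules(marks, "AAPL"))
print(candidate_granules(marks, "GOOG"))
```

The result is conservative: a returned granule may turn out not to contain the value, but no granule that contains it is ever skipped.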

Use mark files
The primary index file gives the logical location of the relevant granules; the mark files give their
physical location

• Locating via mark files happens for each column in parallel (hence the speed)
• Why not store that information directly in the primary index? The primary index file needs to fit into
main memory

Generic exclusion search algorithm

• The generic exclusion search algorithm is most effective when the predecessor key column has
low(er) cardinality
• On 5/27/2022, equity_tickdata has ~19K distinct IDs and ~136 mil. distinct timestamps,
#id ≪ #timestamp
• Details of this algorithm can be found at
https://clickhouse.com/docs/en/guides/improving-query-performance/sparse-primary-indexes/sparse-primary-indexes-multiple/#generic-exclusion-search-algorithm
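
The idea behind the exclusion search can be sketched as follows. This is a simplified reading of the algorithm, not the actual implementation: a granule is excluded only when its index mark and the next mark share the same first-key value, because only then is the second key known to lie between the two marks. The function name and the toy marks are hypothetical, and timestamps are replaced by small integers.

```python
def exclusion_search(marks: list, lo, hi) -> list:
    """Return granules that cannot be excluded for lo <= second_key <= hi.

    marks[i] = (first_key, second_key) of the first row of granule i.
    """
    keep = []
    for i, (first_key, second_key) in enumerate(marks):
        if i + 1 < len(marks):
            next_first, next_second = marks[i + 1]
            # Same first key on both marks: within this granule the second
            # key is sorted and bounded by [second_key, next_second].
            if first_key == next_first and (next_second < lo or second_key > hi):
                continue  # range cannot overlap, skip this granule
        keep.append(i)  # otherwise the granule must be read
    return keep


# Low-cardinality first key: consecutive marks often share the same id.
low_card = [("AAPL", 1), ("AAPL", 50), ("AAPL", 90), ("MSFT", 10), ("MSFT", 70)]
# High-cardinality first key: every mark has a different id.
high_card = [(1, 1), (2, 50), (3, 90)]

print(exclusion_search(low_card, 55, 60))
print(exclusion_search(low_card, 95, 99))
print(exclusion_search(high_card, 95, 99))
```

With the low-cardinality first key several granules are excluded; with the high-cardinality first key none can be, which matches the observation that the algorithm works best when the predecessor key column has low cardinality.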

Data skipping index: a secondary index to group and skip granules
Consider a secondary data skipping index on the URL column of the web browsing data with compound
primary key (UserID, URL)

• A secondary data skipping index on URL helps with excluding granules only if #UserID ≪
#URL
• Data skipping index should only be used after investigating other alternatives (projections,
materialized views, etc.)
• Data skipping index behavior is not obvious from thought experiments alone
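
To see why testing matters, consider a toy model of a min/max style skipping index, which stores one (min, max) pair per block of consecutive granules. This sketch is an illustration under simplifying assumptions: integers stand in for column values, the function names are hypothetical, and other skip index types behave differently.

```python
def build_minmax_skip_index(granules: list, granularity: int) -> list:
    """One (min, max) entry per block of `granularity` consecutive granules."""
    index = []
    for start in range(0, len(granules), granularity):
        values = [v for g in granules[start:start + granularity] for v in g]
        index.append((min(values), max(values)))
    return index


def granules_to_read(index: list, granularity: int, n_granules: int, target) -> list:
    """Granules left after skipping blocks whose range excludes `target`."""
    keep = []
    for block, (lo, hi) in enumerate(index):
        if lo <= target <= hi:
            start = block * granularity
            keep.extend(range(start, min(start + granularity, n_granules)))
    return keep


# Same values, two physical layouts (4 granules of 2 rows each).
clustered = [[1, 2], [3, 4], [5, 6], [7, 8]]   # values follow the sort order
scattered = [[1, 8], [2, 7], [3, 6], [4, 5]]   # values spread across granules

for name, layout in (("clustered", clustered), ("scattered", scattered)):
    idx = build_minmax_skip_index(layout, granularity=2)
    print(name, idx, granules_to_read(idx, 2, len(layout), target=6))
```

The same index definition skips half of the granules in one layout and none in the other, so its benefit depends on how the indexed column relates to the physical sort order.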

Test, Test, Test!

• Design of table schemas needs to be carefully considered for each business application.
• Use the ClickHouse command-line client to obtain detailed performance information for each design.

Additional Resources
• ClickHouse Academy - Free self-paced ClickHouse Training (requires login to track
progress): https://clickhouse.com/learn/
• Monthly ClickHouse release webinars: https://clickhouse.com/company/news-events
• Monthly newsletter: https://clickhouse.com/company/news-events
• YouTube channel - recent recordings from Monthly releases & meetups:
https://www.youtube.com/c/ClickHouseDB
• Blogs - many recent articles of technical nature: https://clickhouse.com/blog
• ClickHouse Roadmap 2023: https://github.com/ClickHouse/ClickHouse/issues/44767
