Skip to content

T64 column codec#5557

Merged
alesapin merged 9 commits intoClickHouse:masterfrom
4ertus2:t
Jun 17, 2019
Merged

T64 column codec#5557
alesapin merged 9 commits intoClickHouse:masterfrom
4ertus2:t

Conversation

@4ertus2
Copy link
Copy Markdown
Contributor

@4ertus2 4ertus2 commented Jun 7, 2019

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

For changelog. Remove if this is non-significant change.

Category (leave one):

  • New Feature

Short description (up to few sentences):
Add new column codec: T64. Made for (U)IntX/EnumX/Data(Time)/DecimalX columns. It should be good for columns with constant or small range values. Codec itself allows enlarge or shrink data type without re-compression.

Detailed description (optional):
T64 codec gets 64 source UIntX values and transpose them into N UInt64 values, where N <= X. Full bytes transposed by bytes. The most significant (not full) bytes are also transposed by bits. Then codec removes unneeded part of matrix. UIntX * 64 -> UInt64 * X -> UInt64 * N.
If column has unique value codec saves header only and generate expected count of values on extract.
Codec saves min and max values in header to detect not needed bits. It's possible to use it in future as min-max index #4143. Currently it needs one more scan for source data to find the values. It's possible to avoid this scan for merges if we pass them from the merging parts.
Codec also saves source datatype id in header. In fact it's not needed cause transposed data would be the same for any UIntX or IntX type (signed and unsigned data differs). But currently we cannot get extracted type id form caller. If we passthrough the type from the caller we would be able to enlarge or shrink columns' data type without re-compression (or throw an error if it's not possible).

@4ertus2
Copy link
Copy Markdown
Contributor Author

4ertus2 commented Jun 7, 2019

Test

CREATE TABLE coin.trades_t64
(
    collector_id UInt8 CODEC(T64,LZ4),
    market Enum8('BINANCE' = 1, 'BITMEX' = 2, 'BITTREX' = 3, 'BITFINEX' = 4, 'HUOBI' = 5, 'ZB' = 6, 'OKEX' = 7, 'HITBTC' = 8, 'LBANK' = 9, 'BIBOX' = 10, 'BITZ' = 11, 'KRAKEN' = 12, 'BITSTAMP' = 13, 'POLONIEX' = 14, 'GATEIO' = 15, 'YOBIT' = 16, 'CEXIO' = 17, 'COINEGG' = 18, 'LIVECOIN' = 19, 'KUCOIN' = 20, 'OKEX_FUTURES' = 21, 'OKEX_SPOT' = 22, 'EXX' = 23) CODEC(T64,LZ4),
    id UInt64,
    base String,
    counter String,
    datetime DateTime CODEC(T64,LZ4),
    type Enum8('ASK' = 0, 'BID' = 1) CODEC(T64,LZ4),
    price Float64,
    amount Float64
)
ENGINE = MergeTree PARTITION BY toYYYYMM(datetime) ORDER BY (market, base, counter);

coin.trades_zstd is the same with CODEC(ZSTD) instead of CODEC(T64,LZ4)

INSERT INTO coin.trades_t64 SELECT * FROM coin.trades 
0 rows in set. Elapsed: 143.370 sec. Processed 368.25 million rows, 20.52 GB (2.57 million rows/s., 143.10 MB/s.) 

INSERT INTO coin.trades_zstd SELECT * FROM coin.trades
0 rows in set. Elapsed: 149.620 sec. Processed 368.25 million rows, 20.52 GB (2.46 million rows/s., 137.12 MB/s.)

OPTIMIZE TABLE coin.trades_t64 FINAL
OPTIMIZE TABLE coin.trades_zstd FINAL
/data/coin$ du -d0 trades*
4710836 trades
4168044 trades_t64
4292324 trades_zstd
  1649331 collector_id.bin
833569732 datetime.bin
  1640879 market.bin
171342154 type.bin

  1106821 collector_id.bin
407187011 datetime.bin
   356093 market.bin
 43720343 type.bin

   259230 collector_id.bin
515097904 datetime.bin
   252951 market.bin
 62919410 type.bin

@4ertus2
Copy link
Copy Markdown
Contributor Author

4ertus2 commented Jun 10, 2019

Some interesting point. It's slight faster to use better compressed column as right one in JOIN then LZ4. And it's much slower to use it on the left side.

Original

SELECT 
    A.datetime, 
    B.datetime
FROM coin.trades AS A 
ANY LEFT JOIN coin.trades AS B USING (datetime)
WHERE A.datetime != B.datetime

0 rows in set. Elapsed: 18.082 sec. Processed 736.51 million rows, 2.95 GB (40.73 million rows/s., 162.93 MB/s.)

Right table compressed

SELECT 
    A.datetime, 
    B.datetime
FROM coin.trades AS A 
ANY LEFT JOIN coin.trades_t64 AS B USING (datetime)
WHERE A.datetime != B.datetime

0 rows in set. Elapsed: 17.665 sec. Processed 736.51 million rows, 2.95 GB (41.69 million rows/s., 166.77 MB/s.) 

SELECT 
    A.datetime, 
    B.datetime
FROM coin.trades AS A 
ANY LEFT JOIN coin.trades_zstd AS B USING (datetime)
WHERE A.datetime != B.datetime

0 rows in set. Elapsed: 17.881 sec. Processed 736.51 million rows, 2.95 GB (41.19 million rows/s., 164.76 MB/s.)

Left table compressed

SELECT 
    A.datetime, 
    B.datetime
FROM coin.trades_t64 AS A 
ANY LEFT JOIN coin.trades AS B USING (datetime)
WHERE A.datetime != B.datetime

0 rows in set. Elapsed: 19.831 sec. Processed 736.51 million rows, 2.95 GB (37.14 million rows/s., 148.56 MB/s.)

SELECT 
    A.datetime, 
    B.datetime
FROM coin.trades_zstd AS A 
ANY LEFT JOIN coin.trades AS B USING (datetime)
WHERE A.datetime != B.datetime

0 rows in set. Elapsed: 21.327 sec. Processed 736.51 million rows, 2.95 GB (34.53 million rows/s., 138.14 MB/s.) 

@4ertus2 4ertus2 changed the title T64 column codec (concept) [vip] T64 column codec (concept) Jun 11, 2019
@4ertus2 4ertus2 changed the title [vip] T64 column codec (concept) T64 column codec Jun 11, 2019
@4ertus2
Copy link
Copy Markdown
Contributor Author

4ertus2 commented Jun 13, 2019

Left table compressed (after optimisations)

SELECT 
    A.datetime, 
    B.datetime
FROM coin.trades_t64 AS A 
ANY LEFT JOIN coin.trades AS B USING (datetime)
WHERE A.datetime != B.datetime

Ok.

0 rows in set. Elapsed: 18.764 sec. Processed 736.51 million rows, 2.95 GB (39.25 million rows/s., 157.01 MB/s.

157/163 < 4% perf degradation T64+LZ4 vs LZ4.

@alesapin alesapin self-requested a review June 17, 2019 11:03
@alesapin alesapin merged commit 0a15727 into ClickHouse:master Jun 17, 2019
@4ertus2 4ertus2 added the pr-feature Pull request with new product feature label Jun 21, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants