Merged
The test table coin.trades_zstd is the same, but with CODEC(ZSTD) instead of CODEC(T64,LZ4).
An interesting point: using the better-compressed column as the right side of a JOIN is slightly faster than using LZ4, but using it on the left side is much slower.
[Benchmark screenshots: Original; Right table compressed; Left table compressed]
Left table compressed (after optimisations): 157/163, i.e. less than 4% performance degradation for T64+LZ4 vs LZ4.
alesapin
approved these changes
Jun 17, 2019
This was referenced Jun 24, 2019
I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
For changelog. Remove if this is non-significant change.
Category (leave one):
Short description (up to few sentences):
Add a new column codec: T64. It is made for (U)IntX/EnumX/Date(Time)/DecimalX columns and should work well for columns with constant or small-range values. The codec itself allows enlarging or shrinking the data type without re-compression.
Detailed description (optional):
The T64 codec takes 64 source UIntX values and transposes them into N UInt64 values, where N <= X. Full bytes are transposed byte by byte. The most significant (not full) bytes are additionally transposed bit by bit. The codec then removes the unneeded part of the matrix: UIntX * 64 -> UInt64 * X -> UInt64 * N.
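As a rough illustration of the transposition described above, here is a minimal Python sketch. It is not the code from this PR: it transposes every bit individually (the PR transposes full bytes byte-wise and only the top partial byte bit-wise), and the names `transpose_block`, `restore_block` and `bits_needed` are hypothetical. It only shows why 64 values that fit in N bits can be stored as N 64-bit words.

```python
from typing import List

BLOCK = 64  # the codec works on groups of 64 values


def transpose_block(values: List[int], bits_needed: int) -> List[int]:
    """Turn 64 unsigned values into `bits_needed` 64-bit words.

    Word i collects bit i of every value; bit j of that word comes from values[j].
    Simplification (assumption): everything is transposed bit by bit.
    """
    if len(values) != BLOCK:
        raise ValueError("expected exactly 64 values")
    planes = []
    for bit in range(bits_needed):
        word = 0
        for j, v in enumerate(values):
            word |= ((v >> bit) & 1) << j
        planes.append(word)
    return planes


def restore_block(planes: List[int]) -> List[int]:
    """Inverse of transpose_block: rebuild the 64 values from the bit planes."""
    values = [0] * BLOCK
    for bit, word in enumerate(planes):
        for j in range(BLOCK):
            values[j] |= ((word >> j) & 1) << bit
    return values


# 64 values in the range 0..9 need only 4 bits each,
# so they fit in 4 UInt64 words regardless of the declared column width.
sample = [i % 10 for i in range(BLOCK)]
packed = transpose_block(sample, bits_needed=4)
```

With a UInt32 column, the 64 sample values occupy 256 bytes before encoding and 4 * 8 = 32 bytes after it, before any general-purpose compression such as LZ4 is applied.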
If a column contains a single unique value, the codec saves only the header and generates the expected number of values on extraction.
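The constant-column shortcut could be sketched as follows. This is an assumption-laden illustration, not the PR's on-disk format: the `Encoded` tuple, `encode` and `decode` are hypothetical names, and the non-constant payload is left as a plain copy where the real codec would store transposed data.

```python
from typing import List, NamedTuple, Optional


class Encoded(NamedTuple):
    min_value: int                 # stored in the header
    max_value: int                 # stored in the header
    payload: Optional[List[int]]   # None when the header alone is enough


def encode(values: List[int]) -> Encoded:
    lo, hi = min(values), max(values)
    if lo == hi:
        # Every value is identical: keep the header only.
        return Encoded(lo, hi, None)
    # Placeholder (assumption): a real codec would store transposed words here.
    return Encoded(lo, hi, list(values))


def decode(enc: Encoded, expected_count: int) -> List[int]:
    if enc.payload is None:
        # Regenerate the constant value as many times as the caller expects.
        return [enc.min_value] * expected_count
    return list(enc.payload)


constant = encode([7] * 1000)
restored = decode(constant, expected_count=1000)
```

The value count is not part of the payload in this sketch; it is supplied by the caller on extraction, mirroring the "expected count" wording above.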
The codec saves the min and max values in the header to detect bits that are not needed. In the future this could also serve as a min-max index #4143. Currently it requires one more scan over the source data to find these values. This scan could be avoided for merges if the values were passed in from the parts being merged.
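A minimal sketch of that extra scan, assuming non-negative values and the simplest possible rule (drop the leading zero bits of the maximum). The PR may derive the bit count differently; `scan_min_max` and `bits_needed` are hypothetical helper names.

```python
from typing import Iterable, Tuple


def scan_min_max(values: Iterable[int]) -> Tuple[int, int]:
    """One pass over the source data to find the header's min and max."""
    it = iter(values)
    try:
        lo = hi = next(it)
    except StopIteration:
        raise ValueError("cannot scan an empty block") from None
    for v in it:
        if v < lo:
            lo = v
        elif v > hi:
            hi = v
    return lo, hi


def bits_needed(max_value: int) -> int:
    """Number of low bits required to represent every value up to max_value.

    Assumption: values are unsigned, so only leading zero bits are dropped.
    """
    if max_value < 0:
        raise ValueError("this sketch handles non-negative values only")
    return max_value.bit_length()


lo, hi = scan_min_max([3, 200, 17])
n = bits_needed(hi)
```

If the min and max were already known, for example carried over from the parts being merged, `scan_min_max` could be skipped and only `bits_needed` would remain.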
The codec also saves the source data type id in the header. Strictly speaking this is not needed, because the transposed data would be the same for any UIntX type, and likewise for any IntX type (signed and unsigned data do differ). But currently we cannot get the extracted type id from the caller. If we pass the type through from the caller, we would be able to enlarge or shrink a column's data type without re-compression (or throw an error if that is not possible).