
Add FSST as compression codec#62670

Closed
Pelanglene wants to merge 50 commits into ClickHouse:master from Pelanglene:fsst

Conversation

@Pelanglene

@Pelanglene Pelanglene commented Apr 15, 2024

Closes #34246

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Added FSST (Fast Static Symbol Table) as a new compression codec for string columns.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/


Modify your CI run:

NOTE: If you merge the PR with a modified CI, you MUST KNOW what you are doing
NOTE: Checked options will be applied if set before the CI RunConfig/PrepareRunConfig step

Include tests (required builds will be added automatically):

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Unit tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with Analyzer
  • Add your option here

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • Add your option here

Extra options:

  • do not test (only style check)
  • disable merge-commit (no merge from master before tests)
  • disable CI cache (job reuse)

Only specified batches in multi-batch jobs:

  • 1
  • 2
  • 3
  • 4

@CLAassistant

CLAassistant commented Apr 15, 2024

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.

✅ nikitamikhaylov
✅ rschu1ze
❌ Ulad


Ulad seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@nikitamikhaylov nikitamikhaylov added the "can be tested" label (Allows running workflows for external contributors) Apr 16, 2024
@vdimir vdimir changed the title from "Fsst" to "Add FSST (Fast Static Symbol Table) Compression Codec" Apr 17, 2024
@clickhouse-ci

clickhouse-ci bot commented Apr 17, 2024

This is an automatic comment. The PR description does not match the template.

Please edit it accordingly.

The error is: More than one changelog category specified: 'New Feature', 'Improvement'

@vdimir
Member

vdimir commented Apr 17, 2024

Ref #34246

@robot-ch-test-poll3 robot-ch-test-poll3 added the "pr-improvement" (Pull request with some product improvements) and "submodule changed" (At least one submodule changed in this PR) labels Apr 25, 2024
@robot-ch-test-poll3
Contributor

robot-ch-test-poll3 commented Apr 25, 2024

This is an automated comment for commit e6a7325 with a description of existing statuses. It is updated for the latest CI run.

❌ Click here to open a full report in a separate page

Check name | Description | Status
A Sync | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ⏳ pending
CI running | A meta-check that indicates the running CI. Normally, it's in success or pending state. The failed status indicates some problems with the PR | ❌ failure
Mergeable Check | Checks if all other necessary checks are successful | ❌ failure
Style check | Runs a set of checks to keep the code style clean. If some tests failed, see the related log from the report | ❌ failure

Successful checks:

Check name | Description | Status
Docs check | Builds and tests the documentation | ✅ success
PR Check | There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS | ✅ success

@alexey-milovidov
Member

Option 1: Expand the ICompressionCodec interface: add an optional Filter parameter to the decompression method, a structure containing, for example, required_substrings. The codec pretends to decompress all data but is allowed to return unneeded strings as empty. The filter is created in the query interpreter (more precisely, during query analysis) and then carried through all data reading interfaces down to decompression.

Option 2: If a String (possibly Nullable, Array of String) uses the FSST codec, during deserialization, return a ColumnFSST instead of ColumnString. This is a new type of column that contains uncompressed bytes from one or more granules for subsequent lazy decompression. Only a few of its methods are implemented (filter, cut), whereas in almost all other cases, it must first be materialized into a full-fledged column.

Without one of these acceleration paths, the codec does not make much sense: in raw performance and compression rate it is strictly worse than LZ4 or ZSTD.
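A minimal sketch of the semantics Option 1 describes, using toy types rather than ClickHouse's real ICompressionCodec interface (DecompressFilter, decompressWithFilter, and the plain-string "storage" are all illustrative assumptions):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy illustration of Option 1's semantics: an optional filter passed into
// decompression that lets the codec return unneeded rows as empty strings.
struct DecompressFilter
{
    std::vector<std::string> required_substrings; // a row must contain one of these
};

// Pretend "decompression": rows are stored in plain form here. A real codec
// would decode only the rows that pass the filter and emit "" for the rest.
std::vector<std::string> decompressWithFilter(
    const std::vector<std::string> & stored_rows,
    const DecompressFilter * filter)
{
    std::vector<std::string> result;
    result.reserve(stored_rows.size());
    for (const auto & row : stored_rows)
    {
        bool needed = (filter == nullptr); // no filter: everything is needed
        if (filter)
            for (const auto & s : filter->required_substrings)
                if (row.find(s) != std::string::npos)
                {
                    needed = true;
                    break;
                }
        result.push_back(needed ? row : std::string{}); // unneeded rows come back empty
    }
    return result;
}
```

The key property is that the output keeps its original row count, so downstream code sees a normally shaped column while the codec skips work for filtered-out rows.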

@rschu1ze rschu1ze changed the title from "Add FSST (Fast Static Symbol Table) Compression Codec" to "Add FSST Compression Codec" May 8, 2024
@rschu1ze rschu1ze changed the title from "Add FSST Compression Codec" to "Add FSST as compression codec" May 8, 2024

void registerCodecFSST(CompressionCodecFactory & factory)
{
auto codec_builder = [&](const ASTPtr & arguments) -> CompressionCodecPtr
Member

I guess we should restrict FSST to String and FixedString columns here. Check the registerCodec method in src/Compression/CompressionCodecGorilla.cpp (which is a floating-point-only codec) to see how to do that.

Also, we should check that no further parameters were passed into CODEC FSST() when defined in SQL. Please see the registerCodec method in src/Compression/CompressionCodecGCD.cpp.

Both cases also need a negative SQL test, i.e. a test case in 02973_fsst_code_test_data.sql that expects an exception:

CREATE TABLE table_fsst_codec (n String CODEC(FSST('an_additional_argument'))) ENGINE = Memory; -- { serverError ILLEGAL_SYNTAX_FOR_CODEC_TYPE }


void updateHash(SipHash & hash) const override { getCodecDesc()->updateTreeHash(hash, /*ignore_aliases=*/true); }

static const int OUT_SIZE = 2281337;
Member

Magic constant; please add a comment next to it explaining how it was calculated.


namespace DB
{

Member

For my understanding: Why are we using the original FSST version and not the FSST12 variant?

Member

+ /// Implements FSST compression based on https://github.com/cwida/fsst

{
public:
explicit CompressionCodecFSST() { setCodecDescription("FSST"); }

Member

Could we mark the codec experimental for now? (ICompressionCodec::isExperimental)

@@ -0,0 +1,19 @@
option(ENABLE_FSST "Enable FSST (Fast Static Symbol Table)" ${ENABLE_LIBRARIES})

if (NOT ENABLE_FSST)
Member

Note to myself: the AVX512 detection at the beginning of fsst_avx512.cpp looks x86-specific. Test what happens on ARM, possibly exclude fsst_avx512.cpp from compilation on non-x86 platforms.

Member

@rschu1ze rschu1ze left a comment

I am curious how the compression ratio and performance compare to ZSTD?

You could load any of these datasets https://clickhouse.com/docs/en/getting-started/example-datasets and re-compress the string columns with different codecs, then do full-column scans over them so they are decompressed.

endif ()

target_link_libraries (clickhouse_common_io PRIVATE ch_contrib::lz4)
# target_link_libraries (clickhouse_common_io PRIVATE ch_contrib::fsst)
Member

Let's remove this.

)

if (TARGET ch_contrib::fsst)
dbms_target_link_libraries(PRIVATE ch_contrib::fsst)
Member

Please check whether things still build if we only link clickhouse_compression, i.e. omit l. 422.


namespace DB
{

Member

+ /// Implements FSST compression based on https://github.com/cwida/fsst


void updateHash(SipHash & hash) const override { getCodecDesc()->updateTreeHash(hash, /*ignore_aliases=*/true); }

static constexpr int out_size = 2281337;
Member

Needs a comment which explains how the constant was selected.


size_t len_out[rows_count];
unsigned char * str_out[rows_count];
size_t header_size{fsst_header_size + sizeof(rows_count) + sizeof(len_out) + (sizeof(size_t) * len_in.size())};
Member

Minor: Let's use standard = initialization instead of uniform initialization. The former is how it is done in the rest of the codebase. Please change all such places in this file.

}
}

UInt32 getMaxCompressedDataSize(UInt32 uncompressed_size) const override { return uncompressed_size + FSST_MAXHEADER; }
Member

Let's move l. 110-113 up to l. 31 so all "metadata" functions are in one place.


while (data != end)
{
UInt64 cur_len;
Member

I am pleasantly surprised that ClickHouse serializes String columns as consecutive strings instead of as two arrays (strings + lengths). That, of course, simplifies the job of splitDataByRows.
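The serialization format described here can be sketched as follows; readVarUInt is a minimal LEB128-style decoder written for illustration (not the real helper from src/IO/VarInt.h), and splitDataByRows is an illustrative reimplementation, not the PR's code:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Strings are serialized back to back as (varint length, bytes) records,
// so splitting the buffer into rows is a single forward scan.
const char * readVarUInt(const char * pos, uint64_t & value)
{
    value = 0;
    for (int shift = 0; ; shift += 7)
    {
        uint8_t byte = static_cast<uint8_t>(*pos++);
        value |= static_cast<uint64_t>(byte & 0x7F) << shift; // 7 payload bits per byte
        if (!(byte & 0x80))                                   // high bit clear: last byte
            return pos;
    }
}

std::vector<std::string> splitDataByRows(const char * data, const char * end)
{
    std::vector<std::string> rows;
    while (data != end)
    {
        uint64_t cur_len;
        data = readVarUInt(data, cur_len); // length prefix
        rows.emplace_back(data, cur_len);  // string payload
        data += cur_len;
    }
    return rows;
}
```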


void doDecompressData(const char * source, UInt32 source_size, char * dest, UInt32 uncompressed_size) const override
{
UNUSED(uncompressed_size, source_size);
Member

Let's instead write in l. 78: UInt32 /*uncompressed_size*/)

ASSERT_THROW(codec->decompress(source, source_size, memory.data()), Exception);
}

TEST(FSSTTest, CompressDecompress)
Member

C++ unit tests are kind of underused in ClickHouse. Imho, all of what this test does can be achieved with a SQL-based test. The advantage of the latter is that it will run with all kinds of sanitizers in CI.

I suggest removing this test and extending the SQL test instead.


CREATE TABLE table_fsst_codec (n String CODEC(FSST)) ENGINE = Memory;
INSERT INTO table_fsst_codec VALUES ('Hello'), ('world'), ('!');
SELECT * FROM table_fsst_codec;
Member

+ DROP TABLE

{
dest = writeVarUInt(len_in[i], dest);

auto decompressed_size = fsst_decompress(
Member

The docs in fsst.h say "If > size, the decoded output is truncated to size." Can this happen? Should we check?
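One way to act on this: since fsst_decompress reports the full decoded size even when the output was truncated, the caller can compare that size against the buffer capacity. A hypothetical wrapper illustrating the check (checkedDecode is not part of fsst.h or this PR):

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Turn silent truncation into an error. `decoded_size` stands in for the
// return value of fsst_decompress; `buffer_size` is the capacity that was
// passed as its `size` argument.
size_t checkedDecode(size_t decoded_size, size_t buffer_size)
{
    if (decoded_size > buffer_size)
        throw std::runtime_error("FSST output truncated: destination buffer too small");
    return decoded_size;
}
```

In this codec truncation should be impossible by construction, because the exact uncompressed length of every row is stored in the header, but an explicit check guards against corrupted input.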

@trololo23

trololo23 commented May 10, 2024

I am curious how the compression ratio and performance compare to ZSTD?

You could load any of these datasets https://clickhouse.com/docs/en/getting-started/example-datasets and re-compress the string columns with different codecs, then do full-column scans over them so they are decompressed.

Hello! Here are the compression ratios on various datasets:

Trips - pickup_ntaname and dropoff_ntaname columns: FSST, STRING(LZ4), LowCardinality, ZSTD [results shown as screenshots in the original comment]

Amazon reviews - review_body column: FSST, STRING(LZ4), ZSTD [screenshots]

Reddit comments - body column: FSST, STRING(LZ4), ZSTD [screenshots]

To summarize, FSST is significantly inferior to ZSTD in terms of compression ratio, but in some cases (mainly on large rows) it surpasses LZ4. We will add performance measurement results soon.

@rschu1ze
Member

Interesting, thanks. The comparison with LZ4 in terms of compression rate confirms the findings in the FSST paper. And of course, Zstd compresses really well...

I guess the true advantage of FSST as a light-weight compression codec compared to the existing heavy-weight codecs in ClickHouse is its ability for random access. Unfortunately, ICompressionCodec::doDecompress decompresses the entire block at once, so the existing API can't leverage random access.

@Pelanglene I am not sure how much more time you can/want to spend on this project, but there are two interesting directions to explore (suggested by @alexey-milovidov):

Option 1: Expand the ICompressionCodec interface for random access.

There are two sub-cases:

  • get: decompress selected rows instead of the whole thing. This is useful for projection when most rows were removed by filters already. API-wise, one needs to pass in a bit vector into doDecompress (or an overload of doDecompress) to indicate the rows to decompress. FSST is well suited for that use case.
  • filter: i.e., filter pushdown to the codec level. Some codecs (including FSST) allow evaluating predicates on compressed data, i.e. without decompression. As an example, the simplest case would be an equality filter on dictionary/domain-compressed (LowCardinality in ClickHouse) columns, where one would translate x = 'abc' into the value id of x and then scan the (compressed) value id vector for matches. The question is how to represent the filter ... I guess an arbitrary operator plus a list of arbitrary operand(s) will do the trick. This is as expressive as anything the user can write in SQL. To take advantage of this, one also needs to create the filter in the query interpreter (more precisely, in query analysis) and carry it through all data reading interfaces down to the codec level.
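The dictionary-encoded equality filter described in the bullet above can be sketched like this (DictColumn and equalityFilter are illustrative names, not ClickHouse types):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Sketch of the "filter" sub-case on a dictionary-encoded (LowCardinality-like)
// column: the predicate x = 'abc' is translated once into a value id, then the
// encoded id vector is scanned directly, without decompressing any strings.
struct DictColumn
{
    std::vector<std::string> dictionary; // distinct values
    std::vector<uint32_t> ids;           // one dictionary index per row
};

std::vector<uint8_t> equalityFilter(const DictColumn & col, const std::string & needle)
{
    // Translate the literal into its value id (absent => nothing can match).
    uint32_t needle_id = 0;
    bool found = false;
    for (uint32_t i = 0; i < col.dictionary.size(); ++i)
        if (col.dictionary[i] == needle)
        {
            needle_id = i;
            found = true;
            break;
        }

    std::vector<uint8_t> mask(col.ids.size(), 0);
    if (!found)
        return mask;
    for (size_t row = 0; row < col.ids.size(); ++row)
        mask[row] = (col.ids[row] == needle_id); // integer comparison only
    return mask;
}
```

The string comparison happens exactly once, against the dictionary; the per-row work is a cheap integer scan over the compressed representation.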

Option 2: If a String (possibly Nullable, Array(String)) uses the FSST codec, then during deserialization, return a ColumnFSST instead of ColumnString. This is a new type of column that contains uncompressed bytes from one or more granules for subsequent lazy decompression. Only a few of its methods are implemented (filter, cut), whereas in almost all other cases, it must first be materialized into a full-fledged column.
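A toy sketch of the shape Option 2 describes (LazyStringColumn is illustrative and not ClickHouse's IColumn; plain strings stand in for FSST-compressed bytes plus the symbol table):

```cpp
#include <cassert>
#include <string>
#include <vector>

// A column that keeps rows in their stored form and implements only cheap
// row-boundary operations (filter, cut) lazily; everything else requires
// materialization into a full-fledged string column first.
class LazyStringColumn
{
public:
    explicit LazyStringColumn(std::vector<std::string> stored) : rows(std::move(stored)) {}

    // Cheap: operates on row boundaries without decoding row contents.
    LazyStringColumn filter(const std::vector<uint8_t> & mask) const
    {
        std::vector<std::string> kept;
        for (size_t i = 0; i < rows.size(); ++i)
            if (mask[i])
                kept.push_back(rows[i]);
        return LazyStringColumn(std::move(kept));
    }

    LazyStringColumn cut(size_t start, size_t length) const
    {
        return LazyStringColumn({rows.begin() + start, rows.begin() + start + length});
    }

    // Expensive: decode into a plain string column. A real ColumnFSST would
    // run fsst_decompress here.
    std::vector<std::string> materialize() const { return rows; }

private:
    std::vector<std::string> rows; // FSST-compressed bytes in a real implementation
};
```

The payoff is that filters applied early in the pipeline shrink the column while it is still compressed, so only surviving rows ever pay the decompression cost.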

@trololo23

trololo23 commented May 21, 2024

Interesting, thanks. The comparison with LZ4 in terms of compression rate confirms the findings in the FSST paper. And of course, Zstd compresses really well...

I guess the true advantage of FSST as a light-weight compression codec compared to the existing heavy-weight codecs in ClickHouse is its ability for random access. Unfortunately, ICompressionCodec::doDecompress decompresses the entire block at once, so the existing API can't leverage random access.

@Pelanglene I am not sure how much more time you can/want to spend on this project, but there are two interesting directions to explore (suggested by @alexey-milovidov):

Option 1: Expand the ICompressionCodec interface for random access.

There are two sub-cases:

  • get: decompress selected rows instead of the whole thing. This is useful for projection when most rows were removed by filters already. API-wise, one needs to pass in a bit vector into doDecompress (or an overload of doDecompress) to indicate the rows to decompress. FSST is well suited for that use case.
  • filter: i.e., filter pushdown to the codec level. Some codecs (including FSST) allow evaluating predicates on compressed data, i.e. without decompression. As an example, the simplest case would be an equality filter on dictionary/domain-compressed (LowCardinality in ClickHouse) columns, where one would translate x = 'abc' into the value id of x and then scan the (compressed) value id vector for matches. The question is how to represent the filter ... I guess an arbitrary operator plus a list of arbitrary operand(s) will do the trick. This is as expressive as anything the user can write in SQL. To take advantage of this, one also needs to create the filter in the query interpreter (more precisely, in query analysis) and carry it through all data reading interfaces down to the codec level.

Option 2: If a String (possibly Nullable, Array(String)) uses the FSST codec, then during deserialization, return a ColumnFSST instead of ColumnString. This is a new type of column that contains uncompressed bytes from one or more granules for subsequent lazy decompression. Only a few of its methods are implemented (filter, cut), whereas in almost all other cases, it must first be materialized into a full-fledged column.

At the moment, I'm trying to implement option 2. The first option seemed almost impossible to me because of the huge number of abstraction layers the filter would need to be threaded through: streams, compressed buffers, various select processor components, and so on. My implementation of the second approach resembles CompressedColumn, which is based on a similar idea.

@clickhouse-gh
Contributor

clickhouse-gh bot commented Jul 2, 2024

Dear @rschu1ze, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

@rschu1ze
Member

rschu1ze commented Jul 2, 2024

There are unfortunately too many unknown unknowns, e.g. the performance (compression/decompression) and size impact of ColumnFSST, and the implementation is not finished. I still like the direction (FSSTs as internal representation for String/FixedString columns, as well as predicate pushdown to the codec level) but this needs more experimentation. I guess we should give it another try next year.

@rschu1ze rschu1ze closed this Jul 2, 2024
@alexey-milovidov
Member

@rschu1ze, the next year is today.


Labels

can be tested (Allows running workflows for external contributors) · pr-improvement (Pull request with some product improvements) · submodule changed (At least one submodule changed in this PR)


Development

Successfully merging this pull request may close these issues.

Apply FSST compression for strings

8 participants