Refactoring of data types serialization by CurtizJ · Pull Request #21562 · ClickHouse/ClickHouse

CurtizJ · 2021-03-09T17:03:51Z

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

Not for changelog (changelog entry is not required)

Refactoring of data types serialization. It allows choosing different serializations for a single type dynamically.
Part of #19953.

CurtizJ · 2021-03-09T17:03:59Z

However, for now only the default serialization is available (besides serialization of subcolumns), and the full interface for getting serializations is not implemented in this PR, since it would be useless at this point. It will look like this:

SerializationPtr IDataType::getSerialization(const String & column_name, const SerializationInfo & info) const
{
    ISerialization::Settings settings =
    {
        .num_rows = info.getNumberOfRows(),
        .num_non_default_rows = info.getNumberOfNonDefaultValues(column_name),
        .min_ratio_for_dense_serialization = 10
    };

    return getSerialization(settings);
}

SerializationPtr IDataType::getSerialization(const IColumn & column) const
{
    ISerialization::Settings settings =
    {
        .num_rows = column.size(),
        .num_non_default_rows = column.getNumberOfNonDefaultValues(),
        .min_ratio_for_dense_serialization = 10
    };

    return getSerialization(settings);
}

SerializationPtr IDataType::getSerialization(const ISerialization::Settings & settings) const
{
    if (settings.num_non_default_rows * settings.min_ratio_for_dense_serialization < settings.num_rows)
        return getSparseSerialization();

    return getDefaultSerialization();
}

// static
SerializationPtr IDataType::getSerialization(const NameAndTypePair & column, const IDataType::StreamExistenceCallback & callback)
{
    if (column.isSubcolumn())
    {
        auto base_serialization_getter = [&](const IDataType & subcolumn_type)
        {
            return subcolumn_type.getSerialization(column.name, callback);
        };

        auto type_in_storage = column.getTypeInStorage();
        return type_in_storage->getSubcolumnSerialization(column.getSubcolumnName(), base_serialization_getter);
    }

    return column.type->getSerialization(column.name, callback);
}

SerializationPtr IDataType::getSerialization(const String & column_name, const StreamExistenceCallback & callback) const
{
    auto sparse_idx_name = escapeForFileName(column_name) + ".sparse.idx";
    if (callback(sparse_idx_name))
        return getSparseSerialization();

    return getDefaultSerialization();
}

There are two new entities: SerializationInfo and SerializationSettings (ISerialization::Settings in the sketch above). The first will contain any useful information about the content of columns and will be stored in data parts. The second can be created for a single column, either from SerializationInfo or from the column itself.
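The two selection paths in the sketch above (by row statistics when writing, by stream existence when reading) can be tried in isolation. The following is a minimal stand-alone illustration, not the ClickHouse code: `SerializationKind`, `chooseByStatistics`, and `chooseByStreams` are hypothetical names, the `Settings` struct only mirrors the fields used in the sketch, and the `escapeForFileName` step is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical stand-ins for illustration; not the real ClickHouse classes.
enum class SerializationKind { Default, Sparse };

// Mirrors the fields of ISerialization::Settings used in the sketch above.
struct Settings
{
    size_t num_rows = 0;
    size_t num_non_default_rows = 0;
    size_t min_ratio_for_dense_serialization = 10;
};

// Writing path: pick sparse when non-default rows are rare enough,
// i.e. num_non_default_rows * ratio < num_rows.
SerializationKind chooseByStatistics(const Settings & settings)
{
    assert(settings.num_non_default_rows <= settings.num_rows);

    if (settings.num_non_default_rows * settings.min_ratio_for_dense_serialization < settings.num_rows)
        return SerializationKind::Sparse;

    return SerializationKind::Default;
}

using StreamExistenceCallback = std::function<bool(const std::string &)>;

// Reading path: pick sparse when the "<column>.sparse.idx" stream exists.
// The original sketch escapes the column name first; that step is omitted here.
SerializationKind chooseByStreams(const std::string & column_name, const StreamExistenceCallback & callback)
{
    if (callback(column_name + ".sparse.idx"))
        return SerializationKind::Sparse;

    return SerializationKind::Default;
}
```

With the ratio of 10 used in the sketch, a column with 50 non-default values out of 1000 rows would be sparse (500 < 1000), while one with 100 non-default values would stay dense, because the comparison is strict.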

CurtizJ · 2021-03-09T17:07:47Z

This PR has a huge diff, but most of it is just moving serializations to separate files. The git history for most serializations (the largest part of the data type code) is preserved, because the code was moved between files in 1e61f64. However, GitHub is unable to show this properly in its interface.

Example of blame:

alesapin

Of course, I haven't fully read this PR, but the idea is clear and these changes have been needed for a long time.

Maybe it would be better to use CRTP for data types. In that case, we can save the pointer to ISerialization in the child object (in its constructor) and return it from a method in IDataType using something like static_cast<const Child *>(this)->getSerializationPointer(). This would let us avoid creating shared pointers to ISerialization each time, or saving them in user code and manually keeping the type and its serialization consistent (example in MergeTreeDataWriterWide/OnDisk). But this is just a suggestion; the current implementation is also OK, and because of its size we should merge this PR as fast as possible.
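One possible reading of the CRTP suggestion is sketched below. It is an illustration under assumptions, not code from this PR: the simplified `ISerialization` and `IDataType` interfaces, the `DataTypeWithSerialization` helper, and the `SerializationUInt64`/`DataTypeUInt64` pair are all hypothetical stand-ins. Only the name `getSerializationPointer` comes from the comment above.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Simplified, hypothetical interfaces; the real ClickHouse classes are much larger.
struct ISerialization
{
    virtual ~ISerialization() = default;
    virtual std::string name() const = 0;
};
using SerializationPtr = std::shared_ptr<const ISerialization>;

struct IDataType
{
    virtual ~IDataType() = default;
    virtual SerializationPtr getDefaultSerialization() const = 0;
};

// CRTP layer: forwards to the pointer that the child stored in its constructor,
// so no new serialization object is created on each call.
template <typename Child>
struct DataTypeWithSerialization : IDataType
{
    SerializationPtr getDefaultSerialization() const override
    {
        return static_cast<const Child *>(this)->getSerializationPointer();
    }
};

struct SerializationUInt64 : ISerialization
{
    std::string name() const override { return "UInt64"; }
};

struct DataTypeUInt64 : DataTypeWithSerialization<DataTypeUInt64>
{
    // The serialization is created once, together with the type.
    DataTypeUInt64() : serialization(std::make_shared<SerializationUInt64>()) {}

    const SerializationPtr & getSerializationPointer() const { return serialization; }

private:
    SerializationPtr serialization;
};
```

In this sketch every call through the `IDataType` interface returns a copy of the same shared pointer, which is the property the suggestion is after: the type and its serialization stay consistent without the caller having to cache anything.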

CurtizJ · 2021-03-26T17:07:03Z

@alesapin
Maybe merge this PR now and implement CRTP for data types later? However, once dynamic serializations are ready, CRTP will not remove the need to save precomputed pointers to serializations in some cases, because choosing a serialization and creating a new object will add overhead anyway.
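The point about precomputed pointers can be illustrated with a small cache on the caller side. This is only a sketch under assumptions: `Serialization`, `ColumnWriter`, `serializationFor`, and `selectionCount` are hypothetical names, and the selection rule reuses the ratio of 10 from the earlier sketch as a stand-in for whatever a dynamic selection would actually do.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for a dynamically chosen serialization.
struct Serialization
{
    bool sparse;
};
using SerializationPtr = std::shared_ptr<const Serialization>;

// Hypothetical writer that selects a serialization once per column and then reuses it.
class ColumnWriter
{
public:
    const SerializationPtr & serializationFor(const std::string & column, size_t num_rows, size_t num_non_default_rows)
    {
        auto it = cache.find(column);
        if (it == cache.end())
        {
            // Selection and object creation happen only on the first request for a column.
            ++selections;
            bool sparse = num_non_default_rows * 10 < num_rows;
            it = cache.emplace(column, std::make_shared<Serialization>(Serialization{sparse})).first;
        }
        return it->second;
    }

    size_t selectionCount() const { return selections; }

private:
    std::map<std::string, SerializationPtr> cache;
    size_t selections = 0;
};
```

Later requests for the same column return the stored pointer and ignore the new statistics, which is the trade-off of caching: the selection cost is paid once, but the choice is fixed for the lifetime of the writer.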

CurtizJ · 2021-03-29T13:36:17Z

Failed tests are the same as in master.

CurtizJ added 4 commits March 9, 2021 17:04

move data types to serializations

1e61f64

return back data types

be540e4

refactoring of serializations

bc417cf

refactoring of serializations

df6663d

robot-clickhouse added the pr-not-for-changelog This PR should not be mentioned in the changelog label Mar 9, 2021

CurtizJ marked this pull request as draft March 9, 2021 17:08

CurtizJ changed the title ~~Serialization refactoring 4~~ Refactoring of data types serialization Mar 9, 2021

CurtizJ mentioned this pull request Mar 9, 2021

Data types serializations refactoring [Not for merge] #21475

Closed

CurtizJ force-pushed the serialization-refactoring-4 branch 2 times, most recently from f7c7c5a to df6663d Compare March 12, 2021 22:06

CurtizJ added 4 commits March 13, 2021 01:41

Merge remote-tracking branch 'upstream/master' into HEAD

ed42437

fix unwanted changes

1b07d28

fix arcadia build

6800e53

slightly better performance

81ac638

CurtizJ marked this pull request as ready for review March 13, 2021 18:25

remove debug output

7304ef8

alesapin approved these changes Mar 15, 2021

View reviewed changes

Merge remote-tracking branch 'upstream/master' into HEAD

173d2ea

alexey-milovidov mentioned this pull request Mar 16, 2021

Roadmap 2021 (discussion) #17623

Closed

CurtizJ added 4 commits March 16, 2021 16:50

Merge remote-tracking branch 'upstream/master' into HEAD

6247cd5

Merge remote-tracking branch 'upstream/master' into HEAD

6a15431

fix build

da2d0d2

fux build

421d8eb

CurtizJ merged commit ea82e77 into ClickHouse:master Mar 29, 2021

alesapin self-assigned this Mar 29, 2021

vitlibar mentioned this pull request Jun 21, 2021

Cherry pick #25015 to 21.3: Remove a chunk of wrong code and look what will happen #25189

Closed

This was referenced Jun 21, 2021

Cherry pick #24464 to 21.3: Fix tuples in 'CREATE .. AS SELECT' queries #24608

Closed

Cherry pick #25015 to 20.8: Remove a chunk of wrong code and look what will happen #25190

Closed

nvartolomei mentioned this pull request Jul 20, 2021

Backwards compatibility broken when nesting Nullable inside LowCardinality data type #26556

Closed

CurtizJ merged 14 commits into ClickHouse:master from CurtizJ:serialization-refactoring-4

3 participants
