Skip to content

Refactoring of data types serialization#21562

Merged
CurtizJ merged 14 commits intoClickHouse:masterfrom
CurtizJ:serialization-refactoring-4
Mar 29, 2021
Merged

Refactoring of data types serialization#21562
CurtizJ merged 14 commits intoClickHouse:masterfrom
CurtizJ:serialization-refactoring-4

Conversation

@CurtizJ
Copy link
Copy Markdown
Member

@CurtizJ CurtizJ commented Mar 9, 2021

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Refactoring of data types serialization. It allows to choose different serializations for single type dynamically.
Part of #19953.

@robot-clickhouse robot-clickhouse added the pr-not-for-changelog This PR should not be mentioned in the changelog label Mar 9, 2021
@CurtizJ
Copy link
Copy Markdown
Member Author

CurtizJ commented Mar 9, 2021

However, now only default serialization is available (besides serialization of subcolumns) and not full interface of getting serializations is implemented in this PR, since it's worthless now. But it will look like this:

SerializationPtr IDataType::getSerialization(const String & column_name, const SerializationInfo & info) const
{
    ISerialization::Settings settings =
    {
        .num_rows = info.getNumberOfRows(),
        .num_non_default_rows = info.getNumberOfNonDefaultValues(column_name),
        .min_ratio_for_dense_serialization = 10
    };

    return getSerialization(settings);
}

SerializationPtr IDataType::getSerialization(const IColumn & column) const
{
    ISerialization::Settings settings =
    {
        .num_rows = column.size(),
        .num_non_default_rows = column.getNumberOfNonDefaultValues(),
        .min_ratio_for_dense_serialization = 10
    };

    return getSerialization(settings);
}

SerializationPtr IDataType::getSerialization(const ISerialization::Settings & settings) const
{
    if (settings.num_non_default_rows * settings.min_ratio_for_dense_serialization < settings.num_rows)
         return getSparseSerialization();

    return getDefaultSerialization();
}

// static
SerializationPtr IDataType::getSerialization(const NameAndTypePair & column, const IDataType::StreamExistenceCallback & callback)
{
    if (column.isSubcolumn())
    {
        auto base_serialization_getter = [&](const IDataType & subcolumn_type)
        {
            return subcolumn_type.getSerialization(column.name, callback);
        };

        auto type_in_storage = column.getTypeInStorage();
        return type_in_storage->getSubcolumnSerialization(column.getSubcolumnName(), base_serialization_getter);
    }

    return column.type->getSerialization(column.name, callback);
}

SerializationPtr IDataType::getSerialization(const String & column_name, const StreamExistenceCallback & callback) const
{
    auto sparse_idx_name = escapeForFileName(column_name) + ".sparse.idx";
    if (callback(sparse_idx_name))
        return getSparseSerialization();

    return getDefaultSerialization();
}

There are two new entities: SerializationInfo and SerializationSettings. The first one will contain any useful information about content of columns and will be stored in data parts. The second one can be created for one column from SerializationInfo or from the column itself.

@CurtizJ
Copy link
Copy Markdown
Member Author

CurtizJ commented Mar 9, 2021

This PR has a huge diff, but mostly it's just moving serializations to separate files. However, git history for most serializations (largest part of code of data types) is kept, because there was a move beetween files in 1e61f64. But github is unable to show it properly in the interface.

Example of blame:
image

@CurtizJ CurtizJ marked this pull request as draft March 9, 2021 17:08
@CurtizJ CurtizJ changed the title Serialization refactoring 4 Refactoring of data types serialization Mar 9, 2021
@CurtizJ CurtizJ force-pushed the serialization-refactoring-4 branch 2 times, most recently from f7c7c5a to df6663d Compare March 12, 2021 22:06
@CurtizJ CurtizJ marked this pull request as ready for review March 13, 2021 18:25
Copy link
Copy Markdown
Member

@alesapin alesapin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, I haven't fully read this PR, but the idea is clear and these changes have been asking for a long time.

Maybe it would be better if we use CRTP for data types. In this case, we can save the pointer to ISerialization into the child object (in constructor) and return it from some method in IDataType using something like this static_cast<Child>(this)->getSerializationPointer(). It will allow us to avoid creating shared pointers to ISerialization each time or save them in the user code and manually control consistency between type and serialization (example in MergeTreeDataWriterWide/OnDisk). But this is just a suggestion, the current implementation is also Ok and because of size, we should merge this PR as fast as possible.

@CurtizJ
Copy link
Copy Markdown
Member Author

CurtizJ commented Mar 26, 2021

@alesapin
Maybe merge this PR now and implement CRTP for data types later? However, it will not avoid of saving precomputied pointers to serializations in some cases, when dynamic serializations will be ready, because they anyway will give overhead for choosing serialization and creating a new object.

@CurtizJ
Copy link
Copy Markdown
Member Author

CurtizJ commented Mar 29, 2021

Failed tests are the same as in master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-not-for-changelog This PR should not be mentioned in the changelog

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants