Refactoring of data types serialization #21562
However, for now only the default serialization is available (besides serialization of subcolumns), and the full interface for getting serializations is not implemented in this PR, since it would be useless at this point. It will look like this:

```cpp
// Choose a serialization from statistics collected for a column.
SerializationPtr IDataType::getSerialization(const String & column_name, const SerializationInfo & info) const
{
    ISerialization::Settings settings =
    {
        .num_rows = info.getNumberOfRows(),
        .num_non_default_rows = info.getNumberOfNonDefaultValues(column_name),
        .min_ratio_for_dense_serialization = 10
    };

    return getSerialization(settings);
}

// Choose a serialization by inspecting an in-memory column.
SerializationPtr IDataType::getSerialization(const IColumn & column) const
{
    ISerialization::Settings settings =
    {
        .num_rows = column.size(),
        .num_non_default_rows = column.getNumberOfNonDefaultValues(),
        .min_ratio_for_dense_serialization = 10
    };

    return getSerialization(settings);
}

// Sparse serialization is chosen when non-default values are rare enough.
SerializationPtr IDataType::getSerialization(const ISerialization::Settings & settings) const
{
    if (settings.num_non_default_rows * settings.min_ratio_for_dense_serialization < settings.num_rows)
        return getSparseSerialization();

    return getDefaultSerialization();
}

// static
SerializationPtr IDataType::getSerialization(const NameAndTypePair & column, const IDataType::StreamExistenceCallback & callback)
{
    if (column.isSubcolumn())
    {
        auto base_serialization_getter = [&](const IDataType & subcolumn_type)
        {
            return subcolumn_type.getSerialization(column.name, callback);
        };

        auto type_in_storage = column.getTypeInStorage();
        return type_in_storage->getSubcolumnSerialization(column.getSubcolumnName(), base_serialization_getter);
    }

    return column.type->getSerialization(column.name, callback);
}

// Sparse serialization is chosen if the ".sparse.idx" stream exists for the column.
SerializationPtr IDataType::getSerialization(const String & column_name, const StreamExistenceCallback & callback) const
{
    auto sparse_idx_name = escapeForFileName(column_name) + ".sparse.idx";
    if (callback(sparse_idx_name))
        return getSparseSerialization();

    return getDefaultSerialization();
}
```

There are two new entities:
This PR has a huge diff, but most of it is just moving serializations to separate files. The git history for most serializations (the largest part of the data types code) is preserved, because the move between files was done in 1e61f64. However, GitHub is unable to show this properly in its interface.
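The two selection rules in the planned interface from the description can be tried out in isolation. The following is only a minimal sketch under stated assumptions: `SerializationKind`, `Settings`, `chooseByStatistics`, and `chooseByStreams` are hypothetical stand-ins and not ClickHouse names, and the column name is not escaped the way `escapeForFileName` would do it.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical stand-in for the two kinds of serialization discussed in the PR.
enum class SerializationKind { Default, Sparse };

// Mirrors the three fields of ISerialization::Settings shown in the description.
struct Settings
{
    std::size_t num_rows = 0;
    std::size_t num_non_default_rows = 0;
    std::size_t min_ratio_for_dense_serialization = 10;
};

// Same comparison as in the description: sparse is chosen when the number of
// non-default rows, multiplied by the ratio, is still below the total row count.
SerializationKind chooseByStatistics(const Settings & settings)
{
    if (settings.num_non_default_rows * settings.min_ratio_for_dense_serialization < settings.num_rows)
        return SerializationKind::Sparse;

    return SerializationKind::Default;
}

// Assumed shape of the callback: returns true if a stream with this name exists.
using StreamExistenceCallback = std::function<bool(const std::string &)>;

// Read path: sparse is chosen if the "<column>.sparse.idx" stream exists.
// File name escaping is deliberately omitted in this sketch.
SerializationKind chooseByStreams(const std::string & column_name, const StreamExistenceCallback & callback)
{
    if (callback(column_name + ".sparse.idx"))
        return SerializationKind::Sparse;

    return SerializationKind::Default;
}
```

With a ratio of 10, a column of 1000 rows is treated as sparse only when fewer than 100 of its rows are non-default. The comparison is strict, so exactly 100 non-default rows still selects the default serialization.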
alesapin left a comment
Of course, I haven't fully read this PR, but the idea is clear and these changes have been needed for a long time.
Maybe it would be better to use CRTP for data types. In that case, we could store the pointer to ISerialization in the child object (in its constructor) and return it from a method in IDataType using something like static_cast<const Child *>(this)->getSerializationPointer(). This would let us avoid creating shared pointers to ISerialization each time, or storing them in user code and manually keeping the type and its serialization consistent (see the example in MergeTreeDataWriterWide/OnDisk). But this is just a suggestion; the current implementation is also OK, and because of its size we should merge this PR as fast as possible.
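One possible reading of the CRTP suggestion is sketched below. This is an illustration under stated assumptions, not the PR's code: the classes are reduced to toy versions, `DataTypeWithSerialization`, `SerializationNumber`, and `DataTypeNumber` are hypothetical here, and only `getSerializationPointer` and `getDefaultSerialization` are names taken from the discussion.

```cpp
#include <memory>
#include <string>

// Toy version of the serialization interface (the real one is much larger).
struct ISerialization
{
    virtual ~ISerialization() = default;
    virtual std::string name() const = 0;
};

using SerializationPtr = std::shared_ptr<const ISerialization>;

// Hypothetical concrete serialization used only for this sketch.
struct SerializationNumber : ISerialization
{
    std::string name() const override { return "Number"; }
};

// Toy version of the data type interface.
struct IDataType
{
    virtual ~IDataType() = default;
    virtual SerializationPtr getDefaultSerialization() const = 0;
};

// CRTP layer: forwards to the child, which owns the serialization object.
template <typename Child>
struct DataTypeWithSerialization : IDataType
{
    SerializationPtr getDefaultSerialization() const override
    {
        // Copies the stored shared_ptr; no new serialization object is created.
        return static_cast<const Child *>(this)->getSerializationPointer();
    }
};

// The child creates its serialization once, in the constructor.
struct DataTypeNumber : DataTypeWithSerialization<DataTypeNumber>
{
    DataTypeNumber() : serialization(std::make_shared<SerializationNumber>()) {}

    const SerializationPtr & getSerializationPointer() const { return serialization; }

private:
    SerializationPtr serialization;
};
```

In this sketch every call through `IDataType` returns a pointer to the same serialization object, so callers do not need to cache it themselves to keep the type and its serialization consistent.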
@alesapin
Failed tests are the same as in master.

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en
Changelog category (leave one):
Refactoring of data types serialization. It allows choosing different serializations for a single type dynamically.
Part of #19953.