Improve performance of subcolumns reading from compact parts #76141

@Avogar

Right now, when we request a subcolumn from a compact part, we read the whole column and then extract the requested subcolumn in memory:

    if (name_and_type.isSubcolumn())
    {
        const auto & type_in_storage = name_and_type.getTypeInStorage();
        const auto & name_in_storage = name_and_type.getNameInStorage();
        const auto & serialization = serializations_of_full_columns.at(name_in_storage);
        ColumnPtr temp_full_column = getFullColumnFromCache(columns_cache_for_subcolumns, name_in_storage);
        if (!temp_full_column)
        {
            temp_full_column = type_in_storage->createColumn(*serialization);
            serialization->deserializeBinaryBulkWithMultipleStreams(
                temp_full_column,
                rows_to_read,
                deserialize_settings,
                deserialize_binary_bulk_state_map_for_subcolumns[name_in_storage],
                nullptr);
            if (columns_cache_for_subcolumns)
                columns_cache_for_subcolumns->emplace(name_in_storage, temp_full_column);
        }
        auto subcolumn = type_in_storage->getSubcolumn(name_and_type.getSubcolumnName(), temp_full_column);
        /// TODO: Avoid extra copying.
        if (column->empty())
            column = IColumn::mutate(subcolumn);
        else
            column->assumeMutable()->insertRangeFrom(*subcolumn, 0, subcolumn->size());
    }

When the column is large (for example, a large JSON column), reading and deserializing it in full just to extract one subcolumn takes quite a lot of time.

To read individual subcolumns from compact parts, we need to modify the format slightly and store the offset of each substream (right now we store only the offset of each column), so that individual substreams can be read separately.
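To make the idea concrete, here is a rough toy sketch of the difference between per-column offsets and per-substream offsets. Everything in it (`ToyGranule`, `Range`, `appendSubstream`, `readWholeColumn`, `readSubstream`, and the `column.substream` key naming) is hypothetical and only for illustration; it is not the actual compact part format or marks layout.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

/// Hypothetical toy model, NOT the real compact part format.
struct Range
{
    std::size_t offset;
    std::size_t size;
};

struct ToyGranule
{
    /// All substreams of all columns, written back to back.
    std::string data;
    /// What we store today: one offset per column.
    std::map<std::string, std::size_t> column_offsets;
    /// Proposed addition: offset (and size) of every substream.
    std::map<std::string, Range> substream_ranges;
};

void appendSubstream(ToyGranule & g, const std::string & column, const std::string & substream, const std::string & bytes)
{
    /// emplace is a no-op if the column was already registered by an earlier substream.
    g.column_offsets.emplace(column, g.data.size());
    g.substream_ranges[column + "." + substream] = Range{g.data.size(), bytes.size()};
    g.data += bytes;
}

/// Current approach: only the column start is known, so all its substreams are read.
std::string readWholeColumn(const ToyGranule & g, const std::string & column)
{
    const std::size_t begin = g.column_offsets.at(column);
    /// The column ends where the next column begins, or at the end of the data.
    std::size_t end = g.data.size();
    for (const auto & entry : g.column_offsets)
        if (entry.second > begin && entry.second < end)
            end = entry.second;
    return g.data.substr(begin, end - begin);
}

/// Proposed approach: seek directly to the requested substream and read only its bytes.
std::string readSubstream(const ToyGranule & g, const std::string & column, const std::string & substream)
{
    const Range & r = g.substream_ranges.at(column + "." + substream);
    assert(r.offset + r.size <= g.data.size());
    return g.data.substr(r.offset, r.size);
}
```

In this toy model, after appending substreams `a` (4 bytes) and `b` (2 bytes) of a column `json`, `readWholeColumn(g, "json")` has to return all 6 bytes, while `readSubstream(g, "json", "b")` touches only the 2 bytes of `b`. The real change would also need to handle compression blocks and granule boundaries, which this sketch ignores.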

@CurtizJ WDYT?
