
[RFC] Different representation of columns. Sparse columns. #19953

@CurtizJ

Description


Use case
This issue describes the details of a refactoring that moves ser/de of columns out of IDataType and allows choosing different column representations dynamically in every data part. It also describes the implementation of a sparse on-disk representation.
It is a prerequisite for a bigger task: dynamically choosing the optimal representation (LowCardinality, Sparse or Dense) and codec for columns.

Describe the solution you'd like
The plan:

Introduce an interface ISerialization which will be responsible for how columns are serialized and deserialized.
Move the enumerateStreams, getFileNameForStream and serialize*/deserialize* methods to it from IDataType as is.
Implement them for every data type.

Introduce methods:

  • SerializationPtr IDataType::getSerialization(const IColumn & column)
    Determines which representation (sparse/dense) to write the column in according to its content and returns the appropriate Serialization. Can be used at inserts, when we write a full column from memory.
  • SerializationPtr IDataType::getSerialization(const SerializationSettings & settings)
    The same as above, but can be used when we don't have the column in memory and only know statistics about its content (number of rows, number of non-default values, etc.). Can be used for merges.
  • SerializationPtr IDataType::getSerialization(const NameAndTypePair & name_and_type, ExistenceCallback callback)
    Used for deserialization, when we need to determine which serialization to use from the files written on disk. It will check via the callback whether certain files exist and return the appropriate Serialization accordingly.

For now there will be 3 types of Serialization:

  • Default.
  • Sparse. It will write only the non-zero values from the column using the default serialization and will write a separate stream with offsets for them.
  • Subcolumns. The same logic as in DataTypeOneElementTupleStreams now. It wraps the default serialization, but some substreams get names like in named tuples, for proper reading of subcolumns. It will be used only for deserialization. Obtaining it will follow the same logic as IDataType::tryGetSubcolumnType now.

There will be corresponding methods for obtaining each type of Serialization.

Details of sparse serialization

Every column in a part will be written with either the Default or the Sparse serialization. The serialization will be chosen before reading/writing the column.
Parts will store the number of non-default values for every column. During merges, serializations will be chosen according to the total number of non-default values in the column across all merged parts.
To load data parts, we can store this metadata in a separate file.

For the first iteration the in-memory column representation will always be dense. But when some kind of ColumnSparse is implemented, not many code changes are expected.
