Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
70 commits
Select commit Hold shift + click to select a range
bc8e8c2
initial more or less working version
Jul 16, 2020
0ce74cc
updated interface, bug investigation
Jul 16, 2020
80053d7
first truly working version
Jul 17, 2020
52dcb21
Merge branch 'master' into bug/low-cardinality-arrays-optimisations
Jul 17, 2020
00fc61c
updated getOrFindIndex for NullableTypes
Jul 17, 2020
b60b689
fixed Const(Array()) DB::Exception
Jul 17, 2020
ad8afc3
updated LC spec to catch LC(N(T)) and N(LC(T))
Jul 20, 2020
222eb7f
fixed result column overwriting
Jul 22, 2020
1d0bf93
added perftest, fixed style check and PVS bugs
Jul 22, 2020
6af6321
Merge remote-tracking branch 'upstream/master' into bug/low-cardinali…
Jul 22, 2020
586ecf0
fixed 2 explicit constructors
Jul 22, 2020
61ef885
reverting explicit fix
Jul 22, 2020
76a866a
another attribute fix
Jul 22, 2020
31fc55c
fix: const attributes
Jul 22, 2020
9b9360e
try to handle runtime type casting to compare the index column and the
Jul 22, 2020
4a55186
some investigation
Jul 23, 2020
b44b3e9
added column cast, added new test
Jul 24, 2020
63938e0
add: new test reference
Jul 24, 2020
1e32210
updated reference, fixed NULL values handling
Jul 24, 2020
6f953e0
fix: style-checker
Jul 24, 2020
ecf6f62
Merge remote-tracking branch 'upstream/master' into bug/low-cardinali…
Jul 24, 2020
da9502e
Merge remote-tracking branch 'upstream/master' into bug/low-cardinali…
Jul 28, 2020
aaa05f2
updated types checker
Jul 28, 2020
cf696af
fix: function visibility
Jul 28, 2020
0f5b089
fix: ubsan, EMPTY_STRING_REF constant
Jul 28, 2020
059a3dc
Fix up typo in test
Akazz Aug 3, 2020
98119af
Merge remote-tracking branch 'upstream/master' into bug/low-cardinali…
Aug 4, 2020
e270706
Merge branch 'bug/low-cardinality-arrays-optimisations' of github:myr…
Aug 4, 2020
d4e747f
deleted extra-slow perftest to figure out outer coverage
Aug 4, 2020
aa15829
re-adding the test with removed slow requests.
Aug 4, 2020
b210afb
cloning the arr column to build the lc index
Aug 4, 2020
807f188
some data flow fixes
Aug 4, 2020
f2fecb4
re-adding the non-const test queries
Aug 4, 2020
2f589c6
added another query to perftest
Aug 4, 2020
0021f67
re-unioned the *NumImpl
Aug 5, 2020
4dd39f6
fixed comparison bug by cloning the col_arg_indices
Aug 5, 2020
2d27fa0
fix: style error, unused check
Aug 5, 2020
c8f737c
fix: re-resizing, Arcadia compatibility, updated perftest
Aug 6, 2020
8c1dbb3
For compatibility with Arcadia
Akazz Aug 6, 2020
1c940a1
got rid of a virtual col in loop in arrayIndex
Aug 6, 2020
d1412e5
Merge branch 'bug/low-cardinality-arrays-optimisations' of github:myr…
Aug 6, 2020
c4ddabc
updated perf test with integral example (cloning the column)
Aug 6, 2020
ac77e2a
empty commit to restart CI
Aug 7, 2020
a13ee27
dispatched another column to get rid of virtual call
Aug 7, 2020
fe38e50
added special ColumnConst case
Aug 7, 2020
5a742ec
test update
Aug 8, 2020
403a47b
reverted the code back to constant args handling
Aug 10, 2020
565cedb
updating functions
Aug 10, 2020
8562c36
fix: false processing
Aug 10, 2020
659c78c
more canonical resize
Aug 10, 2020
eabcc29
fixed generic LC impl
Aug 11, 2020
6e0d536
Merge branch 'master' into bug/low-cardinality-arrays-optimisations
Akazz Aug 11, 2020
67d716b
revert accidental changes
Aug 11, 2020
82206e8
fixed perftest long test and StringRef assert
Aug 12, 2020
1ed406d
empty commit to restart CI
Aug 12, 2020
e2e38e3
Merge remote-tracking branch 'upstream/master' into bug/low-cardinali…
Aug 12, 2020
0edf3a6
fixing the processGeneric ColumnVector type bug
Aug 13, 2020
a9c9bae
fixed the columnvector test
Aug 13, 2020
0b53b03
fixing perftest to get more relevant result
Aug 14, 2020
a519559
updated ColumnVector to compare with other U, cleaned ImplString
Aug 14, 2020
54f5c0e
fixed the String impl non-const offset bug
Aug 14, 2020
69a0ca2
checking other const approach
Aug 17, 2020
3337d20
some bugs remaining
Aug 17, 2020
5ef5c88
moved the LC stuff to its implementations (some bugs remaining)
Aug 18, 2020
4f96291
rewrote the LC spec as a dispatcher
Aug 19, 2020
d298409
Merge remote-tracking branch 'upstream/master' into bug/low-cardinali…
Aug 20, 2020
ce0ab0e
Split array_index_low_cardinality performance test: numbers/strings
Akazz Aug 23, 2020
eac2c59
Fix up
Akazz Aug 23, 2020
48333b2
Minior rename
Akazz Aug 23, 2020
d99ddb9
In perf test array_index_lc: adjusted iterations count
Akazz Aug 24, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
17 changes: 14 additions & 3 deletions base/common/StringRef.h
Original file line number Diff line number Diff line change
Expand Up @@ -21,21 +21,26 @@
#endif


/// The thing to avoid creating strings to find substrings in the hash table.
/**
* The std::string_view-like container to avoid creating strings to find substrings in the hash table.
*/
struct StringRef
{
const char * data = nullptr;
size_t size = 0;

/// Non-constexpr due to reinterpret_cast.
template <typename CharT, typename = std::enable_if_t<sizeof(CharT) == 1>>
constexpr StringRef(const CharT * data_, size_t size_) : data(reinterpret_cast<const char *>(data_)), size(size_)
StringRef(const CharT * data_, size_t size_) : data(reinterpret_cast<const char *>(data_)), size(size_)
{
/// Sanity check for overflowed values.
assert(size < 0x8000000000000000ULL);
}

constexpr StringRef(const char * data_, size_t size_) : data(data_), size(size_) {}

StringRef(const std::string & s) : data(s.data()), size(s.size()) {}
constexpr explicit StringRef(const std::string_view & s) : data(s.data()), size(s.size()) {}
constexpr explicit StringRef(std::string_view s) : data(s.data()), size(s.size()) {}
constexpr StringRef(const char * data_) : StringRef(std::string_view{data_}) {}
constexpr StringRef() = default;

Expand All @@ -45,6 +50,12 @@ struct StringRef
constexpr explicit operator std::string_view() const { return {data, size}; }
};

/// Here constexpr doesn't implicate inline, see https://www.viva64.com/en/w/v1043/
/// nullptr can't be used because the StringRef values are used in SipHash's pointer arithmetics
/// and the UBSan thinks that something like nullptr + 8 is UB.
constexpr const inline char empty_string_ref_addr{};
constexpr const inline StringRef EMPTY_STRING_REF{&empty_string_ref_addr, 0};

using StringRefs = std::vector<StringRef>;


Expand Down
9 changes: 9 additions & 0 deletions src/Columns/ColumnLowCardinality.h
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,15 @@ namespace ErrorCodes
extern const int LOGICAL_ERROR;
}

/**
* How data is stored (in a nutshell):
* we have a dictionary @e reverse_index in ColumnUnique that holds pairs (DataType, UIntXX) and a column
* with UIntXX holding actual data indices.
* To obtain the value's index, call #getOrFindIndex.
* To operate on the data (so called indices column), call #getIndexes.
*
* @note The indices column always contains the default value (empty StringRef) with the first index.
*/
class ColumnLowCardinality final : public COWHelper<IColumn, ColumnLowCardinality>
{
friend class COWHelper<IColumn, ColumnLowCardinality>;
Expand Down
6 changes: 0 additions & 6 deletions src/Columns/ColumnNullable.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,6 @@ namespace DB

namespace ErrorCodes
{
extern const int NOT_IMPLEMENTED;
extern const int LOGICAL_ERROR;
extern const int ILLEGAL_COLUMN;
extern const int SIZES_OF_NESTED_COLUMNS_ARE_INCONSISTENT;
Expand Down Expand Up @@ -105,11 +104,6 @@ void ColumnNullable::get(size_t n, Field & res) const
getNestedColumn().get(n, res);
}

StringRef ColumnNullable::getDataAt(size_t /*n*/) const
{
throw Exception{"Method getDataAt is not supported for " + getName(), ErrorCodes::NOT_IMPLEMENTED};
}

void ColumnNullable::insertData(const char * pos, size_t length)
{
if (pos == nullptr)
Expand Down
15 changes: 13 additions & 2 deletions src/Columns/ColumnNullable.h
Original file line number Diff line number Diff line change
Expand Up @@ -51,9 +51,20 @@ class ColumnNullable final : public COWHelper<IColumn, ColumnNullable>
bool isNullAt(size_t n) const override { return assert_cast<const ColumnUInt8 &>(*null_map).getData()[n] != 0;}
Field operator[](size_t n) const override;
void get(size_t n, Field & res) const override;
bool getBool(size_t n) const override { return isNullAt(n) ? 0 : nested_column->getBool(n); }
bool getBool(size_t n) const override { return isNullAt(n) ? false : nested_column->getBool(n); }
UInt64 get64(size_t n) const override { return nested_column->get64(n); }
StringRef getDataAt(size_t n) const override;

/**
* If isNullAt(n) returns false, returns the nested column's getDataAt(n), otherwise returns a special value
* EMPTY_STRING_REF indicating that data is not present.
*/
StringRef getDataAt(size_t n) const override
{
if (isNullAt(n))
return EMPTY_STRING_REF;

return getNestedColumn().getDataAt(n);
}

/// Will insert null value if pos=nullptr
void insertData(const char * pos, size_t length) override;
Expand Down
60 changes: 35 additions & 25 deletions src/Columns/ColumnUnique.h
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@
#include <ext/range.h>

#include <common/unaligned.h>
#include "Columns/ColumnConst.h"


namespace DB
Expand Down Expand Up @@ -90,14 +91,13 @@ class ColumnUnique final : public COWHelper<IColumnUnique, ColumnUnique<ColumnTy
void protect() override { column_holder->protect(); }
size_t allocatedBytes() const override
{
return column_holder->allocatedBytes()
+ index.allocatedBytes()
+ (nested_null_mask ? nested_null_mask->allocatedBytes() : 0);
return column_holder->allocatedBytes() + reverse_index.allocatedBytes()
+ (nested_null_mask ? nested_null_mask->allocatedBytes() : 0);
}
void forEachSubcolumn(IColumn::ColumnCallback callback) override
{
callback(column_holder);
index.setColumn(getRawColumnPtr());
reverse_index.setColumn(getRawColumnPtr());
if (is_nullable)
nested_column_nullable = ColumnNullable::create(column_holder, nested_null_mask);
}
Expand All @@ -109,16 +109,30 @@ class ColumnUnique final : public COWHelper<IColumnUnique, ColumnUnique<ColumnTy
return false;
}

const UInt64 * tryGetSavedHash() const override { return index.tryGetSavedHash(); }
const UInt64 * tryGetSavedHash() const override { return reverse_index.tryGetSavedHash(); }

UInt128 getHash() const override { return hash.getHash(*getRawColumnPtr()); }

private:
std::optional<UInt64> getOrFindValueIndex(StringRef value) const override
{
if (std::optional<UInt64> res = reverse_index.getIndex(value); res)
return res;

auto& nested = *getNestedColumn();

for (size_t i = 0; i < nested.size(); ++i)
if (nested.getDataAt(i) == value)
return i;

return {};
}

private:
IColumn::WrappedPtr column_holder;
bool is_nullable;
size_t size_of_value_if_fixed = 0;
ReverseIndex<UInt64, ColumnType> index;

ReverseIndex<UInt64, ColumnType> reverse_index;

/// For DataTypeNullable, stores null map.
IColumn::WrappedPtr nested_null_mask;
Expand Down Expand Up @@ -169,21 +183,20 @@ template <typename ColumnType>
ColumnUnique<ColumnType>::ColumnUnique(const ColumnUnique & other)
: column_holder(other.column_holder)
, is_nullable(other.is_nullable)
, size_of_value_if_fixed (other.size_of_value_if_fixed)
, index(numSpecialValues(is_nullable), 0)
, size_of_value_if_fixed(other.size_of_value_if_fixed)
, reverse_index(numSpecialValues(is_nullable), 0)
{
index.setColumn(getRawColumnPtr());
reverse_index.setColumn(getRawColumnPtr());
createNullMask();
}

template <typename ColumnType>
ColumnUnique<ColumnType>::ColumnUnique(const IDataType & type)
: is_nullable(type.isNullable())
, index(numSpecialValues(is_nullable), 0)
: is_nullable(type.isNullable()), reverse_index(numSpecialValues(is_nullable), 0)
{
const auto & holder_type = is_nullable ? *static_cast<const DataTypeNullable &>(type).getNestedType() : type;
column_holder = holder_type.createColumn()->cloneResized(numSpecialValues());
index.setColumn(getRawColumnPtr());
reverse_index.setColumn(getRawColumnPtr());
createNullMask();

if (column_holder->valuesHaveFixedSize())
Expand All @@ -192,16 +205,14 @@ ColumnUnique<ColumnType>::ColumnUnique(const IDataType & type)

template <typename ColumnType>
ColumnUnique<ColumnType>::ColumnUnique(MutableColumnPtr && holder, bool is_nullable_)
: column_holder(std::move(holder))
, is_nullable(is_nullable_)
, index(numSpecialValues(is_nullable_), 0)
: column_holder(std::move(holder)), is_nullable(is_nullable_), reverse_index(numSpecialValues(is_nullable_), 0)
{
if (column_holder->size() < numSpecialValues())
throw Exception("Too small holder column for ColumnUnique.", ErrorCodes::ILLEGAL_COLUMN);
if (isColumnNullable(*column_holder))
throw Exception("Holder column for ColumnUnique can't be nullable.", ErrorCodes::ILLEGAL_COLUMN);

index.setColumn(getRawColumnPtr());
reverse_index.setColumn(getRawColumnPtr());
createNullMask();

if (column_holder->valuesHaveFixedSize())
Expand Down Expand Up @@ -288,12 +299,10 @@ size_t ColumnUnique<ColumnType>::uniqueInsertFrom(const IColumn & src, size_t n)
template <typename ColumnType>
size_t ColumnUnique<ColumnType>::uniqueInsertData(const char * pos, size_t length)
{
auto column = getRawColumnPtr();
if (auto index = getNestedTypeDefaultValueIndex(); getRawColumnPtr()->getDataAt(index) == StringRef(pos, length))
return index;

if (column->getDataAt(getNestedTypeDefaultValueIndex()) == StringRef(pos, length))
return getNestedTypeDefaultValueIndex();

auto insertion_point = index.insert(StringRef(pos, length));
auto insertion_point = reverse_index.insert({pos, length});

updateNullMask();

Expand All @@ -320,6 +329,7 @@ StringRef ColumnUnique<ColumnType>::serializeValueIntoArena(size_t n, Arena & ar
return StringRef(nested_ref.data - s, nested_ref.size + s);
}


return column_holder->serializeValueIntoArena(n, arena, begin);
}

Expand Down Expand Up @@ -513,14 +523,14 @@ MutableColumnPtr ColumnUnique<ColumnType>::uniqueInsertRangeImpl(

if (secondary_index && next_position >= max_dictionary_size)
{
auto insertion_point = index.getInsertionPoint(ref);
if (insertion_point == index.lastInsertionPoint())
auto insertion_point = reverse_index.getInsertionPoint(ref);
if (insertion_point == reverse_index.lastInsertionPoint())
res = insert_key(ref, *secondary_index);
else
positions[num_added_rows] = insertion_point;
}
else
res = insert_key(ref, index);
res = insert_key(ref, reverse_index);

if (res)
return res;
Expand Down
40 changes: 23 additions & 17 deletions src/Columns/ColumnVector.h
Original file line number Diff line number Diff line change
Expand Up @@ -22,31 +22,31 @@ namespace ErrorCodes
* Floating-point numbers are compared this way that NaNs always end up at the end
* (if you don't do this, the sort would not work at all).
*/
template <typename T>
template <class T, class U = T>
struct CompareHelper
{
static bool less(T a, T b, int /*nan_direction_hint*/) { return a < b; }
static bool greater(T a, T b, int /*nan_direction_hint*/) { return a > b; }
static constexpr bool less(T a, U b, int /*nan_direction_hint*/) { return a < b; }
static constexpr bool greater(T a, U b, int /*nan_direction_hint*/) { return a > b; }

/** Compares two numbers. Returns a number less than zero, equal to zero, or greater than zero if a < b, a == b, a > b, respectively.
* If one of the values is NaN, then
* - if nan_direction_hint == -1 - NaN are considered less than all numbers;
* - if nan_direction_hint == 1 - NaN are considered to be larger than all numbers;
* Essentially: nan_direction_hint == -1 says that the comparison is for sorting in descending order.
*/
static int compare(T a, T b, int /*nan_direction_hint*/)
static constexpr int compare(T a, U b, int /*nan_direction_hint*/)
{
return a > b ? 1 : (a < b ? -1 : 0);
}
};

template <typename T>
template <class T>
struct FloatCompareHelper
{
static bool less(T a, T b, int nan_direction_hint)
static constexpr bool less(T a, T b, int nan_direction_hint)
{
bool isnan_a = std::isnan(a);
bool isnan_b = std::isnan(b);
const bool isnan_a = std::isnan(a);
const bool isnan_b = std::isnan(b);

if (isnan_a && isnan_b)
return false;
Expand All @@ -58,10 +58,10 @@ struct FloatCompareHelper
return a < b;
}

static bool greater(T a, T b, int nan_direction_hint)
static constexpr bool greater(T a, T b, int nan_direction_hint)
{
bool isnan_a = std::isnan(a);
bool isnan_b = std::isnan(b);
const bool isnan_a = std::isnan(a);
const bool isnan_b = std::isnan(b);

if (isnan_a && isnan_b)
return false;
Expand All @@ -73,10 +73,11 @@ struct FloatCompareHelper
return a > b;
}

static int compare(T a, T b, int nan_direction_hint)
static constexpr int compare(T a, T b, int nan_direction_hint)
{
bool isnan_a = std::isnan(a);
bool isnan_b = std::isnan(b);
const bool isnan_a = std::isnan(a);
const bool isnan_b = std::isnan(b);

if (unlikely(isnan_a || isnan_b))
{
if (isnan_a && isnan_b)
Expand All @@ -91,9 +92,8 @@ struct FloatCompareHelper
}
};

template <> struct CompareHelper<Float32> : public FloatCompareHelper<Float32> {};
template <> struct CompareHelper<Float64> : public FloatCompareHelper<Float64> {};

template <class U> struct CompareHelper<Float32, U> : public FloatCompareHelper<Float32> {};
template <class U> struct CompareHelper<Float64, U> : public FloatCompareHelper<Float64> {};

/** A template for columns that use a simple array to store.
*/
Expand Down Expand Up @@ -201,6 +201,12 @@ class ColumnVector final : public COWHelper<ColumnVectorHelper, ColumnVector<T>>
data.push_back(value);
}

template <class U>
constexpr int compareAtOther(size_t n, size_t m, const ColumnVector<U> & rhs, int nan_direction_hint) const
{
return CompareHelper<T, U>::compare(data[n], rhs.data[m], nan_direction_hint);
}

/// This method implemented in header because it could be possibly devirtualized.
int compareAt(size_t n, size_t m, const IColumn & rhs_, int nan_direction_hint) const override
{
Expand Down
20 changes: 20 additions & 0 deletions src/Columns/IColumnUnique.h
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
#pragma once
#include <optional>
#include <Columns/IColumn.h>
#include <Common/UInt128.h>

Expand All @@ -9,6 +10,7 @@ namespace ErrorCodes
extern const int NOT_IMPLEMENTED;
}

/// Sort of a dictionary
class IColumnUnique : public IColumn
{
public:
Expand Down Expand Up @@ -68,6 +70,24 @@ class IColumnUnique : public IColumn
const char * getFamilyName() const override { return "ColumnUnique"; }
TypeIndex getDataType() const override { return getNestedColumn()->getDataType(); }

/**
* Given some value (usually, of type @e ColumnType) @p value that is convertible to DB::StringRef, obtains its
* index in the DB::ColumnUnique::reverse_index hashtable.
*
* The reverse index (StringRef => UInt64) is built lazily, so there are two variants:
* - On the function call it's present. Therefore we obtain the index in O(1).
* - The reverse index is absent. We search for the index linearly.
*
* @see DB::ReverseIndex
* @see DB::ColumnUnique
*
* The most common example uses https://clickhouse.tech/docs/en/sql-reference/data-types/lowcardinality/ columns.
* Consider data type @e LC(String). The inner type here is @e String which is more or less a contigous memory
* region, so it can be easily represented as a @e StringRef. So we pass that ref to this function and get its
* index in the dictionary, which can be used to operate with the indices column.
*/
virtual std::optional<UInt64> getOrFindValueIndex(StringRef value) const = 0;

void insert(const Field &) override
{
throw Exception("Method insert is not supported for ColumnUnique.", ErrorCodes::NOT_IMPLEMENTED);
Expand Down
Loading