Skip to content

perf(sql): optimize count_distinct symbol function#3974

Merged
bluestreak01 merged 6 commits intomasterfrom
puzpuzpuz_optimize_symbol_count_distinct
Nov 20, 2023
Merged

perf(sql): optimize count_distinct symbol function#3974
bluestreak01 merged 6 commits intomasterfrom
puzpuzpuz_optimize_symbol_count_distinct

Conversation

@puzpuzpuz
Copy link
Copy Markdown
Contributor

@puzpuzpuz puzpuzpuz commented Nov 17, 2023

Includes the following optimizations:

  • Replaces CompactIntHashSet with BitSet in CountDistinctSymbolGroupByFunction. The former has a much more compact memory layout and faster operations.
  • Introduces early exit for non-keyed queries with count_distinct() over symbol columns, e.g. SELECT count_distinct(sym_col) FROM my_table;. The exit condition is based on the symbol table sizes: as soon as they're reached, row iteration stops.
CREATE TABLE x AS (
  SELECT rnd_symbol(1000, 10, 10, 0) sym1,
         rnd_symbol('a', 'b', 'c') sym2,
         timestamp_sequence('2023-11-17T10', 1000000L) ts
  FROM long_sequence(100000000)
) TIMESTAMP(ts) PARTITION BY DAY;

-- before 1s
-- after 10ms
SELECT count_distinct(sym1) FROM x;

-- before 2s
-- after 10ms
SELECT count_distinct(sym1), count_distinct(sym2) FROM x;

-- before 2.5s
-- after 2.4s
SELECT sym2, count_distinct(sym1) FROM x;

@puzpuzpuz puzpuzpuz added SQL Issues or changes relating to SQL execution Performance Performance improvements labels Nov 17, 2023
@puzpuzpuz puzpuzpuz self-assigned this Nov 17, 2023
Includes the following optimizations:
* Replaces CompactIntHashSet with BitSet in CountDistinctSymbolGroupByFunction.
  The former has a much more compact memory layout and faster operations.
* Introduces early exit for SELECT count_distinct(sym_col) FROM my_table; queries.
  Exit condition is based on the symbol table size: as soon as it's reached,
  row iteration stops.
@puzpuzpuz puzpuzpuz force-pushed the puzpuzpuz_optimize_symbol_count_distinct branch from 0320c1c to 98e3079 Compare November 17, 2023 09:30
@puzpuzpuz puzpuzpuz marked this pull request as ready for review November 17, 2023 09:39
@eugenels eugenels self-requested a review November 17, 2023 10:28
@bluestreak01
Copy link
Copy Markdown
Member

@puzpuzpuz this is a very clever improvement 👍 how will it work with where clause?

@puzpuzpuz
Copy link
Copy Markdown
Contributor Author

@puzpuzpuz this is a very clever improvement 👍 how will it work with where clause?

If the filtered symbol values are only a subset of the full symbol table, the early exit check will be no-op (full loop over the base cursor). If all symbol values are returned by the filter, the check will kick in and redundant iteration over the base cursor will stop. That's done only in the GroupByNotKeyedRecordCursorFactory, so all other factories behave in the same way.

@eugenels eugenels self-requested a review November 20, 2023 12:47
eugenels
eugenels previously approved these changes Nov 20, 2023
@ideoma
Copy link
Copy Markdown
Collaborator

ideoma commented Nov 20, 2023

[PR Coverage check]

😍 pass : 68 / 71 (95.77%)

file detail

path covered line new line coverage
🔵 io/questdb/griffin/engine/functions/groupby/CountDistinctUuidGroupByFunction.java 0 1 00.00%
🔵 io/questdb/griffin/engine/functions/groupby/CountDistinctIPv4GroupByFunction.java 0 1 00.00%
🔵 io/questdb/griffin/engine/functions/GroupByFunction.java 1 2 50.00%
🔵 io/questdb/griffin/engine/functions/groupby/CountDistinctStringGroupByFunction.java 1 1 100.00%
🔵 io/questdb/std/BitSet.java 22 22 100.00%
🔵 io/questdb/griffin/engine/functions/groupby/CountDistinctSymbolGroupByFunction.java 19 19 100.00%
🔵 io/questdb/griffin/engine/functions/groupby/CountDistinctLongGroupByFunction.java 1 1 100.00%
🔵 io/questdb/griffin/engine/functions/groupby/CountDistinctLong256GroupByFunction.java 1 1 100.00%
🔵 io/questdb/griffin/engine/groupby/GroupByNotKeyedRecordCursorFactory.java 22 22 100.00%
🔵 io/questdb/griffin/engine/functions/groupby/CountDistinctIntGroupByFunction.java 1 1 100.00%

@puzpuzpuz
Copy link
Copy Markdown
Contributor Author

@eugenels thanks for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Performance Performance improvements SQL Issues or changes relating to SQL execution

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants