feat(sql): implement mode(Ø/T/L/K/S/D) group by functions#5623
feat(sql): implement mode(Ø/T/L/K/S/D) group by functions#5623bluestreak01 merged 35 commits intomasterfrom
Conversation
… off-heap hashmap, passable by pointer, for use in group by queries.
|
Is there any reason that mode cannot be computed on a double value? My use case is the following: I have a devices that sends timestamped values like this: I.e. I dont have a definitive timestamp for the value. The problem with this is of cause that the time on the device may not be exact, so when I receive this type of data in the beginning/end of the month, it may be that the month has changed in the device but not when the receiver makes its timestamp. My idea to handle this in questdb is to store all values like this: device_time TIMESTAMP The device time may be set to an incorrect month, however if I could do: Then that would give me the correct value for the month, as it solves the problem by selecting the most observed value in the month. |
|
Hi @rmja , thanks for the details. No particular reason, just the original use case didn't require it. We can have more options. However, keeping cardinality in check is important, else you will exhaust RAM building hashmaps. This is on small pause for the time being due to other priorities, but I will wrap it up when I can. |
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the WalkthroughAdds mode() aggregate functions for boolean, double, long, string, symbol, and varchar; supporting factories; allocator-backed group-by hash maps for long/charsequence/utf8; benchmarks comparing group-by maps vs standard maps; API extensions in std maps; visibility change for LongLongHashMap constructor; and comprehensive unit/integration tests. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Client
participant SQL as SQL Planner
participant Exec as Execution Engine
participant GB as GroupBy Operator
participant Agg as mode() Aggregator
participant Map as GroupBy* Map
Client->>SQL: SELECT ... mode(col) FROM t [GROUP BY k] [SAMPLE BY ...]
SQL->>Exec: Plan with mode() (per-type factory)
Exec->>GB: Initialize groups and aggregators
GB->>Agg: initValueTypes/setAllocator
loop scan records
GB->>Agg: computeFirst/computeNext(mapValue, record, rowId)
Agg->>Map: inc(key) / put(key,1)
Agg-->>GB: store/update map pointer in mapValue
end
par parallel workers (optional)
GB->>Agg: merge(destValue, srcValue)
Agg->>Map: mergeAdd(srcMap)
end
GB->>Agg: getX(record) for result materialization
Agg->>Map: iterate to find max-count key
Agg-->>GB: mode value
GB-->>Exec: grouped rows with mode()
Exec-->>Client: result set
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@CodeRabbit review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 16
♻️ Duplicate comments (5)
core/src/main/java/io/questdb/griffin/engine/groupby/GroupByCharSequenceLongHashMap.java (1)
34-34: Previous review comment not addressed.The earlier review noted that this class owns its memory and thus isn't truly a flyweight. The javadoc still describes it as a "flyweight hash map," which remains misleading.
benchmarks/src/main/java/org/questdb/GroupByCharSequenceLongHashMapBenchmark.java (2)
53-59: Static fields with@State(Scope.Thread)may cause issues.Same issue as in
GroupByLongLongHashMapBenchmark: static fields are shared across threads if JMH runs multiple threads. Consider making these instance fields to align withScope.Threadsemantics.
74-81: Verify allocator reuse after close().Same issue as in
GroupByLongLongHashMapBenchmark: Line 76 callsallocator.close(), and line 77 callssetAllocator(allocator). Ensure the allocator can be reused afterclose().core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeStringGroupByFunction.java (1)
46-47: Make initialCapacity and loadFactor final constants.These fields are initialized once in the class and never modified. They should be declared as
finalconstants to prevent accidental modification and clearly express intent.Apply this diff:
- int initialCapacity = 4; - double loadFactor = 0.7d; + private static final int INITIAL_CAPACITY = 4; + private static final double LOAD_FACTOR = 0.7;Then update the map initializations on lines 48-49:
- GroupByCharSequenceLongHashMap mapA = new GroupByCharSequenceLongHashMap(initialCapacity, loadFactor, LONG_NULL, LONG_NULL); - GroupByCharSequenceLongHashMap mapB = new GroupByCharSequenceLongHashMap(initialCapacity, loadFactor, LONG_NULL, LONG_NULL); + GroupByCharSequenceLongHashMap mapA = new GroupByCharSequenceLongHashMap(INITIAL_CAPACITY, LOAD_FACTOR, LONG_NULL, LONG_NULL); + GroupByCharSequenceLongHashMap mapB = new GroupByCharSequenceLongHashMap(INITIAL_CAPACITY, LOAD_FACTOR, LONG_NULL, LONG_NULL);Based on past review comments.
core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeVarcharGroupByFunction.java (1)
180-201: Verify determinism when multiple values share the maximum count.Similar to
ModeSymbolGroupByFunction, when multiple varchar values tie for the maximum count, the implementation returns the first one encountered during hash map iteration, which may be non-deterministic.This is a duplicate concern across all Mode implementations. The verification script in the
ModeSymbolGroupByFunctionreview will cover this as well.
🧹 Nitpick comments (23)
core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeVarcharGroupByFunctionFactory.java (1)
34-56: Factory implementation follows standard pattern.The factory correctly:
- Returns signature
"mode(Ø)"for VARCHAR type- Indicates participation in group-by via
isGroupBy() = true- Constructs
ModeVarcharGroupByFunctionwith the first argumentThis is consistent with the QuestDB function factory pattern.
Note: The implementation assumes the planner ensures
argshas at least one element. If you want defensive validation, consider adding:public Function newInstance( int position, ObjList<Function> args, IntList argPositions, CairoConfiguration configuration, SqlExecutionContext sqlExecutionContext ) { + if (args.size() != 1) { + throw SqlException.$(position, "mode() expects exactly 1 argument"); + } return new ModeVarcharGroupByFunction(args.getQuick(0)); }However, this validation is typically done by the planner, so omitting it here is acceptable.
core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeLongGroupByFunction.java (2)
88-109: Mode-finding algorithm is correct but could be optimized.The linear scan through all map slots to find the maximum count is correct but not optimal for large maps.
Consider tracking the mode key/count during updates (in
computeFirst/computeNext) rather than scanning the entire map ingetLong(). This would reduce the time complexity from O(capacity) to O(1) for retrieval, at the cost of slightly more work during updates.However, if this is deferred optimization and current performance is acceptable, the current implementation is correct and safe.
46-49: Consider making initialCapacity and loadFactor configurable.The hardcoded
initialCapacity = 4andloadFactor = 0.7dmay not be optimal for all use cases. For high-cardinality columns (as warned in PR comments), this could lead to many rehashes.Consider:
- Making these configurable via constructor parameters or configuration
- Adding documentation about memory implications for high-cardinality data
- Exposing metrics for map size/rehash counts for observability
This aligns with the PR comment from nwoolmer warning about "cardinality and RAM usage for hashmaps."
core/src/main/java/io/questdb/griffin/engine/groupby/GroupByLongLongHashMap.java (3)
78-93: inc(key, delta) does not validate delta sign.The method increments a value by
deltawithout checking ifdeltais negative. While this might be intentional for flexibility, negative deltas could produce nonsensical counts in the context of mode aggregation (where counts represent frequencies and should be non-negative).If negative deltas are not a valid use case, add validation:
public void inc(long key, long delta) { if (key != noKeyValue) { + if (delta < 0) { + throw new IllegalArgumentException("delta must be non-negative"); + } long index = keyIndex(key);Alternatively, document that negative deltas are supported if there's a valid use case (e.g., decrements).
110-123: of(long ptr) allocates new map even if ptr is 0—clarify intent.When
ptr == 0, the method allocates a new map and initializes it. However, the method nameofsuggests it's restoring from an existing pointer. The behavior is correct but potentially confusing.Consider renaming to
ofOrAllocate(long ptr)or add a javadoc comment:+ /** + * Restores the map from the given pointer, or allocates a new map if ptr is 0. + */ public GroupByLongLongHashMap of(long ptr) {
182-209: Rehashing frees old buffer before verifying new map is valid.Line 208 calls
allocator.free(oldPtr, ...)immediately after rehashing. If there's a bug in the rehashing logic (e.g., incorrectkeyIndexcalculation), the old data is already lost, making debugging difficult.For safer debugging, consider deferring the free until after a validation pass (in debug builds):
} + // In debug mode, validate the rehashed map before freeing old buffer + assert validateRehash(oldPtr, oldCapacity); allocator.free(oldPtr, HEADER_SIZE + 16L * oldCapacity); } + +private boolean validateRehash(long oldPtr, int oldCapacity) { + // Check that all old entries are present in new map + for (long p = oldPtr + HEADER_SIZE, lim = oldPtr + HEADER_SIZE + 16L * oldCapacity; p < lim; p += 16L) { + long key = Unsafe.getUnsafe().getLong(p); + if (key != noKeyValue) { + long value = Unsafe.getUnsafe().getLong(p + 8L); + assert get(key) == value : "rehash validation failed"; + } + } + return true; +}This is optional and primarily for debugging.
core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeDoubleGroupByFunction.java (1)
46-49: Extract initialCapacity and loadFactor into configurable parameters
All Mode*GroupByFunction implementations (String, Varchar, Symbol, Long, Double) currently hardcodeinitialCapacity = 4andloadFactor = 0.7d; consider exposing these via constructor arguments or central configuration and tuning based on expected cardinality or query hints.benchmarks/src/main/java/org/questdb/GroupByLongLongHashMapBenchmark.java (1)
54-62: Static fields with@State(Scope.Thread)may cause issues.The class uses
@State(Scope.Thread), but the allocator, maps, and RNG are declared asstaticfields. This means they are shared across all threads if JMH runs multiple threads. If the benchmark is intended to be single-threaded (forks=1 in main), this is fine, but the pattern is unconventional. Typically, fields should be instance fields when usingScope.Thread.Consider making these instance fields to align with
Scope.Threadsemantics:- private static final GroupByAllocator allocator = new FastGroupByAllocator(128 * 1024, Numbers.SIZE_1GB); + private GroupByAllocator allocator = new FastGroupByAllocator(128 * 1024, Numbers.SIZE_1GB);(Apply similar changes to other static fields, and adjust
reset()to recreate the allocator instead of closing and reusing it.)core/src/main/java/io/questdb/std/Utf8SequenceLongHashMap.java (1)
78-80: Consider validating the index parameter.The
inc(int index)method assumes the index is negative (occupied slot) and directly accessesvalues[-index - 1]. If called with a positive index, this could cause array bounds issues.Consider adding a precondition check:
public void inc(int index) { + assert index < 0 : "inc(int) expects a negative index from keyIndex()"; values[-index - 1]++; }Or document the precondition in a JavaDoc comment.
core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeBooleanGroupByFunctionFactoryTest.java (1)
46-56: Document defaultfalsefallback in ModeBooleanGroupByFunction
The behaviour is intentional: when all inputs areNULL, both counters are set toLONG_NULLandgetBoolevaluatestrueCount > falseCountasfalse. Add a Javadoc note incore/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeBooleanGroupByFunction.java(and update the FunctionFactory contract) to explicitly document this default.core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeStringGroupByFunction.java (1)
179-200: Optimize mode extraction with early exit.The current implementation scans the entire capacity even after finding the mode. While functionally correct, consider that for large maps this scans every slot including empty ones.
Consider tracking the maximum count during insertion/merge to avoid the full scan:
// Add field: private long cachedModeKey = LONG_NULL; private long cachedModeCount = -1; // Update in computeNext/merge to track max // Then getStr becomes: CharSequence getStr(Record record, GroupByCharSink sink) { long mapPtr = record.getLong(valueIndex); if (mapPtr <= 0) { return null; } mapA.of(mapPtr); if (cachedModeKey != LONG_NULL) { sink.of(cachedModeKey); return sink; } // fallback to scan... }Note: This optimization adds complexity and would require careful handling during merge operations.
core/src/main/java/io/questdb/griffin/engine/groupby/GroupByUtf8SequenceLongHashMap.java (1)
258-273: Probe method detects corruption but lacks reprobing limit.The probe method correctly implements linear probing and detects when the table is full by checking if it wraps around to the starting index. However, for a properly sized table this should rarely happen.
Consider whether the error message on line 272 should include diagnostic information (size, capacity, key) to aid debugging if corruption occurs in production.
- throw CairoException.critical(0).put("corrupt utf8 sequence long long hash map"); + throw CairoException.critical(0) + .put("corrupt utf8 sequence long hash map [size=").put(size()) + .put(", capacity=").put(capacity()) + .put(", key=").put(key) + .put(']');core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeSymbolGroupByFunction.java (4)
51-54: Hard-coded capacity and load factor may not suit all workloads.The
initialCapacity = 4andloadFactor = 0.7dare hard-coded. For high-cardinality symbol columns, starting with capacity 4 may cause excessive resizing, while for low-cardinality columns it may waste memory. Consider making these configurable or adaptive based on symbol table characteristics.
51-54: Consider making capacity and load factor configurable or document the rationale.The hard-coded
initialCapacity = 4andloadFactor = 0.7dmay lead to early resizing in typical aggregation scenarios with moderate-to-high cardinality. If these values are based on empirical tuning or match conventions across other Mode implementations, consider adding a comment explaining the choice.
103-114: Capacity iteration is intentional—add clarifying comment
GroupByLongLongHashMap lacks an entry iterator; iterating over capacity with LONG_NULL checks is the correct approach. Add a comment above the loop explaining this intent for future maintainers.
51-54: Consider making initial capacity and load factor configurable.The hardcoded
initialCapacity = 4andloadFactor = 0.7dmay not be optimal for all use cases. For columns with high cardinality, a larger initial capacity could reduce resize operations. Consider exposing these as constructor parameters or configuration options, similar to other hash map implementations in the codebase.core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeVarcharGroupByFunction.java (4)
47-52: Hard-coded capacity and load factor may not suit all workloads.Similar to
ModeSymbolGroupByFunction, theinitialCapacity = 4andloadFactor = 0.7dare hard-coded. For high-cardinality varchar columns (e.g., free-text fields), this may cause excessive resizing and memory churn.
47-52: Consider making capacity and load factor configurable or document the rationale.As with
ModeSymbolGroupByFunction, the hard-codedinitialCapacity = 4andloadFactor = 0.7dmay cause early resizing for moderate cardinality. Document the reasoning or consider making these tunable.
189-198: No change needed for capacity() loop
GroupByUtf8SequenceLongHashMap does not expose an iterator or entry-only traversal API, so iterating overcapacity()and filtering out empty slots is the correct—and only—way to find occupied entries. Optionally, add a brief code comment explaining this choice for future maintainers.
47-52: Consider making initial capacity and load factor configurable.Similar to
ModeSymbolGroupByFunction, the hardcodedinitialCapacity = 4andloadFactor = 0.7dmay not be optimal for high-cardinality VARCHAR columns. Consider exposing these as configuration options.core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeDoubleGroupByFunctionFactoryTest.java (2)
30-270: Consider adding floating-point precision edge case tests.While the test suite is comprehensive, it doesn't explicitly test scenarios where floating-point precision matters:
- Very close but distinct values (e.g., 1.0 and 1.0 + Double.MIN_VALUE)
- NaN and Infinity handling
- Subnormal numbers
These edge cases could expose issues in the hash map key comparison logic for doubles.
30-271: Consider adding tests for NaN and Infinity.The test suite is comprehensive, but consider adding edge cases for
Double.NaNandDouble.POSITIVE_INFINITY/Double.NEGATIVE_INFINITYto ensure they're handled correctly (e.g., are NaN values treated as distinct or ignored? What happens if all values are NaN?).benchmarks/src/main/java/org/questdb/GroupByUtf8SequenceLongHashMapBenchmark.java (1)
86-93: Hoist ptr retrieval outside the loopGroupByUtf8SequenceLongHashMap.ptr() never changes on put (no relocation), so updating mapPtr after each iteration is redundant—move
mapPtr = …ptr()to before the loop.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (34)
benchmarks/src/main/java/org/questdb/GroupByCharSequenceLongHashMapBenchmark.java(1 hunks)benchmarks/src/main/java/org/questdb/GroupByLongLongHashMapBenchmark.java(1 hunks)benchmarks/src/main/java/org/questdb/GroupByUtf8SequenceLongHashMapBenchmark.java(1 hunks)core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeBooleanGroupByFunction.java(1 hunks)core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeBooleanGroupByFunctionFactory.java(1 hunks)core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeDoubleGroupByFunction.java(1 hunks)core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeDoubleGroupByFunctionFactory.java(1 hunks)core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeLongGroupByFunction.java(1 hunks)core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeLongGroupByFunctionFactory.java(1 hunks)core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeStringGroupByFunction.java(1 hunks)core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeStringGroupByFunctionFactory.java(1 hunks)core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeSymbolGroupByFunction.java(1 hunks)core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeSymbolGroupByFunctionFactory.java(1 hunks)core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeVarcharGroupByFunction.java(1 hunks)core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeVarcharGroupByFunctionFactory.java(1 hunks)core/src/main/java/io/questdb/griffin/engine/groupby/GroupByCharSequenceLongHashMap.java(1 hunks)core/src/main/java/io/questdb/griffin/engine/groupby/GroupByLongLongHashMap.java(1 hunks)core/src/main/java/io/questdb/griffin/engine/groupby/GroupByUtf8SequenceLongHashMap.java(1 hunks)core/src/main/java/io/questdb/std/CharSequenceLongHashMap.java(1 hunks)core/src/main/java/io/questdb/std/IntLongHashMap.java(1 hunks)core/src/main/java/io/questdb/std/LongLongHashMap.java(1 hunks)core/src/main/java/io/questdb/std/Utf8SequenceLongHashMap.java(1 hunks)core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeBooleanGroupByFunctionFactoryTest.java(1 hunks)core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeDoubleGroupByFunctionFactoryTest.java(1 hunks)core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeGroupByFunctionTest.java(1 hunks)core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeIntegrationTest.java(1 hunks)core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeLongGroupByFunctionFactoryTest.java(1 hunks)core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeStringGroupByFunctionFactoryTest.java(1 hunks)core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeSymbolGroupByFunctionFactoryTest.java(1 hunks)core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeVarcharGroupByFunctionFactoryTest.java(1 hunks)core/src/test/java/io/questdb/test/griffin/engine/groupby/GroupByCharSequenceLongHashMapTest.java(1 hunks)core/src/test/java/io/questdb/test/griffin/engine/groupby/GroupByLongLongHashMapTest.java(1 hunks)core/src/test/java/io/questdb/test/griffin/engine/groupby/GroupByUtf8SequenceLongHashMapTest.java(1 hunks)core/src/test/java/io/questdb/test/std/Utf8SequenceLongHashMapTest.java(1 hunks)
👮 Files not reviewed due to content moderation or server errors (8)
- core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeBooleanGroupByFunctionFactoryTest.java
- core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeIntegrationTest.java
- benchmarks/src/main/java/org/questdb/GroupByUtf8SequenceLongHashMapBenchmark.java
- core/src/main/java/io/questdb/std/Utf8SequenceLongHashMap.java
- core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeStringGroupByFunctionFactoryTest.java
- core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeSymbolGroupByFunction.java
- core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeVarcharGroupByFunction.java
- core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeDoubleGroupByFunctionFactoryTest.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
- GitHub Check: New pull request (Coverage Report Coverage Report)
- GitHub Check: New pull request (SelfHosted Other tests on linux-arm64)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-x64-zfs)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-arm64)
- GitHub Check: New pull request (SelfHosted Other tests on linux-x64-zfs)
- GitHub Check: New pull request (Hosted Running tests on windows-other)
- GitHub Check: New pull request (Hosted Running tests on windows-pgwire)
- GitHub Check: New pull request (Hosted Running tests on windows-cairo)
- GitHub Check: New pull request (Hosted Running tests on windows-fuzz2)
- GitHub Check: New pull request (Hosted Running tests on windows-fuzz1)
- GitHub Check: New pull request (Hosted Running tests on windows-griffin)
- GitHub Check: New pull request (Hosted Running tests on mac-other)
- GitHub Check: New pull request (Hosted Running tests on mac-pgwire)
- GitHub Check: New pull request (Hosted Running tests on mac-cairo-fuzz)
- GitHub Check: New pull request (Hosted Running tests on mac-cairo)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start ARM Agent)
- GitHub Check: New pull request (Hosted Running tests on mac-griffin)
- GitHub Check: New pull request (SelfHosted Other tests Start X64Zfs Agent)
- GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start X64Zfs Agent)
- GitHub Check: New pull request (SelfHosted Other tests Start ARM Agent)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-other)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-pgwire)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-cairo)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-fuzz)
- GitHub Check: New pull request (Hosted Running tests with cover on linux-griffin)
- GitHub Check: New pull request (Rust Test and Lint on linux-jdk17)
- GitHub Check: New pull request (Check Changes Check changes)
…o try to avoid future confusion
[PR Coverage check]😍 pass : 776 / 859 (90.34%) file detail
|
Closes #5580
mode()returns the most frequent appearance of a column value, for each group. Withoutmode(), you must first perform acount()operation. Then you must sort the result set and take the value corresponding to the highest count. This requires complex subqueries.This has a variety of uses. For example, returning the most frequently traded symbol by a trader, or identifying the most frequent state for a sensor.
It is implemented using a map of maps, where the outer map stores pointers to inner maps, and each inner map aggregates a per-key count. These are merged into one set of top level maps, which are then scanned for the most frequent value.
todo:
mode(BOOLEAN)GroupByLongLongHashMapmode(LONG)run in parallel using the new off-heap mapmode(VARCHAR)run in parallel using the new off-heap mapmode(STRING)run in parallel using the new off-heap mapmode(SYMBOL)run in parallel using the new off-heap mapmode(DOUBLE)run in parallel using the new off-heap mapBenchmark
For
GroupByLongHashSet,GroupByLongLongHashMap,LongLongHashMap: (not isolated)Details
For
GroupbyUtf8SequenceLongHashMapandUtf8SequenceLongHashMap: (not isolated)Details
Benchmark (size) Mode Cnt Score Error Units GroupByUtf8SequenceLongHashMapBenchmark.testGroupByUtf8SequenceLongHashMap 10 avgt 3 1078.801 ± 90.283 ns/op GroupByUtf8SequenceLongHashMapBenchmark.testGroupByUtf8SequenceLongHashMap 50 avgt 3 2287.050 ± 213.875 ns/op GroupByUtf8SequenceLongHashMapBenchmark.testGroupByUtf8SequenceLongHashMap 200 avgt 3 7145.411 ± 86.704 ns/op GroupByUtf8SequenceLongHashMapBenchmark.testGroupByUtf8SequenceLongHashMap 1000 avgt 3 34169.157 ± 2721.831 ns/op GroupByUtf8SequenceLongHashMapBenchmark.testUtf8SequenceLongHashMap 10 avgt 3 778.603 ± 51.911 ns/op GroupByUtf8SequenceLongHashMapBenchmark.testUtf8SequenceLongHashMap 50 avgt 3 1972.678 ± 376.840 ns/op GroupByUtf8SequenceLongHashMapBenchmark.testUtf8SequenceLongHashMap 200 avgt 3 6501.634 ± 133.595 ns/op GroupByUtf8SequenceLongHashMapBenchmark.testUtf8SequenceLongHashMap 1000 avgt 3 29782.794 ± 2519.313 ns/opFor
GroupbyCharSequenceLongHashMapandCharSequenceLongHashMap: (not isolated)Details
Benchmark (size) Mode Cnt Score Error Units GroupByCharSequenceLongHashMapBenchmark.testCharSequenceLongHashMap 10 avgt 3 503.610 ± 415.333 ns/op GroupByCharSequenceLongHashMapBenchmark.testCharSequenceLongHashMap 50 avgt 3 756.659 ± 178.424 ns/op GroupByCharSequenceLongHashMapBenchmark.testCharSequenceLongHashMap 200 avgt 3 2039.515 ± 222.997 ns/op GroupByCharSequenceLongHashMapBenchmark.testCharSequenceLongHashMap 1000 avgt 3 8263.961 ± 811.661 ns/op GroupByCharSequenceLongHashMapBenchmark.testGroupByCharSequenceLongHashMap 10 avgt 3 583.044 ± 105.280 ns/op GroupByCharSequenceLongHashMapBenchmark.testGroupByCharSequenceLongHashMap 50 avgt 3 1159.440 ± 954.187 ns/op GroupByCharSequenceLongHashMapBenchmark.testGroupByCharSequenceLongHashMap 200 avgt 3 3150.716 ± 132.363 ns/op GroupByCharSequenceLongHashMapBenchmark.testGroupByCharSequenceLongHashMap 1000 avgt 3 11260.364 ± 256.576 ns/op