Skip to content

feat(sql): implement mode(Ø/T/L/K/S/D) group by functions#5623

Merged
bluestreak01 merged 35 commits intomasterfrom
nw_mode
Oct 1, 2025
Merged

feat(sql): implement mode(Ø/T/L/K/S/D) group by functions#5623
bluestreak01 merged 35 commits intomasterfrom
nw_mode

Conversation

@nwoolmer
Copy link
Copy Markdown
Contributor

@nwoolmer nwoolmer commented Apr 24, 2025

Closes #5580

mode() returns the most frequent appearance of a column value, for each group. Without mode(), you must first perform a count() operation. Then you must sort the result set and take the value corresponding to the highest count. This requires complex subqueries.

This has a variety of uses. For example, returning the most frequently traded symbol by a trader, or identifying the most frequent state for a sensor.

It is implemented using a map of maps, where the outer map stores pointers to inner maps, and each inner map aggregates a per-key count. These are merged into one set of top level maps, which are then scanned for the most frequent value.

todo:

  • implement single-threaded versions of the functions
  • implement parallel mode(BOOLEAN)
  • add new GroupByLongLongHashMap
  • add benchmark for new map
  • add tests for new map
  • make mode(LONG) run in parallel using the new off-heap map
  • make mode(VARCHAR) run in parallel using the new off-heap map
  • make mode(STRING) run in parallel using the new off-heap map
  • make mode(SYMBOL) run in parallel using the new off-heap map
  • make mode(DOUBLE) run in parallel using the new off-heap map
  • add tests for all six functions

Benchmark

For GroupByLongHashSet, GroupByLongLongHashMap, LongLongHashMap: (not isolated)

Details
Benchmark (size) Mode Cnt Score Error Units
GroupByLongLongHashMapBenchmark.testGroupByLongHashSet 5000 avgt 3 8.506 1.401 ns/op
GroupByLongLongHashMapBenchmark.testGroupByLongHashSet 50000 avgt 3 14.323 2.399 ns/op
GroupByLongLongHashMapBenchmark.testGroupByLongHashSet 500000 avgt 3 31.933 4.347 ns/op
GroupByLongLongHashMapBenchmark.testGroupByLongHashSet 5000000 avgt 3 52.546 16.526 ns/op
GroupByLongLongHashMapBenchmark.testGroupByLongLongHashMap 5000 avgt 3 14.065 0.740 ns/op
GroupByLongLongHashMapBenchmark.testGroupByLongLongHashMap 50000 avgt 3 22.420 3.034 ns/op
GroupByLongLongHashMapBenchmark.testGroupByLongLongHashMap 500000 avgt 3 56.049 12.629 ns/op
GroupByLongLongHashMapBenchmark.testGroupByLongLongHashMap 5000000 avgt 3 80.433 8.899 ns/op
GroupByLongLongHashMapBenchmark.testLongLongHashMap 5000 avgt 3 16.445 0.487 ns/op
GroupByLongLongHashMapBenchmark.testLongLongHashMap 50000 avgt 3 20.678 1.930 ns/op
GroupByLongLongHashMapBenchmark.testLongLongHashMap 500000 avgt 3 53.339 21.817 ns/op
GroupByLongLongHashMapBenchmark.testLongLongHashMap 5000000 avgt 3 71.944 6.283 ns/op

For GroupbyUtf8SequenceLongHashMap and Utf8SequenceLongHashMap: (not isolated)

Details Benchmark (size) Mode Cnt Score Error Units GroupByUtf8SequenceLongHashMapBenchmark.testGroupByUtf8SequenceLongHashMap 10 avgt 3 1078.801 ± 90.283 ns/op GroupByUtf8SequenceLongHashMapBenchmark.testGroupByUtf8SequenceLongHashMap 50 avgt 3 2287.050 ± 213.875 ns/op GroupByUtf8SequenceLongHashMapBenchmark.testGroupByUtf8SequenceLongHashMap 200 avgt 3 7145.411 ± 86.704 ns/op GroupByUtf8SequenceLongHashMapBenchmark.testGroupByUtf8SequenceLongHashMap 1000 avgt 3 34169.157 ± 2721.831 ns/op GroupByUtf8SequenceLongHashMapBenchmark.testUtf8SequenceLongHashMap 10 avgt 3 778.603 ± 51.911 ns/op GroupByUtf8SequenceLongHashMapBenchmark.testUtf8SequenceLongHashMap 50 avgt 3 1972.678 ± 376.840 ns/op GroupByUtf8SequenceLongHashMapBenchmark.testUtf8SequenceLongHashMap 200 avgt 3 6501.634 ± 133.595 ns/op GroupByUtf8SequenceLongHashMapBenchmark.testUtf8SequenceLongHashMap 1000 avgt 3 29782.794 ± 2519.313 ns/op

For GroupbyCharSequenceLongHashMap and CharSequenceLongHashMap: (not isolated)

Details Benchmark (size) Mode Cnt Score Error Units GroupByCharSequenceLongHashMapBenchmark.testCharSequenceLongHashMap 10 avgt 3 503.610 ± 415.333 ns/op GroupByCharSequenceLongHashMapBenchmark.testCharSequenceLongHashMap 50 avgt 3 756.659 ± 178.424 ns/op GroupByCharSequenceLongHashMapBenchmark.testCharSequenceLongHashMap 200 avgt 3 2039.515 ± 222.997 ns/op GroupByCharSequenceLongHashMapBenchmark.testCharSequenceLongHashMap 1000 avgt 3 8263.961 ± 811.661 ns/op GroupByCharSequenceLongHashMapBenchmark.testGroupByCharSequenceLongHashMap 10 avgt 3 583.044 ± 105.280 ns/op GroupByCharSequenceLongHashMapBenchmark.testGroupByCharSequenceLongHashMap 50 avgt 3 1159.440 ± 954.187 ns/op GroupByCharSequenceLongHashMapBenchmark.testGroupByCharSequenceLongHashMap 200 avgt 3 3150.716 ± 132.363 ns/op GroupByCharSequenceLongHashMapBenchmark.testGroupByCharSequenceLongHashMap 1000 avgt 3 11260.364 ± 256.576 ns/op

@nwoolmer nwoolmer added the SQL Issues or changes relating to SQL execution label Apr 24, 2025
@nwoolmer nwoolmer changed the title implement mode(Ø/T/L/K/S) group by function feat(sql): implement mode(Ø/T/L/K/S) group by function Apr 24, 2025
… off-heap hashmap, passable by pointer, for use in group by queries.
@nwoolmer nwoolmer changed the title feat(sql): implement mode(Ø/T/L/K/S) group by function feat(sql): implement mode(Ø/T/L/K/S) group by functions Apr 25, 2025
@rmja
Copy link
Copy Markdown
Contributor

rmja commented Jun 17, 2025

Is there any reason that mode cannot be computed on a double value?

My use case is the following:

I have a devices that sends timestamped values like this:
"volume was XXXm3 at the start of this month"

I.e. I dont have a definitive timestamp for the value. The problem with this is of cause that the time on the device may not be exact, so when I receive this type of data in the beginning/end of the month, it may be that the month has changed in the device but not when the receiver makes its timestamp.

My idea to handle this in questdb is to store all values like this:

device_time TIMESTAMP
received_time TIMESTAMP
value DOUBLE

The device time may be set to an incorrect month, however if I could do:

SELECT mode(value) SAMPLE BY 1M

Then that would give me the correct value for the month, as it solves the problem by selecting the most observed value in the month.

@nwoolmer
Copy link
Copy Markdown
Contributor Author

Hi @rmja , thanks for the details.

No particular reason, just the original use case didn't require it. We can have more options. However, keeping cardinality in check is important, else you will exhaust RAM building hashmaps.

This is on small pause for the time being due to other priorities, but I will wrap it up when I can.

@nwoolmer nwoolmer changed the title feat(sql): implement mode(Ø/T/L/K/S) group by functions feat(sql): implement mode(Ø/T/L/K/S/D) group by functions Jun 21, 2025
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Sep 24, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Adds mode() aggregate functions for boolean, double, long, string, symbol, and varchar; supporting factories; allocator-backed group-by hash maps for long/charsequence/utf8; benchmarks comparing group-by maps vs standard maps; API extensions in std maps; visibility change for LongLongHashMap constructor; and comprehensive unit/integration tests.

Changes

Cohort / File(s) Summary of changes
Benchmarks
benchmarks/src/main/java/org/questdb/GroupByCharSequenceLongHashMapBenchmark.java, benchmarks/src/main/java/org/questdb/GroupByLongLongHashMapBenchmark.java, benchmarks/src/main/java/org/questdb/GroupByUtf8SequenceLongHashMapBenchmark.java
New JMH benchmarks comparing GroupBy* maps with standard maps across sizes; per-iteration setup/reset; main runners and @benchmark methods.
GroupBy hash maps (allocator-backed)
core/src/main/java/io/questdb/griffin/engine/groupby/GroupByCharSequenceLongHashMap.java, core/src/main/java/io/questdb/griffin/engine/groupby/GroupByLongLongHashMap.java, core/src/main/java/io/questdb/griffin/engine/groupby/GroupByUtf8SequenceLongHashMap.java
New flyweight hash maps with direct-memory layout, open addressing, pointer export/import, inc/put/get, mergeAdd, rehash, allocator integration.
Mode aggregators (functions)
core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeBooleanGroupByFunction.java, .../ModeDoubleGroupByFunction.java, .../ModeLongGroupByFunction.java, .../ModeStringGroupByFunction.java, .../ModeSymbolGroupByFunction.java, .../ModeVarcharGroupByFunction.java
New GroupByFunction implementations computing mode per type; maintain per-group pointer to internal counting maps; implement computeFirst/Next, merge, getters, planning, allocator wiring.
Mode function factories
core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeBooleanGroupByFunctionFactory.java, .../ModeDoubleGroupByFunctionFactory.java, .../ModeLongGroupByFunctionFactory.java, .../ModeStringGroupByFunctionFactory.java, .../ModeSymbolGroupByFunctionFactory.java, .../ModeVarcharGroupByFunctionFactory.java
New FunctionFactory classes exposing signatures (mode(T/D/L/S/K/Ø)), group-by flags, and newInstance wiring to corresponding functions.
Std map/API updates
core/src/main/java/io/questdb/std/CharSequenceLongHashMap.java, core/src/main/java/io/questdb/std/IntLongHashMap.java, core/src/main/java/io/questdb/std/LongLongHashMap.java, core/src/main/java/io/questdb/std/Utf8SequenceLongHashMap.java
Added inc methods to CharSequenceLongHashMap and IntLongHashMap; made LongLongHashMap(int,double) constructor public; added new Utf8SequenceLongHashMap with put/get/inc/rehash APIs.
Tests — mode aggregators
core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeBooleanGroupByFunctionFactoryTest.java, .../ModeDoubleGroupByFunctionFactoryTest.java, .../ModeGroupByFunctionTest.java, .../ModeIntegrationTest.java, .../ModeLongGroupByFunctionFactoryTest.java, .../ModeStringGroupByFunctionFactoryTest.java, .../ModeSymbolGroupByFunctionFactoryTest.java, .../ModeVarcharGroupByFunctionFactoryTest.java
New suites covering nulls, mixed values, grouping, sample by, ordering/limits, unicode/special strings, random data, and integration scenarios for mode() across types.
Tests — group-by maps
core/src/test/java/io/questdb/test/griffin/engine/groupby/GroupByCharSequenceLongHashMapTest.java, .../GroupByLongLongHashMapTest.java, .../GroupByUtf8SequenceLongHashMapTest.java
New tests validating allocator-backed maps: insert/get/inc, collisions, rehash, mergeAdd, pointer reuse, unicode handling, and random workloads.
Tests — std map
core/src/test/java/io/questdb/test/std/Utf8SequenceLongHashMapTest.java
New tests for Utf8SequenceLongHashMap: put/get/inc, remove, clear, putAll, index/value semantics, random scenarios.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant SQL as SQL Planner
  participant Exec as Execution Engine
  participant GB as GroupBy Operator
  participant Agg as mode() Aggregator
  participant Map as GroupBy* Map
  Client->>SQL: SELECT ... mode(col) FROM t [GROUP BY k] [SAMPLE BY ...]
  SQL->>Exec: Plan with mode() (per-type factory)
  Exec->>GB: Initialize groups and aggregators
  GB->>Agg: initValueTypes/setAllocator
  loop scan records
    GB->>Agg: computeFirst/computeNext(mapValue, record, rowId)
    Agg->>Map: inc(key) / put(key,1)
    Agg-->>GB: store/update map pointer in mapValue
  end
  par parallel workers (optional)
    GB->>Agg: merge(destValue, srcValue)
    Agg->>Map: mergeAdd(srcMap)
  end
  GB->>Agg: getX(record) for result materialization
  Agg->>Map: iterate to find max-count key
  Agg-->>GB: mode value
  GB-->>Exec: grouped rows with mode()
  Exec-->>Client: result set
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested labels

Core, Performance

Suggested reviewers

  • puzpuzpuz
  • mtopolnik

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.23% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name Status Explanation
Title Check ✅ Passed The title is a concise, single-sentence description highlighting the primary feature: implementing the SQL mode aggregate for group-by operations. It follows conventional commit style and correctly references the supported data types using shorthand notation. A teammate scanning the history can infer that this PR adds the mode() function across various input types.
Linked Issues Check ✅ Passed The linked issue #5580 requests a mode() aggregate function that returns the most frequent non-null value per group and supports both keyed and non-keyed queries. This PR implements mode() for boolean, long, double, varchar, string, and symbol types using off-heap group-by map implementations, integrates keyed and non-keyed execution, and includes comprehensive benchmarks and unit tests. These changes fully address the issue’s objectives by providing the requested functionality across all targeted data types.
Out of Scope Changes Check ✅ Passed All modifications in this PR—including new group-by function classes, custom off-heap map implementations, benchmark suites, test additions, and helper methods added to existing map classes—directly support the implementation and evaluation of the mode() aggregate. No unrelated or extraneous code changes are present beyond what is necessary for this feature. Therefore, there are no out-of-scope modifications in this pull request.
Description Check ✅ Passed The pull request description clearly outlines the addition of the mode() aggregate function, its motivation and use cases, the high-level implementation strategy using nested maps, and provides a completed checklist of related tasks and benchmark results that directly correspond to the changes in the diff.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@nwoolmer nwoolmer marked this pull request as ready for review September 29, 2025 16:31
@bluestreak01
Copy link
Copy Markdown
Member

@CodeRabbit review

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Sep 30, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 16

♻️ Duplicate comments (5)
core/src/main/java/io/questdb/griffin/engine/groupby/GroupByCharSequenceLongHashMap.java (1)

34-34: Previous review comment not addressed.

The earlier review noted that this class owns its memory and thus isn't truly a flyweight. The javadoc still describes it as a "flyweight hash map," which remains misleading.

benchmarks/src/main/java/org/questdb/GroupByCharSequenceLongHashMapBenchmark.java (2)

53-59: Static fields with @State(Scope.Thread) may cause issues.

Same issue as in GroupByLongLongHashMapBenchmark: static fields are shared across threads if JMH runs multiple threads. Consider making these instance fields to align with Scope.Thread semantics.


74-81: Verify allocator reuse after close().

Same issue as in GroupByLongLongHashMapBenchmark: Line 76 calls allocator.close(), and line 77 calls setAllocator(allocator). Ensure the allocator can be reused after close().

core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeStringGroupByFunction.java (1)

46-47: Make initialCapacity and loadFactor final constants.

These fields are initialized once in the class and never modified. They should be declared as final constants to prevent accidental modification and clearly express intent.

Apply this diff:

-    int initialCapacity = 4;
-    double loadFactor = 0.7d;
+    private static final int INITIAL_CAPACITY = 4;
+    private static final double LOAD_FACTOR = 0.7;

Then update the map initializations on lines 48-49:

-    GroupByCharSequenceLongHashMap mapA = new GroupByCharSequenceLongHashMap(initialCapacity, loadFactor, LONG_NULL, LONG_NULL);
-    GroupByCharSequenceLongHashMap mapB = new GroupByCharSequenceLongHashMap(initialCapacity, loadFactor, LONG_NULL, LONG_NULL);
+    GroupByCharSequenceLongHashMap mapA = new GroupByCharSequenceLongHashMap(INITIAL_CAPACITY, LOAD_FACTOR, LONG_NULL, LONG_NULL);
+    GroupByCharSequenceLongHashMap mapB = new GroupByCharSequenceLongHashMap(INITIAL_CAPACITY, LOAD_FACTOR, LONG_NULL, LONG_NULL);

Based on past review comments.

core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeVarcharGroupByFunction.java (1)

180-201: Verify determinism when multiple values share the maximum count.

Similar to ModeSymbolGroupByFunction, when multiple varchar values tie for the maximum count, the implementation returns the first one encountered during hash map iteration, which may be non-deterministic.

This is a duplicate concern across all Mode implementations. The verification script in the ModeSymbolGroupByFunction review will cover this as well.

🧹 Nitpick comments (23)
core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeVarcharGroupByFunctionFactory.java (1)

34-56: Factory implementation follows standard pattern.

The factory correctly:

  • Returns signature "mode(Ø)" for VARCHAR type
  • Indicates participation in group-by via isGroupBy() = true
  • Constructs ModeVarcharGroupByFunction with the first argument

This is consistent with the QuestDB function factory pattern.

Note: The implementation assumes the planner ensures args has at least one element. If you want defensive validation, consider adding:

 public Function newInstance(
         int position,
         ObjList<Function> args,
         IntList argPositions,
         CairoConfiguration configuration,
         SqlExecutionContext sqlExecutionContext
 ) {
+    if (args.size() != 1) {
+        throw SqlException.$(position, "mode() expects exactly 1 argument");
+    }
     return new ModeVarcharGroupByFunction(args.getQuick(0));
 }

However, this validation is typically done by the planner, so omitting it here is acceptable.

core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeLongGroupByFunction.java (2)

88-109: Mode-finding algorithm is correct but could be optimized.

The linear scan through all map slots to find the maximum count is correct but not optimal for large maps.

Consider tracking the mode key/count during updates (in computeFirst/computeNext) rather than scanning the entire map in getLong(). This would reduce the time complexity from O(capacity) to O(1) for retrieval, at the cost of slightly more work during updates.

However, if this is deferred optimization and current performance is acceptable, the current implementation is correct and safe.


46-49: Consider making initialCapacity and loadFactor configurable.

The hardcoded initialCapacity = 4 and loadFactor = 0.7d may not be optimal for all use cases. For high-cardinality columns (as warned in PR comments), this could lead to many rehashes.

Consider:

  1. Making these configurable via constructor parameters or configuration
  2. Adding documentation about memory implications for high-cardinality data
  3. Exposing metrics for map size/rehash counts for observability

This aligns with the PR comment from nwoolmer warning about "cardinality and RAM usage for hashmaps."

core/src/main/java/io/questdb/griffin/engine/groupby/GroupByLongLongHashMap.java (3)

78-93: inc(key, delta) does not validate delta sign.

The method increments a value by delta without checking if delta is negative. While this might be intentional for flexibility, negative deltas could produce nonsensical counts in the context of mode aggregation (where counts represent frequencies and should be non-negative).

If negative deltas are not a valid use case, add validation:

 public void inc(long key, long delta) {
     if (key != noKeyValue) {
+        if (delta < 0) {
+            throw new IllegalArgumentException("delta must be non-negative");
+        }
         long index = keyIndex(key);

Alternatively, document that negative deltas are supported if there's a valid use case (e.g., decrements).


110-123: of(long ptr) allocates new map even if ptr is 0—clarify intent.

When ptr == 0, the method allocates a new map and initializes it. However, the method name of suggests it's restoring from an existing pointer. The behavior is correct but potentially confusing.

Consider renaming to ofOrAllocate(long ptr) or add a javadoc comment:

+    /**
+     * Restores the map from the given pointer, or allocates a new map if ptr is 0.
+     */
     public GroupByLongLongHashMap of(long ptr) {

182-209: Rehashing frees old buffer before verifying new map is valid.

Line 208 calls allocator.free(oldPtr, ...) immediately after rehashing. If there's a bug in the rehashing logic (e.g., incorrect keyIndex calculation), the old data is already lost, making debugging difficult.

For safer debugging, consider deferring the free until after a validation pass (in debug builds):

     }
+    // In debug mode, validate the rehashed map before freeing old buffer
+    assert validateRehash(oldPtr, oldCapacity);
     allocator.free(oldPtr, HEADER_SIZE + 16L * oldCapacity);
 }
+
+private boolean validateRehash(long oldPtr, int oldCapacity) {
+    // Check that all old entries are present in new map
+    for (long p = oldPtr + HEADER_SIZE, lim = oldPtr + HEADER_SIZE + 16L * oldCapacity; p < lim; p += 16L) {
+        long key = Unsafe.getUnsafe().getLong(p);
+        if (key != noKeyValue) {
+            long value = Unsafe.getUnsafe().getLong(p + 8L);
+            assert get(key) == value : "rehash validation failed";
+        }
+    }
+    return true;
+}

This is optional and primarily for debugging.

core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeDoubleGroupByFunction.java (1)

46-49: Extract initialCapacity and loadFactor into configurable parameters
All Mode*GroupByFunction implementations (String, Varchar, Symbol, Long, Double) currently hardcode initialCapacity = 4 and loadFactor = 0.7d; consider exposing these via constructor arguments or central configuration and tuning based on expected cardinality or query hints.

benchmarks/src/main/java/org/questdb/GroupByLongLongHashMapBenchmark.java (1)

54-62: Static fields with @State(Scope.Thread) may cause issues.

The class uses @State(Scope.Thread), but the allocator, maps, and RNG are declared as static fields. This means they are shared across all threads if JMH runs multiple threads. If the benchmark is intended to be single-threaded (forks=1 in main), this is fine, but the pattern is unconventional. Typically, fields should be instance fields when using Scope.Thread.

Consider making these instance fields to align with Scope.Thread semantics:

-    private static final GroupByAllocator allocator = new FastGroupByAllocator(128 * 1024, Numbers.SIZE_1GB);
+    private GroupByAllocator allocator = new FastGroupByAllocator(128 * 1024, Numbers.SIZE_1GB);

(Apply similar changes to other static fields, and adjust reset() to recreate the allocator instead of closing and reusing it.)

core/src/main/java/io/questdb/std/Utf8SequenceLongHashMap.java (1)

78-80: Consider validating the index parameter.

The inc(int index) method assumes the index is negative (occupied slot) and directly accesses values[-index - 1]. If called with a positive index, this could cause array bounds issues.

Consider adding a precondition check:

 public void inc(int index) {
+    assert index < 0 : "inc(int) expects a negative index from keyIndex()";
     values[-index - 1]++;
 }

Or document the precondition in a JavaDoc comment.

core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeBooleanGroupByFunctionFactoryTest.java (1)

46-56: Document default false fallback in ModeBooleanGroupByFunction
The behaviour is intentional: when all inputs are NULL, both counters are set to LONG_NULL and getBool evaluates trueCount > falseCount as false. Add a Javadoc note in core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeBooleanGroupByFunction.java (and update the FunctionFactory contract) to explicitly document this default.

core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeStringGroupByFunction.java (1)

179-200: Optimize mode extraction with early exit.

The current implementation scans the entire capacity even after finding the mode. While functionally correct, consider that for large maps this scans every slot including empty ones.

Consider tracking the maximum count during insertion/merge to avoid the full scan:

// Add field:
private long cachedModeKey = LONG_NULL;
private long cachedModeCount = -1;

// Update in computeNext/merge to track max
// Then getStr becomes:
CharSequence getStr(Record record, GroupByCharSink sink) {
    long mapPtr = record.getLong(valueIndex);
    if (mapPtr <= 0) {
        return null;
    }
    mapA.of(mapPtr);
    if (cachedModeKey != LONG_NULL) {
        sink.of(cachedModeKey);
        return sink;
    }
    // fallback to scan...
}

Note: This optimization adds complexity and would require careful handling during merge operations.

core/src/main/java/io/questdb/griffin/engine/groupby/GroupByUtf8SequenceLongHashMap.java (1)

258-273: Probe method detects corruption but lacks reprobing limit.

The probe method correctly implements linear probing and detects when the table is full by checking if it wraps around to the starting index. However, for a properly sized table this should rarely happen.

Consider whether the error message on line 272 should include diagnostic information (size, capacity, key) to aid debugging if corruption occurs in production.

-        throw CairoException.critical(0).put("corrupt utf8 sequence long long hash map");
+        throw CairoException.critical(0)
+            .put("corrupt utf8 sequence long hash map [size=").put(size())
+            .put(", capacity=").put(capacity())
+            .put(", key=").put(key)
+            .put(']');
core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeSymbolGroupByFunction.java (4)

51-54: Hard-coded capacity and load factor may not suit all workloads.

The initialCapacity = 4 and loadFactor = 0.7d are hard-coded. For high-cardinality symbol columns, starting with capacity 4 may cause excessive resizing, while for low-cardinality columns it may waste memory. Consider making these configurable or adaptive based on symbol table characteristics.


51-54: Consider making capacity and load factor configurable or document the rationale.

The hard-coded initialCapacity = 4 and loadFactor = 0.7d may lead to early resizing in typical aggregation scenarios with moderate-to-high cardinality. If these values are based on empirical tuning or match conventions across other Mode implementations, consider adding a comment explaining the choice.


103-114: Capacity iteration is intentional—add clarifying comment
GroupByLongLongHashMap lacks an entry iterator; iterating over capacity with LONG_NULL checks is the correct approach. Add a comment above the loop explaining this intent for future maintainers.


51-54: Consider making initial capacity and load factor configurable.

The hardcoded initialCapacity = 4 and loadFactor = 0.7d may not be optimal for all use cases. For columns with high cardinality, a larger initial capacity could reduce resize operations. Consider exposing these as constructor parameters or configuration options, similar to other hash map implementations in the codebase.

core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeVarcharGroupByFunction.java (4)

47-52: Hard-coded capacity and load factor may not suit all workloads.

Similar to ModeSymbolGroupByFunction, the initialCapacity = 4 and loadFactor = 0.7d are hard-coded. For high-cardinality varchar columns (e.g., free-text fields), this may cause excessive resizing and memory churn.


47-52: Consider making capacity and load factor configurable or document the rationale.

As with ModeSymbolGroupByFunction, the hard-coded initialCapacity = 4 and loadFactor = 0.7d may cause early resizing for moderate cardinality. Document the reasoning or consider making these tunable.


189-198: No change needed for capacity() loop
GroupByUtf8SequenceLongHashMap does not expose an iterator or entry-only traversal API, so iterating over capacity() and filtering out empty slots is the correct—and only—way to find occupied entries. Optionally, add a brief code comment explaining this choice for future maintainers.


47-52: Consider making initial capacity and load factor configurable.

Similar to ModeSymbolGroupByFunction, the hardcoded initialCapacity = 4 and loadFactor = 0.7d may not be optimal for high-cardinality VARCHAR columns. Consider exposing these as configuration options.

core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeDoubleGroupByFunctionFactoryTest.java (2)

30-270: Consider adding floating-point precision edge case tests.

While the test suite is comprehensive, it doesn't explicitly test scenarios where floating-point precision matters:

  • Very close but distinct values (e.g., 1.0 and 1.0 + Double.MIN_VALUE)
  • NaN and Infinity handling
  • Subnormal numbers

These edge cases could expose issues in the hash map key comparison logic for doubles.


30-271: Consider adding tests for NaN and Infinity.

The test suite is comprehensive, but consider adding edge cases for Double.NaN and Double.POSITIVE_INFINITY / Double.NEGATIVE_INFINITY to ensure they're handled correctly (e.g., are NaN values treated as distinct or ignored? What happens if all values are NaN?).

benchmarks/src/main/java/org/questdb/GroupByUtf8SequenceLongHashMapBenchmark.java (1)

86-93: Hoist ptr retrieval outside the loop

GroupByUtf8SequenceLongHashMap.ptr() never changes on put (no relocation), so updating mapPtr after each iteration is redundant—move mapPtr = …ptr() to before the loop.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dea713f and cc2a1bc.

📒 Files selected for processing (34)
  • benchmarks/src/main/java/org/questdb/GroupByCharSequenceLongHashMapBenchmark.java (1 hunks)
  • benchmarks/src/main/java/org/questdb/GroupByLongLongHashMapBenchmark.java (1 hunks)
  • benchmarks/src/main/java/org/questdb/GroupByUtf8SequenceLongHashMapBenchmark.java (1 hunks)
  • core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeBooleanGroupByFunction.java (1 hunks)
  • core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeBooleanGroupByFunctionFactory.java (1 hunks)
  • core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeDoubleGroupByFunction.java (1 hunks)
  • core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeDoubleGroupByFunctionFactory.java (1 hunks)
  • core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeLongGroupByFunction.java (1 hunks)
  • core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeLongGroupByFunctionFactory.java (1 hunks)
  • core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeStringGroupByFunction.java (1 hunks)
  • core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeStringGroupByFunctionFactory.java (1 hunks)
  • core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeSymbolGroupByFunction.java (1 hunks)
  • core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeSymbolGroupByFunctionFactory.java (1 hunks)
  • core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeVarcharGroupByFunction.java (1 hunks)
  • core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeVarcharGroupByFunctionFactory.java (1 hunks)
  • core/src/main/java/io/questdb/griffin/engine/groupby/GroupByCharSequenceLongHashMap.java (1 hunks)
  • core/src/main/java/io/questdb/griffin/engine/groupby/GroupByLongLongHashMap.java (1 hunks)
  • core/src/main/java/io/questdb/griffin/engine/groupby/GroupByUtf8SequenceLongHashMap.java (1 hunks)
  • core/src/main/java/io/questdb/std/CharSequenceLongHashMap.java (1 hunks)
  • core/src/main/java/io/questdb/std/IntLongHashMap.java (1 hunks)
  • core/src/main/java/io/questdb/std/LongLongHashMap.java (1 hunks)
  • core/src/main/java/io/questdb/std/Utf8SequenceLongHashMap.java (1 hunks)
  • core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeBooleanGroupByFunctionFactoryTest.java (1 hunks)
  • core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeDoubleGroupByFunctionFactoryTest.java (1 hunks)
  • core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeGroupByFunctionTest.java (1 hunks)
  • core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeIntegrationTest.java (1 hunks)
  • core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeLongGroupByFunctionFactoryTest.java (1 hunks)
  • core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeStringGroupByFunctionFactoryTest.java (1 hunks)
  • core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeSymbolGroupByFunctionFactoryTest.java (1 hunks)
  • core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeVarcharGroupByFunctionFactoryTest.java (1 hunks)
  • core/src/test/java/io/questdb/test/griffin/engine/groupby/GroupByCharSequenceLongHashMapTest.java (1 hunks)
  • core/src/test/java/io/questdb/test/griffin/engine/groupby/GroupByLongLongHashMapTest.java (1 hunks)
  • core/src/test/java/io/questdb/test/griffin/engine/groupby/GroupByUtf8SequenceLongHashMapTest.java (1 hunks)
  • core/src/test/java/io/questdb/test/std/Utf8SequenceLongHashMapTest.java (1 hunks)
👮 Files not reviewed due to content moderation or server errors (8)
  • core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeBooleanGroupByFunctionFactoryTest.java
  • core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeIntegrationTest.java
  • benchmarks/src/main/java/org/questdb/GroupByUtf8SequenceLongHashMapBenchmark.java
  • core/src/main/java/io/questdb/std/Utf8SequenceLongHashMap.java
  • core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeStringGroupByFunctionFactoryTest.java
  • core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeSymbolGroupByFunction.java
  • core/src/main/java/io/questdb/griffin/engine/functions/groupby/ModeVarcharGroupByFunction.java
  • core/src/test/java/io/questdb/test/griffin/engine/functions/groupby/ModeDoubleGroupByFunctionFactoryTest.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
  • GitHub Check: New pull request (Coverage Report Coverage Report)
  • GitHub Check: New pull request (SelfHosted Other tests on linux-arm64)
  • GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-x64-zfs)
  • GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests on linux-arm64)
  • GitHub Check: New pull request (SelfHosted Other tests on linux-x64-zfs)
  • GitHub Check: New pull request (Hosted Running tests on windows-other)
  • GitHub Check: New pull request (Hosted Running tests on windows-pgwire)
  • GitHub Check: New pull request (Hosted Running tests on windows-cairo)
  • GitHub Check: New pull request (Hosted Running tests on windows-fuzz2)
  • GitHub Check: New pull request (Hosted Running tests on windows-fuzz1)
  • GitHub Check: New pull request (Hosted Running tests on windows-griffin)
  • GitHub Check: New pull request (Hosted Running tests on mac-other)
  • GitHub Check: New pull request (Hosted Running tests on mac-pgwire)
  • GitHub Check: New pull request (Hosted Running tests on mac-cairo-fuzz)
  • GitHub Check: New pull request (Hosted Running tests on mac-cairo)
  • GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start ARM Agent)
  • GitHub Check: New pull request (Hosted Running tests on mac-griffin)
  • GitHub Check: New pull request (SelfHosted Other tests Start X64Zfs Agent)
  • GitHub Check: New pull request (SelfHosted Griffin And Fuzz tests Start X64Zfs Agent)
  • GitHub Check: New pull request (SelfHosted Other tests Start ARM Agent)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-other)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-pgwire)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-cairo)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-fuzz)
  • GitHub Check: New pull request (Hosted Running tests with cover on linux-griffin)
  • GitHub Check: New pull request (Rust Test and Lint on linux-jdk17)
  • GitHub Check: New pull request (Check Changes Check changes)

Comment thread core/src/main/java/io/questdb/std/CharSequenceLongHashMap.java
mtopolnik
mtopolnik previously approved these changes Oct 1, 2025
@glasstiger
Copy link
Copy Markdown
Contributor

[PR Coverage check]

😍 pass : 776 / 859 (90.34%)

file detail

path covered line new line coverage
🔵 io/questdb/std/CharSequenceLongHashMap.java 0 8 00.00%
🔵 io/questdb/std/IntLongHashMap.java 0 6 00.00%
🔵 io/questdb/griffin/engine/functions/groupby/ModeSymbolGroupByFunction.java 61 73 83.56%
🔵 io/questdb/griffin/engine/functions/groupby/ModeSymbolGroupByFunctionFactory.java 6 7 85.71%
🔵 io/questdb/griffin/engine/functions/groupby/ModeDoubleGroupByFunction.java 55 64 85.94%
🔵 io/questdb/griffin/engine/functions/groupby/ModeLongGroupByFunction.java 55 64 85.94%
🔵 io/questdb/griffin/engine/functions/groupby/ModeBooleanGroupByFunction.java 26 30 86.67%
🔵 io/questdb/griffin/engine/functions/groupby/ModeVarcharGroupByFunction.java 64 73 87.67%
🔵 io/questdb/griffin/engine/functions/groupby/ModeStringGroupByFunction.java 65 73 89.04%
🔵 io/questdb/std/Utf8SequenceLongHashMap.java 87 92 94.57%
🔵 io/questdb/griffin/engine/groupby/GroupByLongLongHashMap.java 107 113 94.69%
🔵 io/questdb/griffin/engine/groupby/GroupByUtf8SequenceLongHashMap.java 115 118 97.46%
🔵 io/questdb/griffin/engine/groupby/GroupByCharSequenceLongHashMap.java 115 118 97.46%
🔵 io/questdb/griffin/engine/functions/groupby/ModeDoubleGroupByFunctionFactory.java 4 4 100.00%
🔵 io/questdb/griffin/engine/functions/groupby/ModeVarcharGroupByFunctionFactory.java 4 4 100.00%
🔵 io/questdb/griffin/engine/functions/groupby/ModeBooleanGroupByFunctionFactory.java 4 4 100.00%
🔵 io/questdb/griffin/engine/functions/groupby/ModeLongGroupByFunctionFactory.java 4 4 100.00%
🔵 io/questdb/griffin/engine/functions/groupby/ModeStringGroupByFunctionFactory.java 4 4 100.00%

@bluestreak01 bluestreak01 merged commit ff18a8f into master Oct 1, 2025
35 checks passed
@bluestreak01 bluestreak01 deleted the nw_mode branch October 1, 2025 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

SQL Issues or changes relating to SQL execution

Projects

None yet

Development

Successfully merging this pull request may close these issues.

mode() aggregate function

5 participants