Skip to content

perf(sql): speed up row counting in filter queries#6580

Merged
bluestreak01 merged 12 commits intomasterfrom
puzpuzpuz_count_only_jit
Dec 28, 2025
Merged

perf(sql): speed up row counting in filter queries#6580
bluestreak01 merged 12 commits intomasterfrom
puzpuzpuz_count_only_jit

Conversation

@puzpuzpuz
Copy link
Copy Markdown
Contributor

@puzpuzpuz puzpuzpuz commented Dec 26, 2025

Implements a fast-path for count-only (SELECT count(*) FROM table WHERE ...) queries that avoids materializing matching rows, instead just incrementing a counter. This is highly effective when many rows match the filter.

Key changes:

  1. New JIT count-only function:
    - Does not accept rows list for filtered row indexes, but only return the number of rows satisfying the filter
    - 16-bit, 32-bit, and 64-bit SIMD paths have no vector-to-scalar domain crossing while the generic SIMD path uses popcnt instruction on bitmasks - much faster than scatter-gather
  2. Simplified row index handling:
    - Removed rows_start_offset parameter from filter functions as it was always set to 0
    - Functions now work with relative indices directly, reducing arithmetic overhead
  3. New countOnly flag in PageFrameReduceTask
    - Used only in calculateSize() implementation in Async*Filtered factories

JIT assembly before/after

Sample query:

SELECT count(*) FROM hits WHERE RefererHash = 3594120000172545465;

AVX2 Loop Comparison

Aspect Before (Regular Filter) After (Count-Only)
Loop body instructions ~20 6
Memory writes Yes (vmovdqu to store row IDs) None
Conditional branch in loop Yes (test/jz to skip scatter) None
Row ID tracking Yes (vpaddq per iteration) None
Domain crossing Yes (vmovmskpd, vpmovmskb) None
Expensive scatter logic Yes (pext/pdep/vpermps) None

AVX2 loop body before (compacted to instructions only):

vmovdqu, vpcmpeqq, vmovmskpd, test, jz,
vpmovmskb, mov, pext, movabs, pdep, vmovq,
vpmovzxbd, vpermps, vmovdqu, popcnt, add,
vpaddq, add, cmp, jl

AVX2 loop body after:

vmovdqu ymm0, [rbp+rbx*8]           ; load 4 x i64
vpcmpeqq ymm0, ymm0, ymm2           ; compare
vpsubq ymm1, ymm1, ymm0             ; accumulate count
add rbx, 4                          ; index += 4
cmp rbx, rcx
jl L3

The count-only optimization eliminates:

  • Row ID scatter logic (pext/pdep/vpermps)
  • Memory writes for row IDs
  • Row ID tracking register updates
  • Conditional branches in the hot loop
  • Vector-to-scalar domain crossings

This reduces the loop from ~20 instructions to 6 instructions (~70% reduction), with all remaining instructions being simple and fully pipelined.

Benchmarks

Key observations:

  1. Count-only queries are dramatically faster - testCountOnlyFilter shows 60-92% improvements for NEQ predicates
  2. Single-column count-only scalar benchmarks show ~30% regression for i16 and i32, but in reality the functions will use SIMD. Also, the absolute numbers are pretty small for both before and after (< 50ms). The new count-only loops are simpler and have less instructions
  3. Regular filter queries (testFilter) are unchanged - as expected, since they need to materialize rows
  4. JIT DISABLED also benefits - the optimization works at a higher level than JIT compilation
  5. NEQ benefits most - predicates matching many rows see the biggest gains (no need to materialize/copy all matching data)

Before:

Benchmark                                          (boolOperation)  (jitMode)  (predicate)  Mode  Cnt    Score   Error  Units
SqlJitCompilerScalarBenchmark.testCountOnlyFilter              AND     SCALAR           EQ    ss   10   60.887 ± 2.758  ms/op
SqlJitCompilerScalarBenchmark.testCountOnlyFilter              AND     SCALAR          NEQ    ss   10  169.639 ± 4.343  ms/op
SqlJitCompilerScalarBenchmark.testCountOnlyFilter              AND   DISABLED           EQ    ss   10  168.690 ± 1.112  ms/op
SqlJitCompilerScalarBenchmark.testCountOnlyFilter              AND   DISABLED          NEQ    ss   10  459.096 ± 9.357  ms/op
SqlJitCompilerScalarBenchmark.testCountOnlyFilter               OR     SCALAR           EQ    ss   10   76.043 ± 2.250  ms/op
SqlJitCompilerScalarBenchmark.testCountOnlyFilter               OR     SCALAR          NEQ    ss   10  142.633 ± 5.501  ms/op
SqlJitCompilerScalarBenchmark.testCountOnlyFilter               OR   DISABLED           EQ    ss   10  418.715 ± 3.413  ms/op
SqlJitCompilerScalarBenchmark.testCountOnlyFilter               OR   DISABLED          NEQ    ss   10  290.168 ± 6.095  ms/op
SqlJitCompilerScalarBenchmark.testFilter                       AND     SCALAR           EQ    ss   10   61.347 ± 2.677  ms/op
SqlJitCompilerScalarBenchmark.testFilter                       AND     SCALAR          NEQ    ss   10  267.664 ± 3.972  ms/op
SqlJitCompilerScalarBenchmark.testFilter                       AND   DISABLED           EQ    ss   10  169.923 ± 3.423  ms/op
SqlJitCompilerScalarBenchmark.testFilter                       AND   DISABLED          NEQ    ss   10  556.205 ± 8.773  ms/op
SqlJitCompilerScalarBenchmark.testFilter                        OR     SCALAR           EQ    ss   10   76.973 ± 2.633  ms/op
SqlJitCompilerScalarBenchmark.testFilter                        OR     SCALAR          NEQ    ss   10  239.186 ± 6.964  ms/op
SqlJitCompilerScalarBenchmark.testFilter                        OR   DISABLED           EQ    ss   10  418.001 ± 5.192  ms/op
SqlJitCompilerScalarBenchmark.testFilter                        OR   DISABLED          NEQ    ss   10  385.776 ± 5.530  ms/op

Benchmark                                        (column)  (jitMode)  (predicate)  Mode  Cnt    Score    Error  Units
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i64       SIMD           EQ    ss   10   33.242 ±  1.686  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i64       SIMD          NEQ    ss   10  128.578 ±  8.023  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i64     SCALAR           EQ    ss   10   59.244 ±  1.267  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i64     SCALAR          NEQ    ss   10  134.607 ±  6.808  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i64   DISABLED           EQ    ss   10   38.762 ±  2.664  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i64   DISABLED          NEQ    ss   10  270.637 ±  6.399  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i32       SIMD           EQ    ss   10   17.428 ±  1.527  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i32       SIMD          NEQ    ss   10  119.956 ±  3.422  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i32     SCALAR           EQ    ss   10   36.946 ±  1.225  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i32     SCALAR          NEQ    ss   10  121.141 ±  2.285  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i32   DISABLED           EQ    ss   10   26.422 ±  2.199  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i32   DISABLED          NEQ    ss   10  252.666 ±  9.753  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i16       SIMD           EQ    ss   10    9.846 ±  1.636  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i16       SIMD          NEQ    ss   10  113.872 ±  3.149  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i16     SCALAR           EQ    ss   10   31.785 ±  1.739  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i16     SCALAR          NEQ    ss   10  118.642 ±  1.883  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i16   DISABLED           EQ    ss   10   87.089 ±  2.444  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i16   DISABLED          NEQ    ss   10  250.200 ± 16.372  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i64       SIMD           EQ    ss   10   33.717 ±  2.084  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i64       SIMD          NEQ    ss   10  228.001 ±  6.670  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i64     SCALAR           EQ    ss   10   60.238 ±  2.321  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i64     SCALAR          NEQ    ss   10  233.769 ±  5.384  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i64   DISABLED           EQ    ss   10   37.377 ±  1.793  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i64   DISABLED          NEQ    ss   10  373.045 ± 18.863  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i32       SIMD           EQ    ss   10   18.088 ±  2.511  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i32       SIMD          NEQ    ss   10  218.096 ±  4.387  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i32     SCALAR           EQ    ss   10   37.320 ±  2.226  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i32     SCALAR          NEQ    ss   10  218.514 ±  4.978  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i32   DISABLED           EQ    ss   10   25.965 ±  1.699  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i32   DISABLED          NEQ    ss   10  349.078 ±  5.742  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i16       SIMD           EQ    ss   10   10.094 ±  2.448  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i16       SIMD          NEQ    ss   10  209.443 ±  4.983  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i16     SCALAR           EQ    ss   10   44.444 ±  1.781  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i16     SCALAR          NEQ    ss   10  215.291 ±  5.164  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i16   DISABLED           EQ    ss   10   87.794 ±  4.303  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i16   DISABLED          NEQ    ss   10  349.063 ± 20.755  ms/op

After:

Benchmark                                          (boolOperation)  (jitMode)  (predicate)  Mode  Cnt    Score    Error  Units
SqlJitCompilerScalarBenchmark.testCountOnlyFilter              AND     SCALAR           EQ    ss   10   50.644 ±  1.960  ms/op
SqlJitCompilerScalarBenchmark.testCountOnlyFilter              AND     SCALAR          NEQ    ss   10   70.728 ±  3.136  ms/op
SqlJitCompilerScalarBenchmark.testCountOnlyFilter              AND   DISABLED           EQ    ss   10  103.500 ±  2.751  ms/op
SqlJitCompilerScalarBenchmark.testCountOnlyFilter              AND   DISABLED          NEQ    ss   10  127.538 ±  2.995  ms/op
SqlJitCompilerScalarBenchmark.testCountOnlyFilter               OR     SCALAR           EQ    ss   10   76.137 ±  2.354  ms/op
SqlJitCompilerScalarBenchmark.testCountOnlyFilter               OR     SCALAR          NEQ    ss   10   49.824 ±  1.903  ms/op
SqlJitCompilerScalarBenchmark.testCountOnlyFilter               OR   DISABLED           EQ    ss   10  435.005 ±  3.147  ms/op
SqlJitCompilerScalarBenchmark.testCountOnlyFilter               OR   DISABLED          NEQ    ss   10  191.398 ±  6.133  ms/op
SqlJitCompilerScalarBenchmark.testFilter                       AND     SCALAR           EQ    ss   10   49.515 ±  1.491  ms/op
SqlJitCompilerScalarBenchmark.testFilter                       AND     SCALAR          NEQ    ss   10  264.311 ±  3.700  ms/op
SqlJitCompilerScalarBenchmark.testFilter                       AND   DISABLED           EQ    ss   10  168.271 ±  3.637  ms/op
SqlJitCompilerScalarBenchmark.testFilter                       AND   DISABLED          NEQ    ss   10  545.947 ± 40.527  ms/op
SqlJitCompilerScalarBenchmark.testFilter                        OR     SCALAR           EQ    ss   10   76.926 ±  1.515  ms/op
SqlJitCompilerScalarBenchmark.testFilter                        OR     SCALAR          NEQ    ss   10  232.888 ±  5.406  ms/op
SqlJitCompilerScalarBenchmark.testFilter                        OR   DISABLED           EQ    ss   10  417.487 ±  2.469  ms/op
SqlJitCompilerScalarBenchmark.testFilter                        OR   DISABLED          NEQ    ss   10  389.163 ± 13.271  ms/op

Benchmark                                        (column)  (jitMode)  (predicate)  Mode  Cnt    Score    Error  Units
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i64       SIMD           EQ    ss   10   32.561 ±  1.419  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i64       SIMD          NEQ    ss   10   32.608 ±  1.890  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i64     SCALAR           EQ    ss   10   58.855 ±  1.199  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i64     SCALAR          NEQ    ss   10   47.848 ±  2.655  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i64   DISABLED           EQ    ss   10   37.862 ±  1.454  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i64   DISABLED          NEQ    ss   10   38.049 ±  1.807  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i32       SIMD           EQ    ss   10   17.954 ±  3.978  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i32       SIMD          NEQ    ss   10   17.044 ±  1.446  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i32     SCALAR           EQ    ss   10   48.817 ±  1.083  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i32     SCALAR          NEQ    ss   10   36.570 ±  1.209  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i32   DISABLED           EQ    ss   10   26.744 ±  1.739  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i32   DISABLED          NEQ    ss   10   27.016 ±  2.108  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i16       SIMD           EQ    ss   10    9.361 ±  1.607  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i16       SIMD          NEQ    ss   10    9.479 ±  1.177  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i16     SCALAR           EQ    ss   10   43.731 ±  1.465  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i16     SCALAR          NEQ    ss   10   32.213 ±  1.220  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i16   DISABLED           EQ    ss   10   25.497 ±  1.360  ms/op
SqlJitCompilerSimdBenchmark.testCountOnlyFilter       i16   DISABLED          NEQ    ss   10   29.312 ±  3.286  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i64       SIMD           EQ    ss   10   33.229 ±  1.790  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i64       SIMD          NEQ    ss   10  232.032 ± 15.309  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i64     SCALAR           EQ    ss   10   47.755 ±  1.295  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i64     SCALAR          NEQ    ss   10  232.249 ±  3.619  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i64   DISABLED           EQ    ss   10   37.868 ±  1.720  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i64   DISABLED          NEQ    ss   10  364.726 ± 17.587  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i32       SIMD           EQ    ss   10   19.452 ±  3.277  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i32       SIMD          NEQ    ss   10  216.426 ± 11.307  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i32     SCALAR           EQ    ss   10   37.137 ±  1.416  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i32     SCALAR          NEQ    ss   10  218.074 ±  6.994  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i32   DISABLED           EQ    ss   10   27.189 ±  1.922  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i32   DISABLED          NEQ    ss   10  360.586 ±  6.583  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i16       SIMD           EQ    ss   10   10.438 ±  2.168  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i16       SIMD          NEQ    ss   10  209.816 ±  3.728  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i16     SCALAR           EQ    ss   10   32.412 ±  1.769  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i16     SCALAR          NEQ    ss   10  211.853 ±  2.961  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i16   DISABLED           EQ    ss   10   87.083 ±  1.556  ms/op
SqlJitCompilerSimdBenchmark.testFilter                i16   DISABLED          NEQ    ss   10  351.622 ±  5.744  ms/op

@puzpuzpuz puzpuzpuz self-assigned this Dec 26, 2025
@puzpuzpuz puzpuzpuz added SQL Issues or changes relating to SQL execution Performance Performance improvements labels Dec 26, 2025
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Dec 26, 2025

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch puzpuzpuz_count_only_jit

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@puzpuzpuz puzpuzpuz marked this pull request as ready for review December 27, 2025 12:52
@glasstiger
Copy link
Copy Markdown
Contributor

[PR Coverage check]

😍 pass : 101 / 106 (95.28%)

file detail

path covered line new line coverage
🔵 io/questdb/griffin/SqlCodeGenerator.java 4 5 80.00%
🔵 io/questdb/cairo/sql/async/PageFrameSequence.java 5 6 83.33%
🔵 io/questdb/jit/CompiledCountOnlyFilter.java 16 19 84.21%
🔵 io/questdb/griffin/engine/table/AsyncFilteredNegativeLimitRecordCursor.java 1 1 100.00%
🔵 io/questdb/griffin/engine/table/AsyncFilteredRecordCursor.java 12 12 100.00%
🔵 io/questdb/griffin/engine/table/AsyncFilteredRecordCursorFactory.java 15 15 100.00%
🔵 io/questdb/jit/CompiledFilter.java 1 1 100.00%
🔵 io/questdb/griffin/engine/table/AsyncFilterUtils.java 1 1 100.00%
🔵 io/questdb/cairo/sql/async/PageFrameReduceTask.java 9 9 100.00%
🔵 io/questdb/griffin/engine/table/AsyncJitFilteredRecordCursorFactory.java 37 37 100.00%

@bluestreak01 bluestreak01 merged commit 23202fc into master Dec 28, 2025
48 checks passed
@bluestreak01 bluestreak01 deleted the puzpuzpuz_count_only_jit branch December 28, 2025 22:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Performance Performance improvements SQL Issues or changes relating to SQL execution

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants