
perf(sql): speed up JIT-compiled filters by reordering predicates and short-circuiting them#6568

Merged
bluestreak01 merged 40 commits into master from puzpuzpuz_jit_short_circuit
Dec 24, 2025

Conversation

@puzpuzpuz
Contributor

@puzpuzpuz puzpuzpuz commented Dec 19, 2025

This PR introduces several performance optimizations to the JIT-compiled filter execution in QuestDB.

Short-circuit evaluation

Implements short-circuit evaluation for scalar AND/OR predicate chains:

  • AND chains: If a predicate evaluates to false, skip remaining predicates and move to next row
  • OR chains: If a predicate evaluates to true, skip remaining predicates and store the row
  • IN() operator: Optimized internally as a short-circuit OR chain (only when it appears inside an outer AND chain)

New IR opcodes: And_Sc, Or_Sc, Begin_Sc, End_Sc
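As a rough illustration of what the new short-circuit control flow does at the row level, here is a plain-Python sketch (the predicate and row representations are hypothetical, not QuestDB's IR):

```python
# Illustrative sketch of short-circuit AND/OR predicate chains over rows.
# Not QuestDB's actual executor; rows and predicates are hypothetical.

def filter_rows_and(rows, predicates):
    """AND chain: bail out on the first predicate that is false."""
    out = []
    for row_id, row in enumerate(rows):
        for pred in predicates:
            if not pred(row):
                break          # short-circuit: skip remaining predicates
        else:
            out.append(row_id) # all predicates passed -> store the row
    return out

def filter_rows_or(rows, predicates):
    """OR chain: store the row on the first predicate that is true."""
    out = []
    for row_id, row in enumerate(rows):
        for pred in predicates:
            if pred(row):
                out.append(row_id)
                break          # short-circuit: skip remaining predicates
    return out

rows = [{"a": 1, "b": 2}, {"a": 5, "b": 2}, {"a": 1, "b": 9}]
and_preds = [lambda r: r["a"] == 1, lambda r: r["b"] == 2]
print(filter_rows_and(rows, and_preds))  # → [0]
```

For selective filters this means most rows never reach the later predicates, which is exactly the effect the opcodes above encode at the machine-code level.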

Predicate reordering

Predicates are automatically sorted by estimated selectivity to maximize short-circuit benefits:

  • Equality comparisons (=, !=) are prioritized over range comparisons: uuid eq > long eq > ... > others > ... > long neq > uuid neq
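The reordering can be sketched as a sort over per-predicate selectivity ranks (the rank values and predicate representation below are illustrative, not QuestDB's actual estimates):

```python
# Hypothetical selectivity ranks inspired by the ordering described above:
# wide-type equality first (most selective), wide-type inequality last.
SELECTIVITY_RANK = {
    ("uuid", "eq"): 0,
    ("long", "eq"): 1,
    ("int",  "eq"): 2,
    ("int",  "range"): 3,
    ("long", "neq"): 4,
    ("uuid", "neq"): 5,
}

def reorder_predicates(preds):
    """Sort predicates so the most selective ones run (and short-circuit) first."""
    return sorted(preds, key=lambda p: SELECTIVITY_RANK.get((p["type"], p["op"]), 3))

preds = [
    {"col": "ts",  "type": "long", "op": "neq"},
    {"col": "id",  "type": "uuid", "op": "eq"},
    {"col": "cnt", "type": "int",  "op": "eq"},
]
print([p["col"] for p in reorder_predicates(preds)])  # → ['id', 'cnt', 'ts']
```

Running the cheapest-to-fail predicate first maximizes how often the short-circuit jump fires on the first check.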

Register/memory hoisting

Several caches reduce redundant memory loads inside the hot loop:

| Cache | Purpose | Capacity |
| --- | --- | --- |
| ColumnAddressCache | Pre-loads column base addresses before the loop | 8 columns |
| ConstantCache | Pre-loads constants into GP/XMM registers | 8 constants |
| ColumnValueCache | Caches column values within a single row iteration | 8 values |
| ConstantCacheYmm | Pre-broadcasts constants into YMM registers (SIMD) | 8 constants |
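A minimal sketch of the hoisting idea, in plain Python with hypothetical column names: both functions compute the same result, but the second resolves column "addresses" and materializes constants once before the hot loop, which is what the caches above do at the register level.

```python
# Illustrative only: models column-address and constant hoisting, not the
# actual JIT register allocation. Column names are hypothetical.

def filter_unhoisted(table, n_rows):
    out = []
    for i in range(n_rows):
        # every iteration re-resolves the columns and re-creates the constants
        if table["CounterID"][i] == 62 and table["IsRefresh"][i] == 0:
            out.append(i)
    return out

def filter_hoisted(table, n_rows):
    counter_col = table["CounterID"]      # column base "address" loaded once
    refresh_col = table["IsRefresh"]
    COUNTER_CONST, REFRESH_CONST = 62, 0  # constants materialized once
    out = []
    for i in range(n_rows):
        if counter_col[i] == COUNTER_CONST and refresh_col[i] == REFRESH_CONST:
            out.append(i)
    return out
```

In the generated assembly this corresponds to the `mov col_addr_*` and `mov const_*` instructions moving from the loop body to the prologue.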

SIMD scatter short-circuiting

In the AVX2 SIMD loop, the scatter phase (writing matching row IDs) is now skipped when the mask is zero (no matches in the batch). This avoids expensive compress/scatter operations when filtering is highly selective.
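The scatter short-circuit can be sketched in NumPy terms (batch width and data are illustrative; the real code operates on AVX2 lane masks):

```python
import numpy as np

# Sketch of the SIMD scatter short-circuit: process rows in fixed-width
# batches and skip the compress/scatter step when no lane matched.

def filter_batched(values, target, batch=8):
    out = []
    for start in range(0, len(values), batch):
        mask = values[start:start + batch] == target
        if not mask.any():          # mask is all-zero: skip the scatter phase
            continue
        out.extend(start + np.flatnonzero(mask))  # compress matching row IDs
    return out

vals = np.array([5, 1, 5, 5, 5, 5, 5, 5,   # batch 0: one match at index 1
                 5, 5, 5, 5, 5, 5, 5, 5])  # batch 1: no matches -> skipped
print(filter_batched(vals, 1))  # → [1]
```

With a highly selective filter most batches take the `continue` path, so the expensive compress/scatter work is paid only for batches that actually contain matches.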

Code generation improvements

  • Uses TEST instead of CMP for zero-checks where possible
  • Removes redundant AND/OR instructions (short-circuit jumps suffice)
  • Avoids redundant TEST/SETE sequences for equality comparisons by tracking comparison flags
  • Adds missing register XOR instructions to break false dependencies
  • Makes the row index store conditional: previously it executed even when the filter was false for the given row
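The conditional-store change in the last bullet can be sketched as follows (illustrative Python, contrasting the old branchless store-then-maybe-advance pattern with the new guarded store):

```python
# Illustrative model of the output-writing schemes; not the actual codegen.

def collect_branchless(rows_out, matches):
    """Old scheme: always store the row ID, conditionally advance the index."""
    out_idx = 0
    for row_id, ok in enumerate(matches):
        rows_out[out_idx] = row_id  # unconditional store on every iteration
        out_idx += 1 if ok else 0   # index moves only when the filter passed
    return out_idx

def collect_conditional(rows_out, matches):
    """New scheme: store only when the filter passed."""
    out_idx = 0
    for row_id, ok in enumerate(matches):
        if ok:                      # non-matching rows skip the store entirely
            rows_out[out_idx] = row_id
            out_idx += 1
    return out_idx
```

Both produce the same output prefix; the difference is that the new scheme avoids the memory write (and the `lea`/`and` setup around it) on the common non-matching path.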

Benchmarks

On my box (Ryzen 7900X, 64 GB RAM, Ubuntu 24.04), I got the following difference in ClickBench's Hot Run (patch on the left, master on the right):

(Screenshot: ClickBench Hot Run side-by-side results, 2025-12-22)

I also got the following results from JMH benchmarks.

SqlJitCompilerScalarBenchmark (AND/OR predicate chains)

| Benchmark | Before | After | Change |
| --- | --- | --- | --- |
| AND + SCALAR + EQ | 111.2 ms | 64.2 ms | +42% faster ✅ |
| AND + SCALAR + NEQ | 292.8 ms | 278.1 ms | +5% faster ✅ |
| OR + SCALAR + EQ | 110.4 ms | 85.3 ms | +23% faster ✅ |
| OR + SCALAR + NEQ | 295.2 ms | 247.6 ms | +16% faster ✅ |

SqlJitCompilerSimdBenchmark (single predicate)

SIMD mode:

| Column | Predicate | Before | After | Change |
| --- | --- | --- | --- | --- |
| i64 | EQ | 39.7 ms | 35.8 ms | +10% faster ✅ |
| i64 | NEQ | 228.3 ms | 228.2 ms | ~same |
| i32 | EQ | 39.0 ms | 18.6 ms | +52% faster ✅ |
| i32 | NEQ | 215.9 ms | 217.6 ms | ~same |
| i16 | EQ | 33.0 ms | 10.3 ms | +69% faster ✅ |
| i16 | NEQ | 213.5 ms | 208.3 ms | +2% faster ✅ |

Scalar mode:

| Column | Predicate | Before | After | Change |
| --- | --- | --- | --- | --- |
| i64 | EQ | 59.8 ms | 63.3 ms | -6% ⚠️ |
| i64 | NEQ | 244.5 ms | 231.1 ms | +5% faster ✅ |
| i32 | EQ | 47.4 ms | 39.6 ms | +17% faster ✅ |
| i32 | NEQ | 224.3 ms | 218.0 ms | +3% faster ✅ |
| i16 | EQ | 44.6 ms | 46.3 ms | -4% ⚠️ |
| i16 | NEQ | 220.6 ms | 212.7 ms | +4% faster ✅ |

Key takeaways:

  1. Biggest wins: SIMD mode with selective filters (EQ) - up to 69% faster for i16
  2. Scalar AND/OR chains: Significant improvements (23-42% faster) due to short-circuit evaluation
  3. NEQ predicates (low selectivity): Modest improvements since most rows match anyway
  4. Minor regressions: Scalar mode with simple EQ predicates on i64/i16 shows slight overhead (~5%), likely due to additional setup for hoisting that doesn't pay off for simple single-predicate filters. On the other hand, single-predicate filters are compiled with SIMD in most cases.

A before/after comparison of the generated code

Let's consider the following query on the hits table from ClickBench:

SELECT COUNT(*)
FROM hits
WHERE CounterID = 62
  AND EventTime >= '2013-07-01T00:00:00Z' AND EventTime <= '2013-07-31T23:59:59Z'
  AND IsRefresh = 0
  AND TraficSourceID IN (-1, 6)
  AND RefererHash = 3594120000172545465;

On my box it takes 41ms on master and 24ms with this patch.

Summary of the before/after assembly:

  1. ~44% fewer instructions in the hot loop
  2. No redundant memory loads (TraficSourceID was loaded twice before)
  3. Constants and column addresses hoisted out of the loop
  4. Short-circuit evaluation - most rows fail early and skip remaining predicates
  5. Conditional store - only writes to output when row matches (vs. always writing then conditionally incrementing)
  6. No SETE/AND chains - replaced by direct conditional jumps

For highly selective filters (few matches), the short-circuit behavior provides the biggest win since most rows exit after the first 1-2 predicate checks.

On master, the assembly generated by the JIT looks like this:

.section .text {#0}
L1:                                         ; L1: i64@rax Func(u64@rdi data_ptr, i64@rsi data_size, u64@rdx varsize_aux_ptr, i64@rcx vars_ptr, u64@r8 vars_size, i64@r9 rows_ptr, u64@[0] rows_size, i64@[8] rows_id_start_offset, i64@[16] <none>)
push rbx
push rbp
push r12
mov r11, qword ptr [rsp+32]
mov r12, qword ptr [rsp+40]
mov r8, 0                                   ; mov input_index, 0
mov rax, 0                                  ; mov output_index, 0
cmp r8, r11                                 ; cmp input_index, rows_size
jge L3                                      ; jge L3
L2:                                         ; L2:
mov rcx, qword ptr [rdi+32]                 ; mov column_address, qword ptr [data_ptr+32]
mov rdx, qword ptr [rcx+r8*8]               ; mov i64_mem, qword ptr [column_address+input_index*8]
movabs rcx, 3594120000172545465             ; movabs i64_imm 40457657, 3594120000172545465
xor r10, r10                                ; xor %13, %13
cmp rdx, rcx                                ; cmp i64_mem, i64_imm 40457657
rex sete r10b                               ; sete %13@gpb
mov rcx, qword ptr [rdi+24]                 ; mov column_address, qword ptr [data_ptr+24]
mov edx, dword ptr [rcx+r8*4]               ; mov i32_mem, dword ptr [column_address+input_index*4]
mov ecx, 6                                  ; mov i32_imm 6, 6
xor ebx, ebx                                ; xor %17, %17
cmp edx, ecx                                ; cmp i32_mem, i32_imm 6
sete bl                                     ; sete %17@gpb
mov rcx, qword ptr [rdi+24]                 ; mov column_address, qword ptr [data_ptr+24]
mov edx, dword ptr [rcx+r8*4]               ; mov i32_mem, dword ptr [column_address+input_index*4]
mov ecx, -1                                 ; mov i32_imm -1, -1
xor esi, esi                                ; xor %21, %21
cmp edx, ecx                                ; cmp i32_mem, i32_imm -1
rex sete sil                                ; sete %21@gpb
; int32_or_start
or esi, ebx                                 ; or %21, %17
; int32_or_stop
mov rcx, qword ptr [rdi+16]                 ; mov column_address, qword ptr [data_ptr+16]
movsx edx, byte ptr [rcx+r8]                ; movsx i8_mem, byte ptr [column_address+input_index]
mov ecx, 0                                  ; mov i32_imm 0, 0
xor ebp, ebp                                ; xor %25, %25
cmp edx, ecx                                ; cmp i8_mem, i32_imm 0
rex sete bpl                                ; sete %25@gpb
mov rcx, qword ptr [rdi]                    ; mov column_address, qword ptr [data_ptr]
mov ebx, dword ptr [rcx+r8*4]               ; mov i32_mem, dword ptr [column_address+input_index*4]
mov ecx, 62                                 ; mov i32_imm 62, 62
xor edx, edx                                ; xor %29, %29
cmp ebx, ecx                                ; cmp i32_mem, i32_imm 62
sete dl                                     ; sete %29@gpb
and edx, ebp                                ; and %29, %25
and edx, esi                                ; and %29, %21
and edx, r10d                               ; and %29, %13@gpd
lea rcx, [r8+r12]                           ; lea input_index_+_rows_id_start_offset, [input_index+rows_id_start_offset]
mov qword ptr [r9+rax*8], rcx               ; mov qword ptr [rows_ptr+output_index*8], input_index_+_rows_id_start_offset
and edx, 1                                  ; and %29, 1
add rax, rdx                                ; add output_index, %29@gpq
add r8, 1                                   ; add input_index, 1
cmp r8, r11                                 ; cmp input_index, rows_size
jl L2                                       ; jl L2
L3:                                         ; L3:
L0:                                         ; L0:
pop r12
pop rbp
pop rbx
ret

With the patch, it's the following:

.section .text {#0}
L1:                                         ; L1: i64@rax Func(u64@rdi data_ptr, i64@rsi data_size, u64@rdx varsize_aux_ptr, i64@rcx vars_ptr, u64@r8 vars_size, i64@r9 rows_ptr, u64@[0] rows_size, i64@[8] rows_id_start_offset, i64@[16] <none>)
push rbx
push rbp
push r12
push r13
push r14
push r15
mov r14, qword ptr [rsp+56]
mov r15, qword ptr [rsp+64]
mov rdx, 0                                  ; mov input_index, 0
mov rax, 0                                  ; mov output_index, 0
mov r13, qword ptr [rdi+32]                 ; mov col_addr_4, qword ptr [data_ptr+32]
mov r12, qword ptr [rdi]                    ; mov col_addr_0, qword ptr [data_ptr]
mov r11, qword ptr [rdi+16]                 ; mov col_addr_2, qword ptr [data_ptr+16]
mov r10, qword ptr [rdi+24]                 ; mov col_addr_3, qword ptr [data_ptr+24]
mov r8, 3594120000172545465                 ; mov const_3594120000172545465, 3594120000172545465
mov rdi, 62                                 ; mov const_62, 62
mov rsi, 0                                  ; mov const_0, 0
mov rbp, 6                                  ; mov const_6, 6
mov rbx, -1                                 ; mov const_-1, -1
cmp rdx, r14                                ; cmp input_index, rows_size
jge L3                                      ; jge L3
L2:                                         ; L2:
mov rcx, qword ptr [r13+rdx*8]              ; mov col_4_i64, qword ptr [col_addr_4+input_index*8]
cmp rcx, r8                                 ; cmp col_4_i64, const_3594120000172545465
jne L4                                      ; jne L4
mov ecx, dword ptr [r12+rdx*4]              ; mov col_0_i32, dword ptr [col_addr_0+input_index*4]
cmp ecx, edi                                ; cmp col_0_i32, const_62@gpd
jne L4                                      ; jne L4
movsx ecx, byte ptr [r11+rdx]               ; movsx col_2_i8, byte ptr [col_addr_2+input_index]
cmp ecx, esi                                ; cmp col_2_i8, const_0@gpd
jne L4                                      ; jne L4
mov ecx, dword ptr [r10+rdx*4]              ; mov col_3_i32, dword ptr [col_addr_3+input_index*4]
cmp ecx, ebp                                ; cmp col_3_i32, const_6@gpd
je L6                                       ; je L6
cmp ecx, ebx                                ; cmp col_3_i32, const_-1@gpd
jne L4                                      ; jne L4
L6:                                         ; L6:
L5:                                         ; L5:
lea rcx, [rdx+r15]                          ; lea input_index_+_rows_id_start_offset, [input_index+rows_id_start_offset]
mov qword ptr [r9+rax*8], rcx               ; mov qword ptr [rows_ptr+output_index*8], input_index_+_rows_id_start_offset
add rax, 1                                  ; add output_index, 1
L4:                                         ; L4:
add rdx, 1                                  ; add input_index, 1
cmp rdx, r14                                ; cmp input_index, rows_size
short jl L2                                 ; jl L2
L3:                                         ; L3:
L0:                                         ; L0:
pop r15
pop r14
pop r13
pop r12
pop rbp
pop rbx
ret

@puzpuzpuz puzpuzpuz self-assigned this Dec 19, 2025
@puzpuzpuz puzpuzpuz added SQL Issues or changes relating to SQL execution Performance Performance improvements labels Dec 19, 2025

@puzpuzpuz puzpuzpuz force-pushed the puzpuzpuz_jit_short_circuit branch from 1c6938c to 6d5f1e7 Compare December 19, 2025 17:47
@puzpuzpuz puzpuzpuz force-pushed the puzpuzpuz_jit_short_circuit branch from d767bb3 to 82569bd Compare December 22, 2025 12:50
@puzpuzpuz puzpuzpuz marked this pull request as ready for review December 22, 2025 18:51
GitHub Actions - Rebuild Native Libraries and others added 3 commits December 22, 2025 18:55
@glasstiger
Contributor

[PR Coverage check]

😍 pass : 230 / 239 (96.23%)

file detail

| path | covered lines | new lines | coverage |
| --- | --- | --- | --- |
| 🔵 io/questdb/jit/CompiledFilterIRSerializer.java | 230 | 239 | 96.23% |

@bluestreak01 bluestreak01 merged commit 9052c26 into master Dec 24, 2025
45 checks passed
@bluestreak01 bluestreak01 deleted the puzpuzpuz_jit_short_circuit branch December 24, 2025 02:53
