
perf(sql): speed up JIT-compiled filters by reordering predicates and short-circuiting them#6568

Merged
bluestreak01 merged 40 commits into master from puzpuzpuz_jit_short_circuit
Dec 24, 2025

Conversation

@puzpuzpuz
Contributor

@puzpuzpuz puzpuzpuz commented Dec 19, 2025

This PR introduces several performance optimizations to the JIT-compiled filter execution in QuestDB.

Short-circuit evaluation

Implements short-circuit evaluation for scalar AND/OR predicate chains:

  • AND chains: If a predicate evaluates to false, skip remaining predicates and move to next row
  • OR chains: If a predicate evaluates to true, skip remaining predicates and store the row
  • IN() operator: Optimized internally as a short-circuit OR chain (only when it appears inside an outer AND chain)

New IR opcodes: And_Sc, Or_Sc, Begin_Sc, End_Sc
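As a rough illustration of what the new short-circuit control flow does at the row level, here is a plain-Python sketch (the predicate and row representations are hypothetical, not QuestDB's IR):

```python
# Illustrative sketch of short-circuit AND/OR predicate chains over rows.
# Not QuestDB's actual executor; rows and predicates are hypothetical.

def filter_rows_and(rows, predicates):
    """AND chain: bail out on the first predicate that is false."""
    out = []
    for row_id, row in enumerate(rows):
        for pred in predicates:
            if not pred(row):
                break          # short-circuit: skip remaining predicates
        else:
            out.append(row_id) # all predicates passed -> store the row
    return out

def filter_rows_or(rows, predicates):
    """OR chain: store the row on the first predicate that is true."""
    out = []
    for row_id, row in enumerate(rows):
        for pred in predicates:
            if pred(row):
                out.append(row_id)
                break          # short-circuit: skip remaining predicates
    return out

rows = [{"a": 1, "b": 2}, {"a": 5, "b": 2}, {"a": 1, "b": 9}]
and_preds = [lambda r: r["a"] == 1, lambda r: r["b"] == 2]
print(filter_rows_and(rows, and_preds))  # → [0]
```

For selective filters this means most rows never reach the later predicates, which is exactly the effect the opcodes above encode at the machine-code level.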

Predicate reordering

Predicates are automatically sorted by estimated selectivity to maximize short-circuit benefits:

  • Equality comparisons (=, !=) are prioritized over range comparisons: uuid eq > long eq > ... > others > ... > long neq > uuid neq
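The reordering can be sketched as a sort over per-predicate selectivity ranks (the rank values and predicate representation below are illustrative, not QuestDB's actual estimates):

```python
# Hypothetical selectivity ranks inspired by the ordering described above:
# wide-type equality first (most selective), wide-type inequality last.
SELECTIVITY_RANK = {
    ("uuid", "eq"): 0,
    ("long", "eq"): 1,
    ("int",  "eq"): 2,
    ("int",  "range"): 3,
    ("long", "neq"): 4,
    ("uuid", "neq"): 5,
}

def reorder_predicates(preds):
    """Sort predicates so the most selective ones run (and short-circuit) first."""
    return sorted(preds, key=lambda p: SELECTIVITY_RANK.get((p["type"], p["op"]), 3))

preds = [
    {"col": "ts",  "type": "long", "op": "neq"},
    {"col": "id",  "type": "uuid", "op": "eq"},
    {"col": "cnt", "type": "int",  "op": "eq"},
]
print([p["col"] for p in reorder_predicates(preds)])  # → ['id', 'cnt', 'ts']
```

Running the cheapest-to-fail predicate first maximizes how often the short-circuit jump fires on the first check.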

Register/memory hoisting

Several caches reduce redundant memory loads inside the hot loop:

| Cache | Purpose | Capacity |
| --- | --- | --- |
| ColumnAddressCache | Pre-loads column base addresses before the loop | 8 columns |
| ConstantCache | Pre-loads constants into GP/XMM registers | 8 constants |
| ColumnValueCache | Caches column values within a single row iteration | 8 values |
| ConstantCacheYmm | Pre-broadcasts constants into YMM registers (SIMD) | 8 constants |
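A minimal sketch of the hoisting idea, in plain Python with hypothetical column names: both functions compute the same result, but the second resolves column "addresses" and materializes constants once before the hot loop, which is what the caches above do at the register level.

```python
# Illustrative only: models column-address and constant hoisting, not the
# actual JIT register allocation. Column names are hypothetical.

def filter_unhoisted(table, n_rows):
    out = []
    for i in range(n_rows):
        # every iteration re-resolves the columns and re-creates the constants
        if table["CounterID"][i] == 62 and table["IsRefresh"][i] == 0:
            out.append(i)
    return out

def filter_hoisted(table, n_rows):
    counter_col = table["CounterID"]      # column base "address" loaded once
    refresh_col = table["IsRefresh"]
    COUNTER_CONST, REFRESH_CONST = 62, 0  # constants materialized once
    out = []
    for i in range(n_rows):
        if counter_col[i] == COUNTER_CONST and refresh_col[i] == REFRESH_CONST:
            out.append(i)
    return out
```

In the generated assembly this corresponds to the `mov col_addr_*` and `mov const_*` instructions moving from the loop body to the prologue.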

SIMD scatter short-circuiting

In the AVX2 SIMD loop, the scatter phase (writing matching row IDs) is now skipped when the mask is zero (no matches in the batch). This avoids expensive compress/scatter operations when filtering is highly selective.
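The scatter short-circuit can be sketched in NumPy terms (batch width and data are illustrative; the real code operates on AVX2 lane masks):

```python
import numpy as np

# Sketch of the SIMD scatter short-circuit: process rows in fixed-width
# batches and skip the compress/scatter step when no lane matched.

def filter_batched(values, target, batch=8):
    out = []
    for start in range(0, len(values), batch):
        mask = values[start:start + batch] == target
        if not mask.any():          # mask is all-zero: skip the scatter phase
            continue
        out.extend(start + np.flatnonzero(mask))  # compress matching row IDs
    return out

vals = np.array([5, 1, 5, 5, 5, 5, 5, 5,   # batch 0: one match at index 1
                 5, 5, 5, 5, 5, 5, 5, 5])  # batch 1: no matches -> skipped
print(filter_batched(vals, 1))  # → [1]
```

With a highly selective filter most batches take the `continue` path, so the expensive compress/scatter work is paid only for batches that actually contain matches.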

Code generation improvements

  • Uses TEST instead of CMP for zero-checks where possible
  • Removes redundant AND/OR instructions (short-circuit jumps suffice)
  • Avoids redundant TEST/SETE sequences for equality comparisons by tracking comparison flags
  • Adds missing register XOR instructions to break false dependencies
  • Makes the row index store conditional: previously it executed even when the filter was false for the given row
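The conditional-store change in the last bullet can be sketched as follows (illustrative Python, contrasting the old branchless store-then-maybe-advance pattern with the new guarded store):

```python
# Illustrative model of the output-writing schemes; not the actual codegen.

def collect_branchless(rows_out, matches):
    """Old scheme: always store the row ID, conditionally advance the index."""
    out_idx = 0
    for row_id, ok in enumerate(matches):
        rows_out[out_idx] = row_id  # unconditional store on every iteration
        out_idx += 1 if ok else 0   # index moves only when the filter passed
    return out_idx

def collect_conditional(rows_out, matches):
    """New scheme: store only when the filter passed."""
    out_idx = 0
    for row_id, ok in enumerate(matches):
        if ok:                      # non-matching rows skip the store entirely
            rows_out[out_idx] = row_id
            out_idx += 1
    return out_idx
```

Both produce the same output prefix; the difference is that the new scheme avoids the memory write (and the `lea`/`and` setup around it) on the common non-matching path.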

Benchmarks

On my box (Ryzen 7900X, 64 GB RAM, Ubuntu 24.04), I got the following difference in ClickBench's Hot Run (patch on the left, master on the right):

(Screenshot: ClickBench Hot Run side-by-side results, 2025-12-22)

I also got the following results from JMH benchmarks.

SqlJitCompilerScalarBenchmark (AND/OR predicate chains)

| Benchmark | Before | After | Change |
| --- | --- | --- | --- |
| AND + SCALAR + EQ | 111.2 ms | 64.2 ms | +42% faster ✅ |
| AND + SCALAR + NEQ | 292.8 ms | 278.1 ms | +5% faster ✅ |
| OR + SCALAR + EQ | 110.4 ms | 85.3 ms | +23% faster ✅ |
| OR + SCALAR + NEQ | 295.2 ms | 247.6 ms | +16% faster ✅ |

SqlJitCompilerSimdBenchmark (single predicate)

SIMD mode:

| Column | Predicate | Before | After | Change |
| --- | --- | --- | --- | --- |
| i64 | EQ | 39.7 ms | 35.8 ms | +10% faster ✅ |
| i64 | NEQ | 228.3 ms | 228.2 ms | ~same |
| i32 | EQ | 39.0 ms | 18.6 ms | +52% faster ✅ |
| i32 | NEQ | 215.9 ms | 217.6 ms | ~same |
| i16 | EQ | 33.0 ms | 10.3 ms | +69% faster ✅ |
| i16 | NEQ | 213.5 ms | 208.3 ms | +2% faster ✅ |

Scalar mode:

| Column | Predicate | Before | After | Change |
| --- | --- | --- | --- | --- |
| i64 | EQ | 59.8 ms | 63.3 ms | -6% ⚠️ |
| i64 | NEQ | 244.5 ms | 231.1 ms | +5% faster ✅ |
| i32 | EQ | 47.4 ms | 39.6 ms | +17% faster ✅ |
| i32 | NEQ | 224.3 ms | 218.0 ms | +3% faster ✅ |
| i16 | EQ | 44.6 ms | 46.3 ms | -4% ⚠️ |
| i16 | NEQ | 220.6 ms | 212.7 ms | +4% faster ✅ |

Key takeaways:

  1. Biggest wins: SIMD mode with selective filters (EQ) - up to 69% faster for i16
  2. Scalar AND/OR chains: Significant improvements (23-42% faster) due to short-circuit evaluation
  3. NEQ predicates (low selectivity): Modest improvements since most rows match anyway
  4. Minor regressions: Scalar mode with simple EQ predicates on i64/i16 shows slight overhead (~5%), likely due to additional setup for hoisting that doesn't pay off for simple single-predicate filters. On the other hand, single-predicate filters are compiled with SIMD in most cases.

A before/after comparison of the generated code

Let's consider the following query on the hits table from ClickBench:

SELECT COUNT(*)
FROM hits
WHERE CounterID = 62
  AND EventTime >= '2013-07-01T00:00:00Z' AND EventTime <= '2013-07-31T23:59:59Z'
  AND IsRefresh = 0
  AND TraficSourceID IN (-1, 6)
  AND RefererHash = 3594120000172545465;

On my box it takes 41ms on master and 24ms with this patch.

Summary of the before/after assembly:

  1. ~44% fewer instructions in the hot loop
  2. No redundant memory loads (TraficSourceID was loaded twice before)
  3. Constants and column addresses hoisted out of the loop
  4. Short-circuit evaluation - most rows fail early and skip remaining predicates
  5. Conditional store - only writes to output when row matches (vs. always writing then conditionally incrementing)
  6. No SETE/AND chains - replaced by direct conditional jumps

For highly selective filters (few matches), the short-circuit behavior provides the biggest win since most rows exit after the first 1-2 predicate checks.

On master, the assembly generated by the JIT looks like this:

.section .text {#0}
L1:                                         ; L1: i64@rax Func(u64@rdi data_ptr, i64@rsi data_size, u64@rdx varsize_aux_ptr, i64@rcx vars_ptr, u64@r8 vars_size, i64@r9 rows_ptr, u64@[0] rows_size, i64@[8] rows_id_start_offset, i64@[16] <none>)
push rbx
push rbp
push r12
mov r11, qword ptr [rsp+32]
mov r12, qword ptr [rsp+40]
mov r8, 0                                   ; mov input_index, 0
mov rax, 0                                  ; mov output_index, 0
cmp r8, r11                                 ; cmp input_index, rows_size
jge L3                                      ; jge L3
L2:                                         ; L2:
mov rcx, qword ptr [rdi+32]                 ; mov column_address, qword ptr [data_ptr+32]
mov rdx, qword ptr [rcx+r8*8]               ; mov i64_mem, qword ptr [column_address+input_index*8]
movabs rcx, 3594120000172545465             ; movabs i64_imm 40457657, 3594120000172545465
xor r10, r10                                ; xor %13, %13
cmp rdx, rcx                                ; cmp i64_mem, i64_imm 40457657
rex sete r10b                               ; sete %13@gpb
mov rcx, qword ptr [rdi+24]                 ; mov column_address, qword ptr [data_ptr+24]
mov edx, dword ptr [rcx+r8*4]               ; mov i32_mem, dword ptr [column_address+input_index*4]
mov ecx, 6                                  ; mov i32_imm 6, 6
xor ebx, ebx                                ; xor %17, %17
cmp edx, ecx                                ; cmp i32_mem, i32_imm 6
sete bl                                     ; sete %17@gpb
mov rcx, qword ptr [rdi+24]                 ; mov column_address, qword ptr [data_ptr+24]
mov edx, dword ptr [rcx+r8*4]               ; mov i32_mem, dword ptr [column_address+input_index*4]
mov ecx, -1                                 ; mov i32_imm -1, -1
xor esi, esi                                ; xor %21, %21
cmp edx, ecx                                ; cmp i32_mem, i32_imm -1
rex sete sil                                ; sete %21@gpb
; int32_or_start
or esi, ebx                                 ; or %21, %17
; int32_or_stop
mov rcx, qword ptr [rdi+16]                 ; mov column_address, qword ptr [data_ptr+16]
movsx edx, byte ptr [rcx+r8]                ; movsx i8_mem, byte ptr [column_address+input_index]
mov ecx, 0                                  ; mov i32_imm 0, 0
xor ebp, ebp                                ; xor %25, %25
cmp edx, ecx                                ; cmp i8_mem, i32_imm 0
rex sete bpl                                ; sete %25@gpb
mov rcx, qword ptr [rdi]                    ; mov column_address, qword ptr [data_ptr]
mov ebx, dword ptr [rcx+r8*4]               ; mov i32_mem, dword ptr [column_address+input_index*4]
mov ecx, 62                                 ; mov i32_imm 62, 62
xor edx, edx                                ; xor %29, %29
cmp ebx, ecx                                ; cmp i32_mem, i32_imm 62
sete dl                                     ; sete %29@gpb
and edx, ebp                                ; and %29, %25
and edx, esi                                ; and %29, %21
and edx, r10d                               ; and %29, %13@gpd
lea rcx, [r8+r12]                           ; lea input_index_+_rows_id_start_offset, [input_index+rows_id_start_offset]
mov qword ptr [r9+rax*8], rcx               ; mov qword ptr [rows_ptr+output_index*8], input_index_+_rows_id_start_offset
and edx, 1                                  ; and %29, 1
add rax, rdx                                ; add output_index, %29@gpq
add r8, 1                                   ; add input_index, 1
cmp r8, r11                                 ; cmp input_index, rows_size
jl L2                                       ; jl L2
L3:                                         ; L3:
L0:                                         ; L0:
pop r12
pop rbp
pop rbx
ret

With the patch, it's the following:

.section .text {#0}
L1:                                         ; L1: i64@rax Func(u64@rdi data_ptr, i64@rsi data_size, u64@rdx varsize_aux_ptr, i64@rcx vars_ptr, u64@r8 vars_size, i64@r9 rows_ptr, u64@[0] rows_size, i64@[8] rows_id_start_offset, i64@[16] <none>)
push rbx
push rbp
push r12
push r13
push r14
push r15
mov r14, qword ptr [rsp+56]
mov r15, qword ptr [rsp+64]
mov rdx, 0                                  ; mov input_index, 0
mov rax, 0                                  ; mov output_index, 0
mov r13, qword ptr [rdi+32]                 ; mov col_addr_4, qword ptr [data_ptr+32]
mov r12, qword ptr [rdi]                    ; mov col_addr_0, qword ptr [data_ptr]
mov r11, qword ptr [rdi+16]                 ; mov col_addr_2, qword ptr [data_ptr+16]
mov r10, qword ptr [rdi+24]                 ; mov col_addr_3, qword ptr [data_ptr+24]
mov r8, 3594120000172545465                 ; mov const_3594120000172545465, 3594120000172545465
mov rdi, 62                                 ; mov const_62, 62
mov rsi, 0                                  ; mov const_0, 0
mov rbp, 6                                  ; mov const_6, 6
mov rbx, -1                                 ; mov const_-1, -1
cmp rdx, r14                                ; cmp input_index, rows_size
jge L3                                      ; jge L3
L2:                                         ; L2:
mov rcx, qword ptr [r13+rdx*8]              ; mov col_4_i64, qword ptr [col_addr_4+input_index*8]
cmp rcx, r8                                 ; cmp col_4_i64, const_3594120000172545465
jne L4                                      ; jne L4
mov ecx, dword ptr [r12+rdx*4]              ; mov col_0_i32, dword ptr [col_addr_0+input_index*4]
cmp ecx, edi                                ; cmp col_0_i32, const_62@gpd
jne L4                                      ; jne L4
movsx ecx, byte ptr [r11+rdx]               ; movsx col_2_i8, byte ptr [col_addr_2+input_index]
cmp ecx, esi                                ; cmp col_2_i8, const_0@gpd
jne L4                                      ; jne L4
mov ecx, dword ptr [r10+rdx*4]              ; mov col_3_i32, dword ptr [col_addr_3+input_index*4]
cmp ecx, ebp                                ; cmp col_3_i32, const_6@gpd
je L6                                       ; je L6
cmp ecx, ebx                                ; cmp col_3_i32, const_-1@gpd
jne L4                                      ; jne L4
L6:                                         ; L6:
L5:                                         ; L5:
lea rcx, [rdx+r15]                          ; lea input_index_+_rows_id_start_offset, [input_index+rows_id_start_offset]
mov qword ptr [r9+rax*8], rcx               ; mov qword ptr [rows_ptr+output_index*8], input_index_+_rows_id_start_offset
add rax, 1                                  ; add output_index, 1
L4:                                         ; L4:
add rdx, 1                                  ; add input_index, 1
cmp rdx, r14                                ; cmp input_index, rows_size
short jl L2                                 ; jl L2
L3:                                         ; L3:
L0:                                         ; L0:
pop r15
pop r14
pop r13
pop r12
pop rbp
pop rbx
ret

@puzpuzpuz puzpuzpuz self-assigned this Dec 19, 2025
@puzpuzpuz puzpuzpuz added SQL Issues or changes relating to SQL execution Performance Performance improvements labels Dec 19, 2025

@puzpuzpuz puzpuzpuz force-pushed the puzpuzpuz_jit_short_circuit branch from 1c6938c to 6d5f1e7 Compare December 19, 2025 17:47
@puzpuzpuz puzpuzpuz force-pushed the puzpuzpuz_jit_short_circuit branch from d767bb3 to 82569bd Compare December 22, 2025 12:50
@puzpuzpuz puzpuzpuz marked this pull request as ready for review December 22, 2025 18:51
GitHub Actions - Rebuild Native Libraries and others added 3 commits December 22, 2025 18:55
@glasstiger
Contributor

[PR Coverage check]

😍 pass : 230 / 239 (96.23%)

file detail

| path | covered lines | new lines | coverage |
| --- | --- | --- | --- |
| 🔵 io/questdb/jit/CompiledFilterIRSerializer.java | 230 | 239 | 96.23% |

@bluestreak01 bluestreak01 merged commit 9052c26 into master Dec 24, 2025
45 checks passed
@bluestreak01 bluestreak01 deleted the puzpuzpuz_jit_short_circuit branch December 24, 2025 02:53
