
perf(sql): optimize parquet decode rowgroup performance#6632

Merged
bluestreak01 merged 53 commits into master from parquet_decode_opt on Jan 28, 2026

Conversation

@kafka1991
Collaborator

@kafka1991 kafka1991 commented Jan 13, 2026

Optimizes Parquet partition read performance through four improvements:

1. Reduce memory allocations during Parquet page decoding

  • Eliminate intermediate copies for fixed-width column types during decodePage
  • Batch memory copy for definition-level bitpack decoding instead of per-value iteration
  • ~2x read performance improvement for affected workloads

2. Switch default Parquet compression to LZ4_RAW

  • Change from ZSTD(level=9) to LZ4_RAW
  • Faster decompression at the cost of slightly larger file size
  • ~2x additional read performance improvement compared with ZSTD

3. Use REQUIRED repetition for non-null Symbol columns

  • Skip definition-level decoding when Parquet Symbol column contains no nulls
  • ~10% read performance improvement

4. Skip redundant decode on ParquetBuffers cache hit

  • Benefits ASOF JOIN scenarios where timeFrame accesses the same Parquet frame
  • ~25% read performance improvement

Combined improvement: ~6x read performance vs master
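As a rough illustration of the first optimization, the idea behind the batched copy can be sketched as follows (hypothetical shapes, not the actual decodePage code): when the definition levels mark a run of `count` consecutive non-null values, the whole fixed-width run is copied in one go instead of one push per value.

```rust
// Hypothetical sketch of the batched-copy idea: a run of `count`
// non-null fixed-width values is appended with a single bulk copy
// rather than a per-value loop with a null-check branch each time.
fn copy_run(src: &[u8], dst: &mut Vec<u8>, row: usize, count: usize, width: usize) {
    let start = row * width;
    dst.extend_from_slice(&src[start..start + count * width]);
}

fn main() {
    let src: Vec<u8> = (0..32u8).collect(); // eight 4-byte values
    let mut dst = Vec::new();
    copy_run(&src, &mut dst, 1, 3, 4); // rows 1..4, width 4
    assert_eq!(dst, (4..16u8).collect::<Vec<u8>>());
}
```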

Test Environment: M4 Pro

Test query:

WITH
buy AS
( SELECT timestamp, symbol, price FROM trades_parquet WHERE side = 'buy' ),
sell AS
( SELECT timestamp, symbol, price FROM trades_parquet WHERE side = 'sell' )
SELECT
   buy.timestamp timestamp,
   buy.symbol symbol,
   (buy.price - sell.price) spread
FROM buy ASOF JOIN sell ON (symbol) WHERE buy.timestamp BETWEEN '2025-01-01T01:30:00Z' AND '2025-01-08T20:45:00Z';

Results:

| Partition Format | Parquet File Size (per partition) | Query Time |
| --- | --- | --- |
| Native | 63.9 MB | 300 ms |
| Master ZSTD(9) | 34.8 MB | ~17 s |
| Patch ZSTD(9) | 34.8 MB | ~8.11 s |
| Patch ZSTD(1) | 35.2 MB | ~7.92 s |
| Patch SNAPPY | 43.7 MB | ~8.21 s |
| Patch LZ4_RAW (REQUIRED symbol) | 43.9 MB | ~4.46 s |
| Patch LZ4_RAW (OPTIONAL symbol) | 43.9 MB | ~4.81 s |
| Patch LZ4_RAW (REQUIRED symbol) + cache hit opt | 43.9 MB | ~3.56 s |

Note: All performance numbers are from hot runs.

Final improvement: Master branch ~17 s → Patch ~2.3 s

Why LZ4_RAW as default?

LZ4_RAW produces ~26% larger files than ZSTD(9) (43.9 MB vs 34.8 MB), but delivers 2x faster read performance.

Notes & Future Work

  1. REQUIRED definition level potential: The REQUIRED repetition shows modest improvement for Symbol columns. Extending this to other non-null columns (e.g., price) could yield significant cumulative gains — worth exploring.

  2. Compression algorithm discussion: For more context on Parquet compression trade-offs, see Parquet Compression Benchmark.

  3. Broad applicability: These optimizations benefit all Parquet reads in QuestDB since they target the low-level decodePage path.
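Item 1 above can be made concrete with a minimal sketch (assumed shapes, not the actual reader code): a column whose schema repetition is REQUIRED carries no definition levels, so every row is known to be non-null up front and the definition-level pass can be skipped entirely.

```rust
// Minimal sketch of the REQUIRED fast path: with no definition levels
// to consult, the reader treats the page as dense and skips the
// per-row null check. Function and argument shapes are illustrative.
fn non_null_count(required: bool, def_levels: &[u8], num_rows: usize) -> usize {
    if required {
        num_rows // no def levels exist; all rows are present
    } else {
        def_levels.iter().filter(|&&d| d == 1).count()
    }
}

fn main() {
    assert_eq!(non_null_count(true, &[], 4), 4);
    assert_eq!(non_null_count(false, &[1, 0, 1, 1], 4), 3);
}
```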

@kafka1991 kafka1991 marked this pull request as draft January 13, 2026 05:47
@coderabbitai

coderabbitai bot commented Jan 13, 2026

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Walkthrough

Adds bulk-skip/advance APIs and sink-based streaming to Parquet decoders/slicers, bulk in-place writes for column sinks, required-field propagation in writers/schema, symbol nullability encoding, cache-hit avoidance for cached parquet frames, and numerous test updates and defaults changes (codec and cache sizes).

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Bitpacked / unpack / bitmap<br>`core/rust/qdbr/parquet2/src/encoding/bitpacked/decode.rs`, `core/rust/qdbr/parquet2/src/encoding/bitpacked/unpack.rs`, `core/rust/qdbr/parquet2/src/encoding/hybrid_rle/bitmap.rs` | Added `advance()` on decoders/bitmap iterator to skip items efficiently; added `#[inline]` to unpack implementations; small fast paths for 1–4 skips. |
| Allocator helper<br>`core/rust/qdbr/src/allocator.rs` | New `AcVecSetLen` trait and impl to allow unsafe in-place `set_len` for AcVec-backed Vecs. |
| Slicer & RLE refactor<br>`core/rust/qdbr/src/parquet_read/slicer/mod.rs`, `core/rust/qdbr/src/parquet_read/slicer/rle.rs` | Introduced `ByteSink` and `Converter<const N>` abstractions; added `next_into`/`next_slice_into` streaming APIs and skip/`RepeatN` for the RLE iterator; updated slicer implementations to stream into sinks and propagate results. |
| Column sinks (fixed/var)<br>`core/rust/qdbr/src/parquet_read/column_sink/fixed.rs`, `core/rust/qdbr/src/parquet_read/column_sink/var.rs`, `core/rust/qdbr/src/parquet_read/column_sink/mod.rs`, `core/rust/qdbr/src/parquet_read/column_sink/tests.rs` | Bulk push and bulk-null paths added (0–4 fast paths plus unsafe bulk using `AcVecSetLen`); added tests for sinks. |
| Decode path & boolean aggregation<br>`core/rust/qdbr/src/parquet_read/decode.rs` | Use `DaysToMillisConverter` via slicer generic; replaced `slicer.next()` with `slicer.next_into()` and added boolean run grouping to flush runs in blocks. |
| Slicer tests<br>`core/rust/qdbr/src/parquet_read/slicer/tests.rs` | Large test suite added covering slicers, skip/next semantics, converters, and sink paths. |
| Column writing / bulk nulls / converters<br>`core/rust/qdbr/src/parquet_write/*` (`array.rs`, `binary.rs`, `boolean.rs`, `fixed_len_bytes.rs`, `primitive.rs`, `string.rs`, `varchar.rs`, `symbol.rs`, `util.rs`) | `build_plain_page()` signature now takes `required: bool`; many writers pass the new flag; bulk null write paths added (match 0–4 fast paths plus unsafe bulk using `AcVecSetLen`); `symbol_to_pages` gains a `required` parameter; descriptor `max_def_level` set from `required`. |
| Schema / Column struct<br>`core/rust/qdbr/src/parquet_write/schema.rs` | `Column` gains `required: bool`; `column_type_to_parquet_type` takes `required` to set repetition/definition levels. |
| Rust decoding small changes<br>`core/rust/qdbr/parquet2/src/encoding/hybrid_rle/encoder.rs`, `core/rust/qdbr/parquet2/src/deserialize/utils.rs` | Minor header-bit adjustment and formatting; no behaviorally disruptive changes beyond the header-bit setting. |
| Error macro<br>`core/rust/qdbr/src/parquet/error.rs` | `fmt_err!` macro extended with a pattern supporting variants that take an inner expression. |
| Page frame cache / cache hit<br>`core/src/main/java/io/questdb/cairo/sql/PageFrameMemoryPool.java` | Added `cacheHit` flag in `ParquetBuffers` and logic to skip decode when the cache is reused. |
| Symbol encoding & partition descriptors<br>`core/src/main/java/io/questdb/cairo/O3PartitionJob.java`, `core/src/main/java/io/questdb/cairo/TableWriter.java`, `core/src/main/java/io/questdb/griffin/engine/table/parquet/PartitionEncoder.java`, `core/src/main/java/io/questdb/griffin/engine/table/parquet/PartitionDescriptor.java`, `core/src/main/java/io/questdb/griffin/engine/table/parquet/OwnedMemoryPartitionDescriptor.java` | Symbol column type gets high-bit encoding when the symbol map has no nulls; `PartitionDescriptor` packing adjusted to avoid sign extension; symbol cleanup checks use `ColumnType.tagOf`. |
| Configuration & defaults<br>`core/src/main/java/io/questdb/PropServerConfiguration.java`, `core/src/main/java/io/questdb/cairo/DefaultCairoConfiguration.java`, `core/src/main/resources/io/questdb/site/conf/server.conf`, `pkg/ami/marketplace/assets/server.conf` | Default Parquet compression: ZSTD → LZ4_RAW; partition encoder compression level from 9 → 0; frame cache capacity 3 → 8; test expectations updated. |
| Java tests & expectations<br>`core/src/test/java/...` (multiple) | Updated expected Parquet payload sizes and configuration-derived expectations; added/modified Parquet-related unit tests (including new symbol tests). |

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related issues

Possibly related PRs

Suggested labels

SQL

Suggested reviewers

  • bluestreak01
  • puzpuzpuz
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 47.08%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: optimizing Parquet decode rowgroup performance through multiple targeted improvements. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, covering all four optimization areas with detailed explanations, performance metrics, and test results. |




@kafka1991 kafka1991 changed the title from "perf(sql): optimize parquet decode perfmance" to "perf(sql): optimize parquet decode performance" on Jan 13, 2026
@kafka1991 kafka1991 marked this pull request as ready for review January 14, 2026 03:28
@kafka1991 kafka1991 changed the title from "perf(sql): optimize parquet decode performance" to "perf(sql): optimize parquet decode rowgroup performance" on Jan 14, 2026
@kafka1991 kafka1991 added the Performance (Performance improvements) label on Jan 14, 2026
@bluestreak01
Member

@kafka1991 can we leverage bloom filters for these queries when those are present?

@bluestreak01
Member

also, lets compare hot perf to duck

@kafka1991
Collaborator Author

> also, lets compare hot perf to duck

hey @bluestreak01 This optimization is only one part of the picture. My idea is to wait until we’ve finished all our internal optimizations before doing a performance comparison with duck. WDYT?

@bluestreak01
Member

> also, lets compare hot perf to duck
>
> hey @bluestreak01 This optimization is only one part of the picture. My idea is to wait until we’ve finished all our internal optimizations before doing a performance comparison with duck. WDYT?

sure


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
core/src/main/java/io/questdb/cairo/DefaultCairoConfiguration.java (1)

689-696: Update getParquetExportCompressionLevel() to return 0 for LZ4_RAW.

The method returns 9 unconditionally, but the codec is now LZ4_RAW, which does not support compression levels (unlike ZSTD). This is inconsistent with PropServerConfiguration, which correctly defaults to 0 for non-ZSTD codecs, and with ExportModel.java, which explicitly treats LZ4_RAW as a codec that doesn't use compression level. The test expectations in ServerMainTest.java also confirm the compression level should be 0 for LZ4_RAW export.

Change line 695 to match the conditional logic used in PropServerConfiguration:

return parquetExportCompressionCodec == ParquetCompression.COMPRESSION_ZSTD ? 9 : 0;

Or, more directly, return 0 since the codec is fixed to LZ4_RAW at this configuration level.

core/rust/qdbr/src/parquet_write/fixed_len_bytes.rs (1)

68-82: Pass column.required through to bytes_to_page to avoid schema-encoding mismatch.

The Column.required field is available and correctly used for Symbol columns (line 765 in file.rs), but is not propagated to chunk_to_primitive_page or bytes_to_page. Since Long128, Uuid, and Long256 columns can be marked as required in the schema (via column_type_to_parquet_type), hardcoding required=false in build_plain_page creates a mismatch: the schema descriptor will advertise Repetition::Required while the page encoding includes definition levels and required=false. Update chunk_to_primitive_page to pass column.required, and refactor bytes_to_page to accept and use this flag to skip definition level encoding when appropriate.

core/rust/qdbr/src/parquet_write/varchar.rs (1)

232-277: Guard against usize overflow before unsafe bulk writes.

At Line 266 and Line 276, count * ENTRY_SIZE and base + ... can overflow, which would under-allocate and then write past the buffer inside the unsafe block. Add checked arithmetic and fail early.

✅ Suggested fix
         _ => {
             const ENTRY_SIZE: usize = 16; // 10 bytes header + 6 bytes offset
             let offset = data_mem.len();
             assert!(offset < VARCHAR_MAX_COLUMN_SIZE);

             let mut null_entry = [0u8; ENTRY_SIZE];
             null_entry[..10].copy_from_slice(&VARCHAR_HEADER_FLAG_NULL);
             null_entry[10..12].copy_from_slice(&(offset as u16).to_le_bytes());
             null_entry[12..16].copy_from_slice(&((offset >> 16) as u32).to_le_bytes());

             let base = aux_mem.len();
-            aux_mem.reserve(count * ENTRY_SIZE)?;
+            let total = count
+                .checked_mul(ENTRY_SIZE)
+                .ok_or_else(|| fmt_err!(OutOfBounds, "varchar null batch too large"))?;
+            let new_len = base
+                .checked_add(total)
+                .ok_or_else(|| fmt_err!(OutOfBounds, "varchar null batch too large"))?;
+            aux_mem.reserve(total)?;
             unsafe {
                 let ptr = aux_mem.as_mut_ptr().add(base);
                 for i in 0..count {
                     std::ptr::copy_nonoverlapping(
                         null_entry.as_ptr(),
                         ptr.add(i * ENTRY_SIZE),
                         ENTRY_SIZE,
                     );
                 }
-                aux_mem.set_len(base + count * ENTRY_SIZE);
+                aux_mem.set_len(new_len);
             }
             Ok(())
         }
core/rust/qdbr/src/parquet_write/symbol.rs (1)

159-221: Required fast‑path must exclude column_top/nulls.
If required is true while column_top > 0 or any value is -1, def levels are skipped and null_count remains 0, which corrupts page encoding (and non_null_len). Ensure required is only true when there are truly no nulls, or fall back to the optional path in those cases.

🐛 Suggested fix
-    let mut data_buffer = vec![];
+    let mut data_buffer = vec![];
+    let required_no_nulls = required && column_top == 0;
 
-    let definition_levels_byte_length = if required {
+    let definition_levels_byte_length = if required_no_nulls {
         0
     } else {
         // TODO(amunra): Optimize if there's no column top.
         let deflevels_iter = (0..num_rows).map(|i| {
             if i < column_top {
                 false
             } else {
                 let key = column_values[i - column_top];
                 // negative denotes a null value
                 if key > -1 {
                     true
                 } else {
                     null_count += 1;
                     false
                 }
             }
         });
         encode_primitive_def_levels(&mut data_buffer, deflevels_iter, num_rows, options.version)?;
         data_buffer.len()
     };
@@
-    let data_page = build_plain_page(
+    let data_page = build_plain_page(
         data_buffer,
         num_rows,
         null_count,
         definition_levels_byte_length,
         if options.write_statistics {
             Some(stats.into_parquet_stats(null_count))
         } else {
             None
         },
         primitive_type,
         options,
         Encoding::RleDictionary,
-        required,
+        required_no_nulls,
     )?;
core/src/main/java/io/questdb/cairo/sql/PageFrameMemoryPool.java (1)

126-135: Guard cache against addressCache changes: buffers with the same frameIndex across different parquet files will cause stale data reuse.

The cacheHit mechanism is keyed only by frameIndex. When PageFrameMemoryPool.of() switches to a new PageFrameAddressCache, it frees the parquetDecoder but does not clear cachedParquetBuffers. If the new addressCache has a frame with the same frameIndex as one cached from the previous addressCache, decode() will be skipped due to the cache hit, reusing buffers that contain data from the old parquet file.

Pools are reused across different addressCache instances (visible in TimeFrameCursorImpl, LatestByTask, PageFrameReduceTask), making frameIndex collisions possible. Add releaseParquetBuffers() call in PageFrameMemoryPool.of() to clear stale buffers when switching addressCache, or call it when parquetDecoder switches files in openParquet().

🧹 Suggested fix
     public void of(PageFrameAddressCache addressCache) {
         this.addressCache = addressCache;
         frameMemory.clear();
+        releaseParquetBuffers();
         Misc.free(parquetDecoder);
     }
core/src/main/java/io/questdb/cairo/TableWriter.java (1)

1469-1502: Mask the high-bit nullability flag before any ColumnType operations—this is a critical bug.

Lines 1470–1473 set encodeColumnType |= 1 << 31 for symbol columns with no nulls, making the stored type negative. However, PartitionDecoder.getColumnType() (line 350–351) returns this raw int without masking, and the decoder immediately passes it to ColumnType.isSymbol(), ColumnType.isUndefined(), etc. (lines 298–303) which use direct equality checks.

When the high bit is set, these equality checks fail. For example, SYMBOL (20) | (1 << 31) = -2147483628, and -2147483628 == 20 is false, so the decoder misidentifies symbol columns with no nulls as undefined or fails type validation.

The fix requires masking the high bit before any ColumnType.*() call. Apply columnType & 0x7FFFFFFF in PartitionDecoder.getColumnType() or immediately after calling it in the decoder loop.
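The round-trip of that high-bit flag can be sketched in a few lines (illustrative constant name; this mirrors the masking the review asks for, not the actual QuestDB code):

```rust
// Illustrative sketch of the sign-bit nullability flag: the writer ORs
// the top bit into the column type tag, so the reader must mask it off
// before comparing against plain type tags with equality checks.
const REQUIRED_FLAG: i32 = i32::MIN; // the sign bit, i.e. 1 << 31

fn encode(column_type: i32, no_nulls: bool) -> i32 {
    if no_nulls { column_type | REQUIRED_FLAG } else { column_type }
}

fn decode(encoded: i32) -> (i32, bool) {
    // mask the high bit; negativity alone signals "required"
    (encoded & !REQUIRED_FLAG, encoded < 0)
}

fn main() {
    let encoded = encode(20, true); // 20 stands in for the SYMBOL tag
    assert!(encoded < 0);           // naive `== 20` checks now fail
    assert_eq!(decode(encoded), (20, true));
    assert_eq!(decode(encode(20, false)), (20, false));
}
```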

🤖 Fix all issues with AI agents
In `@core/rust/qdbr/parquet2/src/encoding/hybrid_rle/bitmap.rs`:
- Around line 86-101: The branch that handles advancing the mask uses `if count
<= bits_left_in_byte` which fails to advance the byte when `count ==
bits_left_in_byte`; change the condition to `if count < bits_left_in_byte` so an
exact-byte consume falls through to the else branch that advances `self.iter`,
updates `self.current_byte`, and sets `self.mask` (the logic around
`self.mask.rotate_left`, `self.iter`, `self.current_byte`, and `self.mask =
1u8.rotate_left(final_bits as u32)` should remain as-is in the else path).

In `@core/rust/qdbr/src/parquet_read/column_sink/fixed.rs`:
- Around line 67-96: The unsafe bulk-write in push_nulls (inside the _ arm) can
exceed the vector capacity because there is no runtime check before calling
ptr::copy_nonoverlapping and AcVecSetLen::set_len; add a guard to ensure
capacity: either call self.buffers.data_vec.reserve(total_bytes) (or
reserve_exact) right before the unsafe block or add a debug_assert!(base +
total_bytes <= self.buffers.data_vec.capacity()) to validate the invariant;
update the same pattern wherever similar unsafe bulk writes occur (e.g., other
methods that use self.buffers.data_vec, AcVecSetLen::set_len, null_value, and N)
so the pointer writes are guaranteed safe at the write site.
- Around line 344-357: In push_int96_as_epoch_nanos, fix endianness by decoding
the 8-byte nanoseconds and 4-byte Julian day explicitly as little-endian instead
of using ptr::read_unaligned (which preserves host endianness); copy bytes[0..8]
into a [u8;8] and use u64::from_le_bytes for nanos, copy bytes[8..12] into a
[u8;4] and use u32::from_le_bytes for julian_date, then compute days_since_epoch
using JULIAN_UNIX_EPOCH_OFFSET and NANOS_PER_DAY and extend data_vec with
nanos_since_epoch.to_le_bytes() as before.

In `@core/rust/qdbr/src/parquet_read/column_sink/var.rs`:
- Around line 199-214: The unsafe write into self.buffers.data_vec using
ptr::write_bytes can overflow because capacity isn't ensured; before calling
ptr::write_bytes and set_len in the match arm (where ELEM = size_of::<i32>() and
base = self.buffers.data_vec.len()), reserve enough space for base + count *
ELEM (e.g., call reserve or reserve_exact on self.buffers.data_vec for count *
ELEM) so the pointer write is safe, then perform the ptr::write_bytes and
set_len, and finally call write_offset_sequence(&mut self.buffers.aux_vec, base
+ ELEM, ELEM, count); ensure you reference data_vec, ELEM, base, and
write_offset_sequence when applying the fix.
- Around line 311-325: The null-fill branch in var.rs writes count * ELEM bytes
into self.buffers.data_vec without ensuring capacity, risking a buffer overflow;
before the unsafe ptr::write_bytes and set_len on data_vec (inside
BinaryColumnSink::push_nulls / this match arm), call reserve/reserve_exact to
allocate at least count * ELEM additional bytes (or ensure capacity >= base +
count * ELEM), and likewise ensure aux_vec has enough capacity for
write_offset_sequence; then perform the unsafe write_bytes and set_len as
before.

In `@core/rust/qdbr/src/parquet_read/slicer/mod.rs`:
- Around line 49-89: The unsafe ByteSink::extend_from_slice implementations (for
AcVec<u8> and Vec<u8>) assume reserved capacity but append_array currently
writes shape and padding via the unsafe extend_from_slice before reserving for
them (buffers.data_vec.reserve(slicer.data_size()) only covers element values),
so either reserve the extra bytes up-front or use the safe path; modify
append_array to call data_mem.reserve(...) (or buffers.data_vec.reserve(...)
including shape/padding size) before any calls to extend_from_slice that write
shape or padding metadata, or switch those initial writes to
extend_from_slice_safe() to ensure capacity checks; reference the ByteSink impls
(extend_from_slice / extend_from_slice_safe) and the append_array call site to
make this change.

In `@core/rust/qdbr/src/parquet_write/array.rs`:
- Around line 745-785: The unsafe bulk write can overflow because count *
ENTRY_SIZE and base + total_bytes are used unchecked; before calling
aux_mem.reserve(...) and AcVecSetLen::set_len(...) in the default branch,
perform checked arithmetic (use checked_mul for count and ENTRY_SIZE and
checked_add for base and total_bytes), return an Err on overflow, and only then
proceed with reserve and the unsafe copy; reference the identifiers count,
ENTRY_SIZE, total_bytes, base, aux_mem, data_mem, append_array_null, and
AcVecSetLen::set_len when applying these checks.

In `@core/rust/qdbr/src/parquet_write/schema.rs`:
- Around line 26-35: The current blanket is_notnull_type check (matching
ColumnTypeTag::Boolean | Byte | Short | Char) incorrectly forces
Repetition::Required and drops definition levels; change the logic so those
types are only treated as Required when the incoming required parameter is true
(i.e., remove or gate the is_notnull_type condition), so compute repetition from
designated_timestamp || required (and not from type alone) when setting the
repetition variable used for Parquet schema generation; update any references to
is_notnull_type, repetition, designated_timestamp, and required in this function
to reflect this corrected gating.

In `@core/src/main/resources/io/questdb/site/conf/server.conf`:
- Around line 608-612: The server.conf default for
cairo.partition.encoder.parquet.compression.codec (currently commented as
LZ4_RAW) is inconsistent with the Java default returned by
DefaultCairoConfiguration.getPartitionEncoderParquetCompressionCodec()
(ParquetCompression.COMPRESSION_ZSTD); update the commented default in
server.conf to match the Java default (set to ZSTD and ensure the allowed codec
list/comments include ZSTD), or alternatively change the Java default to
LZ4_RAW—pick one consistent canonical default and make the config key
(cairo.partition.encoder.parquet.compression.codec) and
DefaultCairoConfiguration.getPartitionEncoderParquetCompressionCodec() agree.
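The bitmap `advance` fix above (strict `<` instead of `<=`) can be illustrated with a standalone bit-cursor sketch (hypothetical struct, not the parquet2 code): when the skip count exactly consumes the bits left in the current byte, the cursor must move to the next byte rather than leave a stale position.

```rust
// Hypothetical bit-cursor demonstrating the off-by-one: with `count ==
// bits_left`, the strict `<` sends us through the byte-advancing branch,
// so an exact-byte consume lands cleanly at the start of the next byte.
struct BitCursor {
    byte_idx: usize,
    bit: u32, // bits already consumed in the current byte (0..8)
}

impl BitCursor {
    fn advance(&mut self, count: u32) {
        let bits_left = 8 - self.bit;
        if count < bits_left {
            // stays within the current byte
            self.bit += count;
        } else {
            // exact or larger consume: cross into the next byte(s)
            let rem = count - bits_left;
            self.byte_idx += 1 + (rem / 8) as usize;
            self.bit = rem % 8;
        }
    }
}

fn main() {
    let mut c = BitCursor { byte_idx: 0, bit: 5 };
    c.advance(3); // exactly consumes the remaining 3 bits of byte 0
    assert_eq!((c.byte_idx, c.bit), (1, 0));
}
```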
🧹 Nitpick comments (7)
core/src/main/java/io/questdb/cutlass/parquet/CopyExportRequestTask.java (1)

396-407: Inconsistent column type retrieval in non-symbol branch.

Line 396 extracts columnType into a final local variable, but line 406 re-fetches it via metadata.getColumnType(i) instead of reusing columnType. This is functionally correct but inconsistent with the symbol branch (line 400) which uses the local variable.

♻️ Suggested fix for consistency
                 } else {
-                    columnMetadata.add((long) metadata.getWriterIndex(i) << 32 | metadata.getColumnType(i));
+                    columnMetadata.add((long) metadata.getWriterIndex(i) << 32 | columnType);
                 }
core/rust/qdbr/parquet2/src/encoding/bitpacked/decode.rs (1)

62-95: LGTM — minor optimization opportunity.

The advance method correctly maintains invariants:

  • remaining is decremented first, ensuring next() won't read stale data if packs are exhausted
  • Pack boundary crossing is handled correctly
  • The edge case where packed.next() returns None is safe since remaining would be zero
♻️ Optional: Use `nth()` instead of loop for skipping packs
-            let packs_to_skip = to_skip / T::Unpacked::LENGTH;
-            for _ in 0..packs_to_skip {
-                self.packed.next();
-            }
+            let packs_to_skip = to_skip / T::Unpacked::LENGTH;
+            if packs_to_skip > 0 {
+                self.packed.nth(packs_to_skip - 1);
+            }

Using nth(n-1) skips n elements more efficiently than a loop, as Chunks iterator may optimize this.
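The `nth()` semantics this relies on: `iter.nth(k)` consumes `k + 1` elements and returns the last of them, so `nth(n - 1)` skips exactly `n`:

```rust
fn main() {
    let mut it = 0..10;
    // nth(2) consumes three elements (0, 1, 2) and returns the last one
    assert_eq!(it.nth(2), Some(2));
    // the iterator resumes right after the skipped prefix
    assert_eq!(it.next(), Some(3));
}
```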

core/src/main/java/io/questdb/cairo/O3PartitionJob.java (2)

1774-1781: Use a named flag for required symbol encoding.
The sign-bit mask in Line 1775 is correct but opaque; a named constant makes the intent explicit and reduces the chance of accidental misuse.

♻️ Suggested tweak
-                        if (!symbolMapWriter.getNullFlag()) {
-                            encodeColumnType |= 1 << 31;
-                        }
+                        if (!symbolMapWriter.getNullFlag()) {
+                            encodeColumnType |= Integer.MIN_VALUE; // required (sign bit)
+                        }

2496-2511: Treat ENOENT as benign during phantom-dir cleanup.
If the directory is already gone, logging an error is noisy. Consider skipping the error log when errno indicates “not found.”

core/rust/qdbr/src/parquet_write/schema.rs (1)

302-339: Make the required‑bit mask explicit.
column_type < 0 and 0x7FFFFFFF are magic values; a named mask improves readability and reduces sign‑bit confusion.

♻️ Suggested refactor
-        let required = column_type < 0;
-        let column_type: ColumnType = (column_type & 0x7FFFFFFF).try_into()?;
+        const REQUIRED_FLAG: i32 = 0x8000_0000;
+        let required = (column_type & REQUIRED_FLAG) != 0;
+        let column_type: ColumnType = (column_type & !REQUIRED_FLAG).try_into()?;
core/src/main/java/io/questdb/griffin/engine/table/parquet/PartitionEncoder.java (1)

164-175: Prefer a named constant for the symbol‑required flag.

Using 1 << 31 is clear for us now but easy to miss later. A named constant keeps intent explicit and avoids duplication if reused elsewhere.

♻️ Suggested tweak
 public class PartitionEncoder {
+    private static final int SYMBOL_REQUIRED_FLAG = 1 << 31;
     ...
-                    if (!symbolMapReader.containsNullValue()) {
-                        encodeColumnType |= 1 << 31;
-                    }
+                    if (!symbolMapReader.containsNullValue()) {
+                        encodeColumnType |= SYMBOL_REQUIRED_FLAG;
+                    }
core/rust/qdbr/src/parquet_read/slicer/mod.rs (1)

12-31: SliceSink overwrites instead of appending.

ByteSink::extend_from_slice implies append semantics. The current SliceSink always writes at offset 0, so any converter that writes in multiple chunks will clobber prior bytes. Adding a cursor keeps behavior consistent with other ByteSink impls.

♻️ Proposed fix
-pub struct SliceSink<'a>(pub &'a mut [u8]);
+pub struct SliceSink<'a> {
+    buf: &'a mut [u8],
+    pos: usize,
+}
 
 impl ByteSink for SliceSink<'_> {
     #[inline]
     fn extend_from_slice(&mut self, data: &[u8]) -> ParquetResult<()> {
-        self.0[..data.len()].copy_from_slice(data);
+        let end = self.pos + data.len();
+        self.buf[self.pos..end].copy_from_slice(data);
+        self.pos = end;
         Ok(())
     }
 
     #[inline]
     fn extend_from_slice_safe(&mut self, data: &[u8]) -> ParquetResult<()> {
-        self.0[..data.len()].copy_from_slice(data);
+        let end = self.pos + data.len();
+        if end > self.buf.len() {
+            return Err(fmt_err!(Layout, "SliceSink overflow"));
+        }
+        self.buf[self.pos..end].copy_from_slice(data);
+        self.pos = end;
         Ok(())
     }
 }
@@
-        C::convert(slice, &mut SliceSink(&mut self.buffer));
+        C::convert(slice, &mut SliceSink { buf: &mut self.buffer, pos: 0 });

@kafka1991
Collaborator Author

@CodeRabbit review again

@coderabbitai

coderabbitai bot commented Jan 26, 2026

@kafka1991: I'll perform another comprehensive review of the PR, focusing on the Parquet optimization changes including the memory management improvements, cache hit optimization, and the REQUIRED repetition handling for non-null Symbol columns.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

GitHub Actions - Rebuild Native Libraries and others added 2 commits January 26, 2026 02:01
@kafka1991
Collaborator Author

kafka1991 commented Jan 26, 2026

Note: Some of the code is covered by tests in #6675

@kafka1991
Collaborator Author

@coderabbitai I addressed your comments and added tests, pls review again

@coderabbitai

coderabbitai bot commented Jan 26, 2026

@kafka1991: Thank you for addressing the comments and adding tests! I'll perform another comprehensive review of the PR, focusing on the changes you've made and the new test coverage.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
core/src/main/java/io/questdb/cairo/TableWriter.java (1)

1469-1502: Symbol type classification fails in PartitionDecoder for non-nullable symbols.

The encoding in TableWriter.java:1472 sets Integer.MIN_VALUE on symbol types lacking null values. However, PartitionDecoder.copyToSansUnsupported() calls ColumnType.isSymbol(columnType) without masking (line 303), and ColumnType.isSymbol() performs direct equality: columnType == SYMBOL. When the flag is set, a symbol becomes negative (e.g., 0x80000000 | 12 = -2147483636), which fails the equality check and returns false. Lines 298 (isUndefined) and 319 (isTimestamp) have the same issue.

Apply the pattern shown in OwnedMemoryPartitionDescriptor.java:47: call ColumnType.tagOf(columnType) to strip the flag before any ColumnType API check:

// In PartitionDecoder.copyToSansUnsupported() around line 296-303
final int columnType = getColumnType(i);
final int taggedType = ColumnType.tagOf(columnType);

if (ColumnType.isUndefined(taggedType)) { ... }
if (ColumnType.isSymbol(taggedType)) { ... }
// and line 319: if (ColumnType.isTimestamp(taggedType) && ...)

This affects the TableWriter.convertPartitionNativeToParquet() flow at line 1683 where metadata is read from Parquet.

core/src/main/java/io/questdb/cairo/DefaultCairoConfiguration.java (1)

688-695: Fix compression level for LZ4_RAW codec.

Returning 9 is incorrect. LZ4_RAW does not support compression levels in the Parquet specification. The codebase itself confirms this: PropServerConfiguration uses level 0 as the default for LZ4_RAW (only ZSTD gets 9), ExportModel explicitly documents that LZ4_RAW doesn't use compression levels, and test expectations set this to 0. The partition encoder path correctly returns 0 for the same codec. Change this to return 0 to align with the codec and other configuration paths.

Proposed change
     @Override
     public int getParquetExportCompressionLevel() {
-        return 9;
+        return 0;
     }
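The level convention described above can be sketched as a codec-to-level mapping. This is a hypothetical helper for illustration only; codec names are plain strings here, not the real configuration enum:

```java
public class CompressionLevelSketch {
    // Only ZSTD takes a meaningful level; LZ4_RAW (and SNAPPY, etc.) have none,
    // so their configured level should be 0, matching PropServerConfiguration.
    static int defaultCompressionLevel(String codec) {
        if ("ZSTD".equals(codec)) {
            return 9;
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(defaultCompressionLevel("ZSTD"));    // 9
        System.out.println(defaultCompressionLevel("LZ4_RAW")); // 0
    }
}
```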
🤖 Fix all issues with AI agents
In `@core/src/main/java/io/questdb/cutlass/parquet/CopyExportRequestTask.java`:
- Around lines 398-408: In CopyExportRequestTask, inside the block handling symbol columns (where ColumnType.isSymbol(columnType) is true), add a null check or assert for the symbolTable returned by pageFrameCursor.getSymbolTable(i) (e.g., assert symbolTable != null : "Symbol table expected for symbol column " + i) before calling symbolTable.containsNullValue(), to avoid an NPE. Also replace the magic bit mask 1 << 31 with a named constant (e.g., SYMBOL_NON_NULL_FLAG) and use that constant when setting symbolColumnType, so the encoding is clear and maintainable.
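The suggested change might look like the sketch below. The nested SymbolTable stub and the method name are hypothetical stand-ins for the real QuestDB types, kept minimal so the sketch is self-contained:

```java
public class SymbolEncodeSketch {
    // Minimal stand-in for QuestDB's SymbolTable; only the method used here.
    interface SymbolTable {
        boolean containsNullValue();
    }

    // Named constant instead of the magic 1 << 31, as the review suggests.
    static final int SYMBOL_NON_NULL_FLAG = 1 << 31;

    static int encodeSymbolColumnType(int columnType, SymbolTable symbolTable) {
        assert symbolTable != null : "Symbol table expected for symbol column";
        return symbolTable.containsNullValue()
                ? columnType
                : columnType | SYMBOL_NON_NULL_FLAG;
    }

    public static void main(String[] args) {
        SymbolTable noNulls = () -> false;
        // 12 | (1 << 31) == -2147483636
        System.out.println(encodeSymbolColumnType(12, noNulls));
    }
}
```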
🧹 Nitpick comments (3)
core/rust/qdbr/src/parquet_read/column_sink/fixed.rs (1)

136-148: Consider adding debug_assert for consistency with bulk paths.

The single-element push() performs an unsafe write without a capacity check, relying on the caller having called reserve(). While this follows the same pattern as other single-push methods and is functionally correct when the contract is honored, adding a debug_assert! would provide consistency with the bulk paths and catch contract violations during development.

♻️ Optional: Add debug_assert for consistency
     fn push(&mut self) -> ParquetResult<()> {
         let slice = self.slicer.next();
         let base = self.buffers.data_vec.len();
+        debug_assert!(base + N <= self.buffers.data_vec.capacity());

         unsafe {
             let ptr = self.buffers.data_vec.as_mut_ptr().add(base);
core/rust/qdbr/src/parquet_read/slicer/tests.rs (1)

499-502: Minor: Consider bounds checking in test helper.

The get_dict_value method uses direct indexing which will panic on out-of-bounds. This is acceptable for test code as it will catch bugs, but you could optionally use .get() with a more descriptive panic message for easier debugging.
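The bounds-checked variant with a descriptive message could look like this Java sketch of the idea (in the Rust test helper itself, the equivalent would be `.get()` with `expect`):

```java
import java.util.List;

public class DictLookupSketch {
    // Explicit bounds check with a descriptive message,
    // instead of a bare out-of-bounds panic on direct indexing.
    static String getDictValue(List<String> dict, int key) {
        if (key < 0 || key >= dict.size()) {
            throw new IllegalStateException(
                    "dictionary key " + key + " out of bounds (dict size " + dict.size() + ")");
        }
        return dict.get(key);
    }

    public static void main(String[] args) {
        System.out.println(getDictValue(List.of("buy", "sell"), 1)); // sell
    }
}
```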

core/rust/qdbr/src/parquet_read/slicer/mod.rs (1)

619-666: Consider removing unused error field from ValueConvertSlicer.

The error field is initialized to Ok(()) and never modified in the current implementation. The result() method includes it in the chain, but it always returns Ok(()). Either remove it or document the intended future use.

♻️ Optional: Remove unused error field
 pub struct ValueConvertSlicer<const N: usize, T: DataPageSlicer, C: Converter<N>> {
     inner_slicer: T,
-    error: ParquetResult<()>,
     buffer: [u8; N],
     _converter: std::marker::PhantomData<C>,
 }

 // ...

     fn result(&self) -> ParquetResult<()> {
-        self.error.clone().or(self.inner_slicer.result())
+        self.inner_slicer.result()
     }
 }

 impl<const N: usize, T: DataPageSlicer, C: Converter<N>> ValueConvertSlicer<N, T, C> {
     pub fn new(inner_slicer: T) -> Self {
         Self {
             inner_slicer,
-            error: Ok(()),
             buffer: [0; N],
             _converter: std::marker::PhantomData,
         }
     }
 }

@glasstiger
Contributor

[PR Coverage check]

😍 pass : 1376 / 1594 (86.32%)

file detail

| path | covered lines | new lines | coverage |
|------|--------------:|----------:|---------:|
| 🔵 qdbr/parquet2/src/encoding/hybrid_rle/bitmap.rs | 0 | 39 | 00.00% |
| 🔵 qdbr/parquet2/src/deserialize/utils.rs | 0 | 3 | 00.00% |
| 🔵 qdbr/src/parquet_write/array.rs | 2 | 32 | 06.25% |
| 🔵 qdbr/src/parquet_read/slicer/rle.rs | 72 | 109 | 66.06% |
| 🔵 qdbr/src/parquet_read/slicer/mod.rs | 123 | 175 | 70.29% |
| 🔵 qdbr/src/parquet_read/column_sink/fixed.rs | 119 | 162 | 73.46% |
| 🔵 qdbr/parquet2/src/encoding/bitpacked/decode.rs | 21 | 24 | 87.50% |
| 🔵 io/questdb/cairo/O3PartitionJob.java | 8 | 9 | 88.89% |
| 🔵 qdbr/src/parquet_read/column_sink/var.rs | 76 | 79 | 96.20% |
| 🔵 qdbr/src/parquet_write/varchar.rs | 29 | 30 | 96.67% |
| 🔵 qdbr/src/parquet_read/slicer/tests.rs | 550 | 556 | 98.92% |
| 🔵 qdbr/parquet2/src/encoding/hybrid_rle/encoder.rs | 1 | 1 | 100.00% |
| 🔵 io/questdb/griffin/engine/table/parquet/OwnedMemoryPartitionDescriptor.java | 1 | 1 | 100.00% |
| 🔵 qdbr/src/parquet_write/symbol.rs | 16 | 16 | 100.00% |
| 🔵 io/questdb/griffin/engine/table/parquet/PartitionDescriptor.java | 10 | 10 | 100.00% |
| 🔵 io/questdb/cairo/DefaultCairoConfiguration.java | 8 | 8 | 100.00% |
| 🔵 qdbr/src/parquet_write/file.rs | 1 | 1 | 100.00% |
| 🔵 io/questdb/griffin/engine/table/parquet/PartitionEncoder.java | 3 | 3 | 100.00% |
| 🔵 io/questdb/cairo/TableWriter.java | 5 | 5 | 100.00% |
| 🔵 qdbr/src/parquet_write/util.rs | 3 | 3 | 100.00% |
| 🔵 io/questdb/PropServerConfiguration.java | 5 | 5 | 100.00% |
| 🔵 qdbr/src/allocator.rs | 4 | 4 | 100.00% |
| 🔵 qdbr/src/parquet_read/decode.rs | 21 | 21 | 100.00% |
| 🔵 io/questdb/cutlass/parquet/CopyExportRequestTask.java | 12 | 12 | 100.00% |
| 🔵 io/questdb/cairo/sql/PageFrameMemoryPool.java | 13 | 13 | 100.00% |
| 🔵 qdbr/src/parquet_read/column_sink/tests.rs | 243 | 243 | 100.00% |
| 🔵 qdbr/src/parquet_write/schema.rs | 30 | 30 | 100.00% |


Labels

Performance Performance improvements

4 participants