perf(parquet): breaking change 💥 - improve strings decoding performance for parquet files #6809
bluestreak01 merged 68 commits into master
Conversation
Walkthrough

Adds a new in-memory VARCHAR_SLICE type and end-to-end Parquet slice encoding/decoding support across Rust and Java: new decoders/sinks, schema/metadata ascii flag, writer dictionary path, Java runtime accessors and type mappings, benchmarks, and extensive tests validating aux-pointer semantics and Parquet read/write roundtrips.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ❌ 1 failed (1 warning), ✅ 2 passed
… across encoders and decoders
… and update related methods
@coderabbitai please review
Full review triggered.
Actionable comments posted: 17
🧹 Nitpick comments (6)
core/rust/qdbr/tests/strings.rs (1)

178-179: Turn this note into executable coverage (avoid comment drift). Lines 178-179 document unsupported encodings, but this can silently go stale. Please add explicit tests that assert `Encoding::RleDictionary` and `Encoding::DeltaByteArray` return the expected "unsupported" failure for `ColumnTypeTag::String` (or mark as `#[ignore]` with a TODO if the error contract is still being finalized).

Proposed test scaffold
```diff
 #[test]
 fn test_string_delta_length_byte_array() {
     run_string_test("String", Encoding::DeltaLengthByteArray);
 }
-// Note: String type only supports Plain and DeltaLengthByteArray encodings.
-// RleDictionary and DeltaByteArray are not implemented for String decode.
+#[test]
+fn test_string_rle_dictionary_unsupported() {
+    // TODO: replace with exact error assertion once contract is finalized.
+    // e.g. assert!(decode_result.is_err_and(|e| e.to_string().contains("unsupported encoding")));
+}
+
+#[test]
+fn test_string_delta_byte_array_unsupported() {
+    // TODO: replace with exact error assertion once contract is finalized.
+}
```

Based on learnings: in questdb/questdb reviews requested by javier, provide line-cited push-back with concrete follow-ups (tests/diffs/perf considerations).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/rust/qdbr/tests/strings.rs` around lines 178 - 179: Add executable tests in core/rust/qdbr/tests/strings.rs that assert attempting to decode ColumnTypeTag::String with Encoding::RleDictionary and Encoding::DeltaByteArray fails with the expected "unsupported" error; locate the test module near the existing String tests and add two cases that call the same decode/reader entry point used elsewhere in this file (e.g. the routine that maps Encoding -> decoder) and assert it returns the unsupported/Err variant, or mark each test with #[ignore] and a TODO comment if the exact error contract is still unstable; reference Encoding::RleDictionary, Encoding::DeltaByteArray and ColumnTypeTag::String in the assertions so the test will fail if the unsupported-list comment drifts.

core/rust/qdbr/benches/decode_page.rs (1)
603-613: VarcharSlice benchmark matrix is currently ASCII-fast-path only. Lines 1845 and 1907 force `ascii: Some(true)`, and the generator in lines 603-613 emits ASCII-only payloads. This misses non-ASCII decode behavior and can overstate generalized gains.

📈 Suggested extension to cover both ASCII and non-ASCII paths
```diff
-fn make_varchar_data_sized(row_count: usize, null_pct: u8, str_len: usize) -> VarcharData {
+fn make_varchar_data_sized(
+    row_count: usize,
+    null_pct: u8,
+    str_len: usize,
+    ascii: bool,
+) -> VarcharData {
@@
-        while v.len() < str_len {
-            v.push(b'a' + (v.len() % 26) as u8);
-        }
+        while v.len() < str_len {
+            if ascii {
+                v.push(b'a' + (v.len() % 26) as u8);
+            } else {
+                v.extend_from_slice("é".as_bytes());
+            }
+        }
         v.truncate(str_len);
         v
     })
     .collect();
@@
-    let header =
-        ((size as u8) << HEADER_FLAGS_WIDTH as u8) | HEADER_FLAG_INLINED | HEADER_FLAG_ASCII;
+    let header = ((size as u8) << HEADER_FLAGS_WIDTH as u8)
+        | HEADER_FLAG_INLINED
+        | if ascii { HEADER_FLAG_ASCII } else { 0 };
@@
-    let header = ((size as u32) << HEADER_FLAGS_WIDTH) | HEADER_FLAG_ASCII_32;
+    let header = ((size as u32) << HEADER_FLAGS_WIDTH)
+        | if ascii { HEADER_FLAG_ASCII_32 } else { 0 };
```

```diff
-for &str_len in &[2usize, 200] {
+for &(str_len, ascii) in &[(2usize, true), (200usize, true), (200usize, false)] {
     for &null_pct in null_pcts(true) {
-        let data = make_varchar_data_sized(ROW_COUNT, null_pct, str_len);
+        let data = make_varchar_data_sized(ROW_COUNT, null_pct, str_len, ascii);
@@
-            format!("varchar_slice_delta_len_s{str_len}_n{null_pct}"),
+            format!("varchar_slice_delta_len_a{ascii}_s{str_len}_n{null_pct}"),
@@
-            Some(true),
+            Some(ascii),
             ROW_COUNT,
         ));
     }
 }
```

Mirror the same `ascii` matrix in the dictionary-encoded `varchar_slice_dict_*` block.

Also applies to: 1839-1847, 1901-1908
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/rust/qdbr/benches/decode_page.rs` around lines 603 - 613, The benchmark currently only generates ASCII payloads via the values Vec< Vec<u8> > generator and forces ascii: Some(true) in the varchar_slice and varchar_slice_dict benchmarks, missing non-ASCII decode paths; update the values generator in decode_page.rs to produce both ASCII and non-ASCII variants (e.g., include at least one multibyte UTF-8 character such as "ñ" or a CJK character in some entries or alternate rows) and extend the benchmark matrix for the dictionary-encoded varchar_slice_dict_* cases to mirror the same ascii: Some(true/false) variants (i.e., add ascii: Some(false) runs) so both ASCII and non-ASCII decode paths are exercised. Ensure you modify the generator that builds values and the varchar_slice_dict_* benchmark configuration to include non-ASCII cases.core/rust/qdb-core/src/col_driver/mod.rs (1)
85-88: Add a targeted test for the new `VarcharSlice` rejection branch. You introduced a new error path in `try_lookup_driver`, but `test_lookup_driver` currently only validates success cases. A small negative test will lock this behavior down.

✅ Suggested test addition
```diff
 #[test]
 fn test_lookup_driver() {
 @@
 }
+#[test]
+fn test_lookup_driver_rejects_varchar_slice() {
+    let err = try_lookup_driver(ColumnTypeTag::VarcharSlice.into_type()).unwrap_err();
+    let msg = err.to_string();
+    assert!(msg.contains("transient in-memory type"));
+}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/rust/qdb-core/src/col_driver/mod.rs` around lines 85 - 88, Add a negative unit test that asserts try_lookup_driver returns the InvalidType error for ColumnTypeTag::VarcharSlice: update the test suite that contains test_lookup_driver to include a case calling try_lookup_driver(ColumnTypeTag::VarcharSlice, ...) and assert it Errs with the fmt_err InvalidType message (or matches the error variant), so the new rejection branch in mod.rs is covered; use the same helper/setup used by test_lookup_driver to obtain any necessary context/arguments.core/src/test/java/io/questdb/test/cairo/map/UnorderedVarcharMapTest.java (1)
264-292: Add a deferred `copyFrom()` scenario with unstable source keys. `testDeferredKeyCopyCopyFrom` currently validates stable-source behavior. A companion case with unstable source pointers would pin the deferred-copy branch in `Key.copyFrom()`.

✅ Minimal test addition sketch
```diff
+    @Test
+    public void testDeferredKeyCopyCopyFromUnstable() throws Exception {
+        TestUtils.assertMemoryLeak(() -> {
+            SingleColumnType valueTypes = new SingleColumnType(ColumnType.INT);
+            try (
+                    UnorderedVarcharMap mapA = newDeferredKeyCopyMap(valueTypes);
+                    UnorderedVarcharMap mapB = newDeferredKeyCopyMap(valueTypes)
+            ) {
+                final int n = 1_000;
+                for (int i = 0; i < n; i++) {
+                    MapKey keyA = mapA.withKey();
+                    keyA.putVarchar(new Utf8String("u" + i)); // unstable/on-heap
+
+                    MapKey keyB = mapB.withKey();
+                    keyB.copyFrom(keyA);
+                    MapValue valueB = keyB.createValue();
+                    Assert.assertTrue(valueB.isNew());
+                    valueB.putInt(0, i);
+                }
+                for (int i = 0; i < n; i++) {
+                    Assert.assertEquals(i, get("u" + i, mapB));
+                }
+            }
+        });
+    }
```
Verify each finding against the current code and only fix it if needed. In `@core/src/test/java/io/questdb/test/cairo/map/UnorderedVarcharMapTest.java` around lines 264 - 292, Add a companion test that exercises MapKey.copyFrom() when the source key is unstable (unpinned) so the deferred-copy branch in Key.copyFrom() is covered: create a new test (e.g., testDeferredKeyCopyCopyFrom_unstable) mirroring testDeferredKeyCopyCopyFrom but use the unstable-key insertion helper (the counterpart to putStable, e.g., putUnstable or the method that leaves source pointers non-pinned) when populating mapA, then perform the same copyFrom into mapB (using UnorderedVarcharMap, MapKey.copyFrom, newDeferredKeyCopyMap, DirectUtf8Sink) and assert sizes and values as in the existing test to verify correctness.core/rust/qdbr/src/parquet_read/decode.rs (1)
2376-2381: Consider a small test helper for `QdbMetaCol` construction. `QdbMetaCol { ..., ascii: None }` is now repeated across many tests; a helper will reduce churn and future field-drift errors.

♻️ Minimal refactor sketch
```diff
+    fn test_meta_col(column_type: ColumnType, format: Option<QdbMetaColFormat>) -> QdbMetaCol {
+        QdbMetaCol {
+            column_type,
+            column_top: 0,
+            format,
+            ascii: None,
+        }
+    }
 ...
-    let col_info = QdbMetaCol {
-        column_type,
-        column_top: 0,
-        format: None,
-        ascii: None,
-    };
+    let col_info = test_meta_col(column_type, None);
```
Verify each finding against the current code and only fix it if needed. In `@core/rust/qdbr/src/parquet_read/decode.rs` around lines 2376 - 2381, Multiple tests repeat QdbMetaCol { column_type, column_top: 0, format: None, ascii: None }; add a small test helper to centralize construction (e.g., a function like test_qdb_meta_col(column_type: Type) or a builder like QdbMetaCol::test_with(column_type) in the tests/util module) that returns a QdbMetaCol with column_top=0, format=None, ascii=None so tests call that helper instead of repeating the literal; update existing tests to use the helper to reduce duplication and prevent future field-drift.core/rust/qdbr/src/parquet_read/row_groups.rs (1)
130-147: Extract remapping rules into one helper. Symbol/Varchar/VarcharSlice compatibility rules are implemented three times. Consolidating this into one function will reduce drift risk between full decode, filtered decode, and stats validation.
Also applies to: 257-270, 779-786
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/rust/qdbr/src/parquet_read/row_groups.rs` around lines 130 - 147, The symbol/varchar/varcharslice compatibility logic is duplicated across read_parquet, filtered decode, and stats validation (e.g., the blocks comparing ColumnTypeTag::Symbol, ::Varchar and ::VarcharSlice that reassign column_type = to_column_type); extract this logic into a single helper function (e.g., normalize_parquet_column_type or remap_symbol_varchar) that takes (column_type, to_column_type) and returns the remapped ColumnType; replace the three duplicated blocks (including the occurrences around read_parquet and the ranges noted) with calls to that helper so all callers use the same remapping behavior and avoid drift.
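A consolidated helper along the lines the prompt describes could look like this. This is a sketch with a simplified stand-in enum; the real code operates on qdb-core's `ColumnType`/`ColumnTypeTag`, and the exact compatibility matrix shown here is an assumption based on the review text.

```rust
// Sketch of the suggested consolidation, using a stand-in Tag enum.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Tag {
    Symbol,
    Varchar,
    VarcharSlice,
    Other,
}

/// Returns the type to decode into when the file column and the requested
/// target are compatible string-like types, or None when they are not.
/// The compatibility pairs below are illustrative assumptions.
fn remap_string_like(file_tag: Tag, to_tag: Tag) -> Option<Tag> {
    match (file_tag, to_tag) {
        // identical types always pass through
        (a, b) if a == b => Some(b),
        // symbol/varchar columns may be read as varchar slices
        (Tag::Symbol | Tag::Varchar, Tag::VarcharSlice) => Some(to_tag),
        // slice-backed columns may be materialized back to varchar
        (Tag::VarcharSlice, Tag::Varchar) => Some(to_tag),
        _ => None,
    }
}

fn main() {
    assert_eq!(remap_string_like(Tag::Varchar, Tag::VarcharSlice), Some(Tag::VarcharSlice));
    assert_eq!(remap_string_like(Tag::Other, Tag::VarcharSlice), None);
}
```

Having full decode, filtered decode, and stats validation all call one such function keeps the three call sites from drifting apart.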
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@benchmarks/src/main/java/org/questdb/VarcharSliceBenchmark.java`:
- Line 85: Update the numeric literal in the JMH `@Param` annotation to use a
grouped numeric literal for readability: change the value in the `@Param`
annotation in VarcharSliceBenchmark (the field annotated with
`@Param`({"500000"})) to use an underscore separator (e.g., "500_000") so it
follows the project's coding guideline for numbers with 5+ digits.
In `@core/rust/qdbr/benches/decode_page.rs`:
- Around line 1857-1877: The loop over varchar_to_dict_pages currently
overwrites data_page on every Page::Data and allows dict to remain None, which
yields incorrect benchmarking; change the loop in the block that builds the
"varchar_dict_s{str_len}_n{null_pct}" case (and the analogous
varchar_slice_dict_* block) to capture the first data page and require a dict:
iterate pages from varchar_to_dict_pages, set dict = Some(p) on the first
Page::Dict and set data_page = Some(p) only if data_page.is_none() on Page::Data
(i.e., keep the first data page encountered), then call build_case with
data_page.expect("data page") and dict.expect("dict page") (or return/error if
dict is missing) so ROW_COUNT reporting aligns with the actual data page used.
In `@core/rust/qdbr/src/parquet_read/column_sink/var.rs`:
- Around line 497-508: The push method in VarcharSliceSpillSink currently builds
the header with (len << 4) using value.len() directly, which can overflow the
28-bit length field; mask or clamp the length to 28 bits before shifting to
preserve the header format. In VarcharSliceSpillSink::push (around the
self.slicer.next() usage and header/combined packing), compute a safe_len = (len
as u32) & 0x0FFF_FFFF (or otherwise cap at 0x0FFF_FFFF) and then form header =
(safe_len << 4) | flags so oversized values do not corrupt the header, keeping
the rest of the spill packing with SPILL_MARKER and offset unchanged.
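The masking described above can be shown in isolation. This is a sketch, not the actual `VarcharSliceSpillSink` code: the surrounding sink, `SPILL_MARKER`, and offset handling are omitted, and the flag layout (4 low bits) follows the header format discussed in this review.

```rust
/// VarcharSlice headers keep the value length in a 28-bit field above
/// 4 flag bits, so lengths must be capped before shifting.
const MAX_VARCHAR_LEN: u32 = 0x0FFF_FFFF; // (1 << 28) - 1

/// Pack a header word from a value length and 4 flag bits.
/// Clamping keeps an oversized length from bleeding into the flag bits;
/// returning an error instead (as other comments suggest) is the stricter option.
fn pack_header(len: usize, flags: u32) -> u32 {
    let safe_len = len.min(MAX_VARCHAR_LEN as usize) as u32;
    (safe_len << 4) | (flags & 0xF)
}

fn main() {
    // normal value: length survives the round trip
    let h = pack_header(100, 0b0001);
    assert_eq!(h >> 4, 100);
    assert_eq!(h & 0xF, 0b0001);

    // oversized value: clamped to the 28-bit maximum, flags untouched
    let h = pack_header(usize::MAX, 0b0001);
    assert_eq!(h >> 4, MAX_VARCHAR_LEN);
    assert_eq!(h & 0xF, 0b0001);
}
```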
In `@core/rust/qdbr/src/parquet_read/decode.rs`:
- Around line 1205-1258: Add direct regression tests that exercise
ColumnTypeTag::VarcharSlice through the decode paths shown: DeltaLengthByteArray
(using DeltaLAVarcharSliceDecoder), RleDictionary/PlainDictionary
(RleDictVarcharSliceDecoder), Plain (PlainVarSlicer -> VarcharSliceColumnSink)
and DeltaByteArray (DeltaBytesArraySlicer -> VarcharSliceSpillSink), covering
both FILTERED and FILL_NULLS=true/false permutations; each test should call
decode_page0_mode (or decode_column_chunk for multi-page cases) and assert
correct decoded output and pointer lifetime safety (including the end-of-chunk
fixup behavior and reuse of decompression buffers across pages) so the code
paths and pointer lifetimes in decode.rs are actually exercised.
- Around line 1247-1257: The code constructs a DeltaBytesArraySlicer and
VarcharSliceSpillSink and then calls decode_page0_mode, but
fixup_varchar_slice_spill_pointers is only invoked after a successful return; if
decode_page0_mode errors the spill pointers remain un-fixed. Change the flow in
the DeltaByteArray / VarcharSlice branch so that you capture the result of
decode_page0_mode (e.g. let res = decode_page0_mode::<_, FILTERED,
FILL_NULLS>(...);), then always call fixup_varchar_slice_spill_pointers(&mut
bufs.aux_vec) (or the existing fixup function) before propagating the result
(returning res? or res). Alternatively implement a small scope guard/RAII that
runs fixup_varchar_slice_spill_pointers when the VarcharSliceSpillSink is
dropped; ensure the fixup is invoked unconditionally even when decode_page0_mode
returns Err, referencing DeltaBytesArraySlicer, VarcharSliceSpillSink,
decode_page0_mode, and fixup_varchar_slice_spill_pointers.
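The scope-guard alternative mentioned above can be sketched as a small `Drop` type. The names and the marker-clearing logic here are simplified stand-ins for `fixup_varchar_slice_spill_pointers` and the real aux buffer, chosen only to show that the fixup runs on both the `Ok` and `Err` paths.

```rust
/// RAII guard over the aux buffer: clearing of spill markers happens in
/// Drop, so it runs no matter how the decode function exits.
struct SpillFixupGuard<'a> {
    aux: &'a mut Vec<u64>,
}

impl Drop for SpillFixupGuard<'_> {
    fn drop(&mut self) {
        // stand-in for fixup_varchar_slice_spill_pointers
        for e in self.aux.iter_mut() {
            if *e == u64::MAX {
                *e = 0;
            }
        }
    }
}

/// Stand-in for the DeltaByteArray decode path: writes a spill marker,
/// then either succeeds or errors out early.
fn decode(aux: &mut Vec<u64>, fail: bool) -> Result<(), String> {
    let guard = SpillFixupGuard { aux };
    guard.aux.push(u64::MAX); // decode writes a spill marker
    if fail {
        return Err("decode_page0_mode failed".into()); // guard still drops here
    }
    Ok(())
}

fn main() {
    let mut aux = Vec::new();
    // error path: the fixup still ran, the marker was cleared
    assert!(decode(&mut aux, true).is_err());
    assert_eq!(aux, vec![0]);
    // success path behaves the same
    assert!(decode(&mut aux, false).is_ok());
    assert_eq!(aux, vec![0, 0]);
}
```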
In `@core/rust/qdbr/src/parquet_read/decoders/delta_binary_packed.rs`:
- Around line 121-143: The get_end_pointer() implementation currently does
unchecked indexing/slicing (page_data[...] and page_data =
&page_data[miniblock_offset..]) and returns a raw pointer computed from
unchecked offsets; change these to bounds-checked operations: use get() or
checked slicing on page_data before accessing block_bitwidths_offset +
miniblock_index and before slicing with miniblock_offset, validate
zigzag_leb128::decode consumed bytes are within page_data, perform checked
arithmetic when updating miniblock_offset and ensure miniblock_offset <=
original_page_data.len() before computing the pointer, and return a
ParquetResult::Err (e.g., via fmt_err!(Layout, "truncated/corrupt page"))
whenever any bound check fails instead of panicking or producing an
out-of-bounds pointer.
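The bounds-checked pattern the prompt asks for looks roughly like this. It is a sketch against a plain byte slice rather than the real delta-binary-packed miniblock structures: replace unchecked `page_data[i]` indexing with `get()` and unchecked arithmetic with `checked_add`, returning an error instead of panicking on truncated input.

```rust
/// Read one bitwidth byte at `offset` and return the advanced offset.
/// A truncated page yields Err rather than a panic or an
/// out-of-bounds pointer.
fn read_bitwidth(page_data: &[u8], offset: usize) -> Result<(u8, usize), String> {
    let bw = *page_data
        .get(offset)
        .ok_or_else(|| format!("truncated page: no bitwidth at offset {offset}"))?;
    let next = offset
        .checked_add(1)
        .ok_or_else(|| "offset overflow".to_string())?;
    Ok((bw, next))
}

fn main() {
    let page = [3u8, 5, 7];
    assert_eq!(read_bitwidth(&page, 1), Ok((5, 2)));
    assert!(read_bitwidth(&page, 3).is_err()); // out of bounds -> Err, not panic
}
```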
In `@core/rust/qdbr/src/parquet_read/decoders/delta_length_array.rs`:
- Around line 125-135: The decoder must reject string lengths that exceed the
28-bit header capacity before writing VarcharSlice headers; add a check against
MAX_VARCHAR_LEN = (1 << 28) - 1 in advance_data (and the other length-checking
sites referenced around the 239-245 and 347-370 regions) so any len >
MAX_VARCHAR_LEN returns Err(fmt_err!(Layout, "string length exceeds 28-bit
header capacity")), ensure you perform the comparison safely by casting to usize
(or using a const usize) and keep the existing negative/overflow and page-bound
checks in place.
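A rejecting variant of that length check can be sketched as follows; `fmt_err!`/`ParquetResult` from the real codebase are replaced here with a plain `Result<_, String>` for self-containment.

```rust
/// Maximum string length that fits in the 28-bit VarcharSlice header field.
const MAX_VARCHAR_LEN: usize = (1 << 28) - 1;

/// Validate a decoded length before it is packed into a header; oversized
/// or negative lengths become an error instead of corrupting the flag bits.
fn check_len(len: i64) -> Result<usize, String> {
    if len < 0 {
        return Err(format!("negative string length: {len}"));
    }
    let len = len as usize;
    if len > MAX_VARCHAR_LEN {
        return Err(format!(
            "string length {len} exceeds 28-bit header capacity ({MAX_VARCHAR_LEN})"
        ));
    }
    Ok(len)
}

fn main() {
    assert_eq!(check_len(1024), Ok(1024));
    assert!(check_len(-1).is_err());
    assert!(check_len(1i64 << 28).is_err()); // one past the 28-bit maximum
}
```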
In `@core/rust/qdbr/src/parquet_read/decoders/rle_dict_varchar_slice.rs`:
- Around line 42-46: The loop that builds dict_aux in rle_dict_varchar_slice.rs
packs value length into a 28-bit field (see dict_decoder.dict_values, header and
dict_aux.push), so add an explicit cap check before packing: compute len as u32,
compare to MAX_LEN = 1 << 28 and if len >= MAX_LEN return an error (or propagate
a Result Err) with a clear message about oversized dictionary value instead of
packing it; only proceed to create header and push to dict_aux when len <
MAX_LEN.
In `@core/rust/qdbr/src/parquet_read/row_groups.rs`:
- Around line 580-586: Remove the conditional guard that skips calling
fixup_varchar_slice_spill_pointers when column_chunk_bufs.data_vec.is_empty();
always invoke fixup_varchar_slice_spill_pointers for VarcharSlice DeltaByteArray
pages so SPILL_MARKER entries in aux are cleared even when data_vec.len() == 0.
Locate the two decode paths where column_chunk_bufs.page_buffers is set to
varchar_slice_page_bufs (the block that checks is_varchar_slice and the other
similar branch) and delete the if !column_chunk_bufs.data_vec.is_empty() check
so fixup_varchar_slice_spill_pointers(column_chunk_bufs) runs unconditionally;
optionally add a unit test covering the all-empty-strings DeltaByteArray case to
confirm SPILL_MARKER bytes (4-7) are zeroed.
In `@core/rust/qdbr/src/parquet_write/file.rs`:
- Around line 920-946: The current branch only handles ColumnTypeTag::Varchar
and omits ColumnTypeTag::VarcharSlice, leaving slice-backed varchar columns
unhandled and causing runtime errors; update the condition or add a preceding
branch to also detect ColumnTypeTag::VarcharSlice and route it to the same
varchar dict path (varchar::varchar_to_dict_pages) using the appropriate
auxiliary slice from column.secondary_data, preserving the same bounds logic
that uses column.primary_data, column.column_top, chunk_offset, chunk_length,
options, and primitive_type so both Varchar and VarcharSlice are processed
safely.
In `@core/rust/qdbr/src/parquet_write/varchar.rs`:
- Around line 369-376: The function append_varchar_slice must validate that
value.len() fits in the 28-bit length field before packing into header: compute
len as u64 or usize, check it is <= 0x0FFF_FFFF (28 bits), and if not return an
appropriate ParquetResult::Err with a clear message; only after that cast to u32
and compute header = (len << 4) | (...) as currently done. Update the validation
near the start of append_varchar_slice (before the header calculation and
reserve call) and use the existing ParquetResult error path to signal overflow.
In `@core/rust/qdbr/tests/varchar_slice.rs`:
- Around line 71-73: The contains method currently does unchecked pointer
arithmetic which can overflow; change it to use checked addition: convert len to
u64, call ptr.checked_add(len_u64) and if that returns Some(end) verify ptr >=
self.start && end <= self.end (return false on None) so the range guard cannot
be bypassed by wrapping; update the contains function (referenced by the
contains(&self, ptr: u64, len: usize) -> bool signature) to use checked_add and
short-circuit on overflow before any unsafe dereference.
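The overflow-safe range check can be written with `checked_add`, keeping the same `contains(&self, ptr, len)` shape; the `Range` type here is a minimal stand-in for the test helper.

```rust
/// A half-open address range [start, end) used to assert that decoded
/// slice pointers stay inside a known buffer.
struct Range {
    start: u64,
    end: u64,
}

impl Range {
    /// True only if [ptr, ptr + len) fits inside the range.
    /// checked_add makes a wrapping `ptr + len` return false instead of
    /// bypassing the guard.
    fn contains(&self, ptr: u64, len: usize) -> bool {
        match ptr.checked_add(len as u64) {
            Some(end) => ptr >= self.start && end <= self.end,
            None => false, // overflow: cannot possibly be in range
        }
    }
}

fn main() {
    let r = Range { start: 0x1000, end: 0x2000 };
    assert!(r.contains(0x1000, 0x1000));
    assert!(!r.contains(0x1FFF, 2));
    // a wrapping ptr + len must not sneak past the guard
    assert!(!r.contains(u64::MAX - 1, 10));
}
```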
In `@core/src/main/java/io/questdb/cairo/map/UnorderedVarcharMap.java`:
- Around line 479-484: The hex mask literal in makePackComparable(long
packedHashSizeFlags) should be reformatted with underscores for
readability—replace the current 0x7fffffffffffffffL literal with a grouped
version (e.g., 0x7fff_ffff_ffff_ffffL) while preserving the long suffix and
exact bit pattern so behavior of makePackComparable and any callers remains
unchanged.
In `@core/src/main/java/io/questdb/cairo/VarcharTypeDriver.java`:
- Around line 246-268: Add a defensive check for negative rowNum at the start of
both getSliceValue and getSliceValueSize to avoid auxEntry underflow: in
getSliceValue(long auxAddr, long rowNum, Utf8SplitString utf8SplitView) return
null if rowNum < 0 before computing auxEntry, and in getSliceValueSize(long
auxAddr, long rowNum) return TableUtils.NULL_LEN if rowNum < 0; keep all
existing behavior (VARCHAR_AUX_WIDTH_BYTES, VARCHAR_HEADER_FLAG_NULL, header bit
logic, and Utf8SplitString.of call) otherwise.
In `@core/src/main/java/io/questdb/griffin/SqlCodeGenerator.java`:
- Around line 390-431: The UNION matrix was updated to allow casts involving
VARCHAR_SLICE but generateCastFunctions() lacks handling for fromTag ==
ColumnType.VARCHAR_SLICE in the STRING and VARCHAR branches, so add cases in
generateCastFunctions() to produce cast functions when source is
ColumnType.VARCHAR_SLICE targeting STRING and VARCHAR (and vice-versa if
missing), ensuring the cast table entries match the new matrix; then add
regression tests covering UNION ALL between VARCHAR_SLICE<->VARCHAR and
VARCHAR_SLICE<->STRING and mixed-type INTERSECT/EXCEPT cases to guard against
missing cast implementations.
In `@core/src/main/java/io/questdb/jit/CompiledFilterIRSerializer.java`:
- Around line 607-609: Rename the boolean method TypesObserver.requiresScalar()
to follow the is/has prefix convention (e.g., isScalarRequired() or
hasScalarRequirement()), update all call sites including the call in
CompiledFilterIRSerializer that currently checks typesObserver.requiresScalar(),
and move the renamed method declaration into the proper alphabetical location
among the other instance methods in TypesObserver so member ordering follows the
project's sort rules; ensure method signature and javadoc remain unchanged
except for the name so callers compile cleanly.
In
`@core/src/test/java/io/questdb/test/griffin/engine/table/parquet/ReadParquetFunctionTest.java`:
- Around line 596-599: In the test ReadParquetFunctionTest (the SQL passed to
the execute(...) call), replace the SQL cast expression using CAST(NULL AS
VARCHAR) with the preferred QuestDB cast operator NULL::VARCHAR; update the
string inside the execute("CREATE TABLE x AS (SELECT ...") invocation so the
column definition reads NULL::VARCHAR for v to match project cast syntax.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8d85cb9e-c65d-4e21-a859-5ec0dfa10743
⛔ Files ignored due to path filters (8)
- core/src/main/bin/linux-aarch64/libjemalloc.so is excluded by !**/*.so
- core/src/main/resources/io/questdb/bin/darwin-aarch64/libquestdbr.dylib is excluded by !**/*.dylib
- core/src/main/resources/io/questdb/bin/darwin-x86-64/libquestdbr.dylib is excluded by !**/*.dylib
- core/src/main/resources/io/questdb/bin/linux-aarch64/libquestdb.so is excluded by !**/*.so
- core/src/main/resources/io/questdb/bin/linux-aarch64/libquestdbr.so is excluded by !**/*.so
- core/src/main/resources/io/questdb/bin/linux-x86-64/libquestdbr.so is excluded by !**/*.so
- core/src/main/resources/io/questdb/bin/windows-x86-64/libquestdb.dll is excluded by !**/*.dll
- core/src/main/resources/io/questdb/bin/windows-x86-64/questdbr.dll is excluded by !**/*.dll
📒 Files selected for processing (44)
- benchmarks/src/main/java/org/questdb/VarcharSliceBenchmark.java
- core/rust/qdb-core/src/col_driver/mod.rs
- core/rust/qdb-core/src/col_type.rs
- core/rust/qdbr/benches/decode_page.rs
- core/rust/qdbr/src/parquet/qdb_metadata.rs
- core/rust/qdbr/src/parquet_read/column_sink/var.rs
- core/rust/qdbr/src/parquet_read/decode.rs
- core/rust/qdbr/src/parquet_read/decoders/delta_binary_packed.rs
- core/rust/qdbr/src/parquet_read/decoders/delta_length_array.rs
- core/rust/qdbr/src/parquet_read/decoders/dictionary.rs
- core/rust/qdbr/src/parquet_read/decoders/mod.rs
- core/rust/qdbr/src/parquet_read/decoders/plain.rs
- core/rust/qdbr/src/parquet_read/decoders/rle.rs
- core/rust/qdbr/src/parquet_read/decoders/rle_dict_varchar_slice.rs
- core/rust/qdbr/src/parquet_read/decoders/rle_dictionary.rs
- core/rust/qdbr/src/parquet_read/mod.rs
- core/rust/qdbr/src/parquet_read/row_groups.rs
- core/rust/qdbr/src/parquet_read/slicer/mod.rs
- core/rust/qdbr/src/parquet_write/file.rs
- core/rust/qdbr/src/parquet_write/mod.rs
- core/rust/qdbr/src/parquet_write/schema.rs
- core/rust/qdbr/src/parquet_write/varchar.rs
- core/rust/qdbr/tests/common/mod.rs
- core/rust/qdbr/tests/strings.rs
- core/rust/qdbr/tests/varchar_slice.rs
- core/src/main/java/io/questdb/cairo/ColumnType.java
- core/src/main/java/io/questdb/cairo/VarcharTypeDriver.java
- core/src/main/java/io/questdb/cairo/map/MapFactory.java
- core/src/main/java/io/questdb/cairo/map/UnorderedVarcharMap.java
- core/src/main/java/io/questdb/cairo/sql/PageFrameFilteredMemoryRecord.java
- core/src/main/java/io/questdb/cairo/sql/PageFrameMemoryPool.java
- core/src/main/java/io/questdb/cairo/sql/PageFrameMemoryRecord.java
- core/src/main/java/io/questdb/griffin/FunctionParser.java
- core/src/main/java/io/questdb/griffin/SqlCodeGenerator.java
- core/src/main/java/io/questdb/griffin/engine/functions/table/ReadParquetRecordCursor.java
- core/src/main/java/io/questdb/griffin/engine/groupby/GroupByRecordCursorFactory.java
- core/src/main/java/io/questdb/griffin/engine/table/GroupByMapFragment.java
- core/src/main/java/io/questdb/jit/CompiledFilterIRSerializer.java
- core/src/main/java/io/questdb/std/str/Utf8SplitString.java
- core/src/test/java/io/questdb/test/cairo/ColumnTypeTest.java
- core/src/test/java/io/questdb/test/cairo/map/UnorderedVarcharMapTest.java
- core/src/test/java/io/questdb/test/griffin/engine/functions/catalogue/TypeOfFunctionFactoryTest.java
- core/src/test/java/io/questdb/test/griffin/engine/table/parquet/ReadParquetFunctionTest.java
- core/src/test/java/io/questdb/test/tools/TestUtils.java
core/src/main/java/io/questdb/jit/CompiledFilterIRSerializer.java (outdated, resolved)
core/src/test/java/io/questdb/test/griffin/engine/table/parquet/ReadParquetFunctionTest.java (resolved)
…rd_parquet_varchar
…pParquetExportTest
…s in LimitedSizeLongTreeChain
… filtered fill-nulls tests
@RaphDal I reviewed the PR in detail — memory lifetime, Rust decoders, Java type integration, and parquet compatibility. The pointer lifecycle design is sound (no use-after-free found), and backward/forward compat looks good. A few findings below:

Confirmed Issues

1. (HIGH) Dictionary size overflow in write path —
Correction to my review above — after double-checking, I need to retract/downgrade several findings:

Retracted (false positives)
Downgraded
Net: no real bugs found. The implementation is solid. Apologies for the false alarms — the initial review was too aggressive on edge cases without properly tracing the data origins. |
[PR Coverage check] 😍 pass: 4962 / 5117 (96.97%)
…ce for parquet files (#6809)
…ce for parquet files (questdb#6809)
This PR brings performance improvements to varchar decoding from parquet files by adding a new internal column type: varchar slice.
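The shape of the new type can be illustrated with a small model. This is not the actual implementation: the 16-byte entry layout is an assumption for illustration, based on the header format discussed in the review (28-bit length plus 4 flag bits, packed as `(len << 4) | flags`, alongside a raw pointer).

```rust
/// Illustrative model of a VarcharSlice aux entry: a packed header and a
/// raw pointer into the mmapped Parquet page (or the page's decompression
/// buffer), so no string bytes are copied on the read path.
#[repr(C)]
#[derive(Clone, Copy)]
struct AuxEntry {
    header: u64, // (len << 4) | flags, flags kept in the low 4 bits
    ptr: u64,    // address of the string bytes; no copy is made
}

impl AuxEntry {
    fn new(data: &[u8], flags: u64) -> Self {
        debug_assert!(data.len() < (1 << 28)); // length must fit the 28-bit field
        AuxEntry {
            header: ((data.len() as u64) << 4) | (flags & 0xF),
            ptr: data.as_ptr() as u64,
        }
    }

    fn len(&self) -> usize {
        ((self.header >> 4) & 0x0FFF_FFFF) as usize
    }
}

fn main() {
    let page = b"hello parquet"; // stand-in for an mmapped page
    let e = AuxEntry::new(&page[0..5], 0b0001);
    assert_eq!(e.len(), 5);
    assert_eq!(e.ptr, page.as_ptr() as u64); // points into the page, not a copy
}
```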
`VarcharSlice` aux entries store (length, pointer) pairs pointing directly into mmapped Parquet pages or per-page decompression buffers, eliminating byte copies on the read path (no inlining/asciiness check).

Breaking change
`Varchar` columns are now written with `RLE Dictionary` encoding instead of `Delta Length Byte Array`.

Benchmark