fix(core): fix dictionary and bitpack encoding for symbol columns in parquet#6708
bluestreak01 merged 22 commits into master
Conversation
Walkthrough
Adds a Parquet compatibility CI workflow and a Python test harness, introduces multiple Parquet-related test requirement files, and implements several Rust changes to symbol handling, dictionary/page construction, and RLE encoding in the Parquet writer, plus related tests and small Java test updates.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks: ✅ 5 passed
@CodeRabbit can you critically review this PR?
Let me perform a detailed code review now.

✅ Actions performed: Review triggered.
The EntCI failure might be caused by the number of bytes asserted in the Parquet assertions; we may need @bluestreak01's help.
@coderabbitai review! Find bugs and such. Please report back.
I'll analyze the code changes in detail and report back with any bugs, potential issues, or concerns.

✅ Actions performed: Review triggered.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@.github/workflows/parquet_compat.yml`:
- Around lines 25-28: Update the GitHub Actions step named "Setup Python version" to use the latest setup action by replacing `uses: actions/setup-python@v2` with `uses: actions/setup-python@v5`; ensure the step still supplies the `python-version: 3.12` input and adjust any v5-specific inputs or syntax if required so the workflow remains compatible with the newer action.
In `@compat/src/test/python/runner_parquet.py`:
- Line 520: In TestSymbolColumnTop.run(), the unpacked variable num_row_groups
from the call total_rows, num_row_groups = read_parquet(parquet_path) is never
used; change the unpacking to use an intentionally discarded name with an
underscore prefix (e.g., total_rows, _num_row_groups =
read_parquet(parquet_path) or total_rows, _ = read_parquet(parquet_path)) so
static analysis no longer flags an unused variable while keeping intent clear.
In `@core/rust/qdbr/src/parquet_write/symbol.rs`:
- Around line 152-218: The function symbol_to_data_page_only computes
bits_per_key from global_max_key but does not validate individual local keys;
add a guard before pushing bits_per_key/calling encode_u32 to iterate local_keys
(or column_values) and ensure every non-null key <= global_max_key (and >= 0),
returning an Err(ParquetError::... via ParquetResult) if any key violates the
invariant so encoding cannot proceed with an out-of-range value; perform the
check after computing bits_per_key and before encode_u32 to fail fast and
reference symbols: symbol_to_data_page_only, global_max_key, bits_per_key,
local_keys, and encode_u32.
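The guard the review asks for can be sketched in stdlib Python (an illustration of the Rust logic, not the actual `symbol.rs` code; the names `global_max_key`, `bits_per_key`, and `local_keys` mirror the review, everything else is hypothetical):

```python
def bits_per_key(global_max_key: int) -> int:
    """Bit width needed to pack any key in [0, global_max_key]."""
    return max(1, global_max_key.bit_length())

def validate_keys(local_keys, global_max_key: int) -> None:
    """Fail fast before encoding if any non-null key is out of range."""
    for key in local_keys:
        if key is None:  # nulls carry no dictionary key
            continue
        if key < 0 or key > global_max_key:
            raise ValueError(f"symbol key {key} outside [0, {global_max_key}]")

width = bits_per_key(1000)            # 1000 fits in 10 bits
validate_keys([0, 42, 1000, None], 1000)  # all in range, no error
```

The point of checking after computing `bits_per_key` but before encoding is that an out-of-range key would otherwise be silently truncated to the declared bit width, producing a page that decodes to the wrong dictionary index.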
🧹 Nitpick comments (4)
compat/src/test/python/requirements_parquet_duckdb.txt (1)

1-3: Consider pinning the pytz version for reproducibility. The dependencies are appropriate for DuckDB-based Parquet compatibility testing. For more reproducible test environments, consider adding a minimum version constraint to `pytz` (e.g., `pytz>=2023.3`), similar to the other packages.

.github/workflows/parquet_compat.yml (1)

29-70: Consider adding a job timeout. The workflow doesn't specify a timeout, which could be problematic if QuestDB hangs or tests run indefinitely. Consider adding a timeout at the job level.

Suggested addition:

```diff
 jobs:
   build:
     if: ${{ github.event.pull_request.head.repo.full_name == github.repository }}
     runs-on: ubuntu-latest
+    timeout-minutes: 30
     steps:
```

core/rust/qdbr/src/parquet_write/file.rs (2)
483-524: Minor inconsistency in `partition_ranges` construction order. In the parallel path (lines 397-411), `partition_ranges` is built inside the symbol column check. In the non-parallel path (lines 485-499), it's built before the check. This is a minor inconsistency but doesn't affect correctness, since `partition_ranges` is used by both the symbol and non-symbol paths in the non-parallel branch.
663-666: Implicit assumption: all partitions share the same symbol table. The code assumes `first_column.symbol_offsets` and `first_column.secondary_data` are identical across all partitions. This is correct for QuestDB's symbol semantics (symbols are table-scoped), but adding a debug assertion or comment would make this assumption explicit.

Optional: add a debug assertion:

```diff
 // All partitions share the same symbol table
 let first_column = partition_ranges[0].0;
 let offsets = first_column.symbol_offsets;
 let chars = first_column.secondary_data;
+
+#[cfg(debug_assertions)]
+for (col, _, _) in partition_ranges.iter().skip(1) {
+    debug_assert_eq!(col.symbol_offsets.len(), offsets.len(), "Symbol tables must be identical across partitions");
+}
```
@coderabbitai I've addressed your comments, please review again!
This is excellent progress on the test coverage concerns I raised earlier! I'll trigger a comprehensive review now to examine the latest changes in detail.

✅ Actions performed: Review triggered.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@core/rust/qdbr/src/parquet_write/symbol.rs`:
- Around line 249-299: The code reads qdb_utf16_len as an i32 then casts to
usize which can overflow for negative or huge values; update
get_symbol_utf16_bytes to validate the i32 before casting (ensure qdb_utf16_len
>= 0), use checked_mul and checked_add when computing required_len =
UTF16_LEN_SIZE + qdb_utf16_len * 2 and when slicing (avoid panics on overflow),
and return None on any invalid/overflowing length; keep callers
read_symbol_as_utf8 and read_symbol_to_utf8 unchanged (they already handle None)
but ensure get_symbol_utf16_bytes uses UTF16_LEN_SIZE, i32->usize conversion
only after validation, and checked arithmetic to prevent out-of-bounds or
negative-length issues.
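The validation being requested can be sketched in stdlib Python (a hypothetical analogue of `get_symbol_utf16_bytes`, assuming QuestDB's layout of an i32 length prefix followed by UTF-16LE characters; Python integers don't overflow, so the comment marks where Rust needs `checked_mul`/`checked_add`):

```python
import struct

UTF16_LEN_SIZE = 4  # i32 length prefix (assumed layout)

def get_symbol_utf16_bytes(buf: bytes, offset: int):
    """Return the UTF-16LE payload at `offset`, or None if the length
    prefix is negative, out of bounds, or runs past the buffer."""
    if offset < 0 or offset + UTF16_LEN_SIZE > len(buf):
        return None
    (qdb_utf16_len,) = struct.unpack_from("<i", buf, offset)
    if qdb_utf16_len < 0:  # reject negative i32 before any size arithmetic
        return None
    payload_len = qdb_utf16_len * 2  # Rust: checked_mul to avoid overflow
    end = offset + UTF16_LEN_SIZE + payload_len  # Rust: checked_add
    if end > len(buf):
        return None
    return buf[offset + UTF16_LEN_SIZE : end]

# A 3-char symbol "abc": [len=3 as i32 LE] + UTF-16LE characters
buf = struct.pack("<i", 3) + "abc".encode("utf-16-le")
symbol = get_symbol_utf16_bytes(buf, 0)  # → b"a\x00b\x00c\x00"
```

Returning `None` on every invalid shape keeps the callers (`read_symbol_as_utf8`, `read_symbol_to_utf8` in the review's terms) on their existing error path instead of panicking on a slice out of bounds.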
🧹 Nitpick comments (1)
compat/src/test/python/runner_parquet.py (1)

271-279: Consider logging teardown exceptions for debugging. Silently swallowing teardown exceptions can make debugging difficult if tests fail due to leftover state from previous runs. While acceptable for a test harness, logging would help diagnose intermittent failures.

♻️ Optional: log teardown failures:

```diff
 finally:
     try:
         self.teardown()
-    except Exception:
-        pass
+    except Exception as e:
+        print(f"Warning: teardown failed: {e}")
```
[PR Coverage check] 😍 pass: 373 / 411 (90.75%)
Fix Parquet export compatibility with strict readers (PyArrow, Trino) for SYMBOL columns spanning multiple row groups.

Fixes #6692

Changes:
- Compatibility tests run against multiple Parquet readers (pyarrow, duckdb, pandas, polars, pyspark)

Duplicates #6715
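The encoding at the heart of this fix is Parquet's RLE/bit-packed hybrid, used for dictionary indices in data pages: strict readers reject pages whose declared bit width disagrees with the values actually packed. A stdlib-only Python sketch of a single bit-packed run follows (an illustration of the format, not the PR's Rust code; real pages mix RLE and bit-packed runs and pad groups to multiples of 8 values):

```python
def encode_bitpacked_run(values, bit_width):
    """Encode `values` as one Parquet bit-packed run (LSB-first packing).
    Pads the final group of 8 with zeros, as the format requires."""
    groups = (len(values) + 7) // 8
    header = (groups << 1) | 1  # low bit 1 marks a bit-packed run
    out = bytearray()
    # ULEB128-encode the run header
    h = header
    while True:
        byte = h & 0x7F
        h >>= 7
        out.append(byte | (0x80 if h else 0))
        if not h:
            break
    # pack values little-endian, low bits first
    acc = bit_count = 0
    for v in list(values) + [0] * (groups * 8 - len(values)):
        acc |= v << bit_count
        bit_count += bit_width
        while bit_count >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            bit_count -= 8
    return bytes(out)

# The spec's worked example: values 0..7 at width 3 pack to 0x88 0xC6 0xFA
run = encode_bitpacked_run([0, 1, 2, 3, 4, 5, 6, 7], 3)  # → b'\x03\x88\xc6\xfa'
```

This is why the bit width must be derived from the global maximum key: when all row groups share one dictionary, a page encoded with a width computed from only its local keys would be too narrow for keys that appear elsewhere, which is the kind of mismatch strict readers like PyArrow and Trino refuse to decode.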