
fix(core): fix dictionary and bitpack encoding for symbol columns in parquet#6708

Merged
bluestreak01 merged 22 commits into master from parquet_encoding_fix
Jan 28, 2026

Conversation

@kafka1991
Collaborator

@kafka1991 kafka1991 commented Jan 27, 2026

Fix Parquet export compatibility with strict readers (PyArrow, Trino) for SYMBOL columns spanning multiple row groups.

Fixes #6692

Changes:

  • Generate a single dictionary page per column chunk (was incorrectly writing multiple)
  • Fix RLE/bitpack encoding padding
  • Introduce Parquet export compatibility tests (with pyarrow, duckdb, pandas, polars, pyspark)

Duplicates #6715
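The dictionary fix in the first bullet can be sketched in Python (illustrative only — the actual writer is Rust, and `encode_symbol_chunk` is a made-up name): the whole column chunk gets one dictionary page, and each row group only gets a page of indices into it, since strict readers reject a column chunk containing more than one dictionary page.

```python
def encode_symbol_chunk(partitions):
    """partitions: list of lists of symbol strings (one list per row group).

    Returns (dictionary, index_pages): a single dictionary of distinct
    symbols for the entire column chunk, plus one page of dictionary
    indices per row group.
    """
    dictionary = []   # distinct symbols, in first-seen order
    positions = {}    # symbol -> index in the dictionary
    index_pages = []
    for rows in partitions:
        page = []
        for sym in rows:
            if sym not in positions:
                positions[sym] = len(dictionary)
                dictionary.append(sym)
            page.append(positions[sym])
        index_pages.append(page)
    return dictionary, index_pages

dictionary, pages = encode_symbol_chunk([["a", "b", "a"], ["b", "c"]])
print(dictionary)  # ['a', 'b', 'c']
print(pages)       # [[0, 1, 0], [1, 2]]
```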

@kafka1991 kafka1991 marked this pull request as draft January 27, 2026 05:52
@coderabbitai

coderabbitai bot commented Jan 27, 2026

Walkthrough

Adds a Parquet compatibility CI workflow and a Python test harness, introduces multiple Parquet-related test requirement files, and implements several Rust changes to symbol handling, dictionary/page construction, and RLE encoding in the Parquet writer, plus related tests and small Java test updates.

Changes

Cohort / File(s) Summary
CI Workflow
.github/workflows/parquet_compat.yml
New GitHub Actions workflow "Parquet Export Compatibility" that builds distribution, starts QuestDB, and runs isolated Python test suites across multiple Parquet reader variants; uploads logs on failure.
Python test requirements
compat/src/test/python/requirements_parquet_*.txt
`compat/src/test/python/requirements_parquet_pyarrow22.txt`, `...pyarrow23.txt`, `...duckdb.txt`, `...pandas.txt`, `...polars.txt`, `...pyspark.txt`
Adds six requirements files pinning/declaring dependencies for pyarrow22, pyarrow23, duckdb, pandas+fastparquet, polars, and pyspark test environments (all include requests>=2.28.0).
Parquet test runner
compat/src/test/python/runner_parquet.py
New Python test harness with HTTP utilities to execute SQL and export Parquet, multiple backend readers (PyArrow, DuckDB, Pandas, Polars, PySpark) with priority detection, and symbol-focused compatibility test cases and CLI entrypoint.
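The "priority detection" of reader backends described above could work roughly like this sketch (hypothetical function name, not the actual `runner_parquet.py` API): pick the first backend whose package is importable in the current environment.

```python
import importlib.util

def detect_backend(candidates):
    """Return the first importable top-level module name, else None."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None

# In the harness the list would be something like
# ["pyarrow", "duckdb", "pandas", "polars", "pyspark"]; stdlib names
# are used here to keep the sketch self-contained.
print(detect_backend(["definitely_missing_pkg", "json"]))  # json
```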
Column type helper
core/rust/qdb-core/src/col_type.rs
Adds public method is_symbol(&self) -> bool to ColumnType.
RLE encoder fix
core/rust/qdbr/parquet2/src/encoding/hybrid_rle/encoder.rs
Fixes size calculations for final packed segment, zero-fills remainder buffer, and adjusts final write slice; test expectations updated for trailing zero.
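The RLE encoder fix above concerns the final, partial group of the hybrid RLE/bit-packed encoding: indices are packed in groups of 8 values, so the last group must be zero-padded to a whole number of bytes rather than left with stale buffer contents. A pure-Python sketch of that detail (not the parquet2 code itself):

```python
def bitpack(values, bit_width):
    """Little-endian bit-pack `values`, zero-padding the last group of 8."""
    remainder = len(values) % 8
    if remainder:
        values = values + [0] * (8 - remainder)  # zero-fill, don't leave garbage
    out = bytearray()
    for i in range(0, len(values), 8):
        acc = 0
        for j, v in enumerate(values[i:i + 8]):
            acc |= v << (j * bit_width)
        # 8 values * bit_width bits per value = bit_width bytes per group
        out += acc.to_bytes(bit_width, "little")
    return bytes(out)

# 3 values at 2 bits each still occupy one full group of 8 -> 2 bytes,
# with the 5 padded slots contributing deterministic trailing zero bits.
print(bitpack([1, 2, 3], 2).hex())  # 3900
```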
Parquet writer: API + multi-partition handling
core/rust/qdbr/src/parquet_write/file.rs
Renames set_parallel -> with_parallel; adds multi-partition symbol-column handling producing global dictionary plus per-partition data pages, with parallel/non-parallel code paths and validation for primitive types.
Parquet writer: unit tests
core/rust/qdbr/src/parquet_write/mod.rs
Adds unit tests for writing row groups from multiple partitions with symbol columns (parallel and non-parallel).
Symbol encoding refactor
core/rust/qdbr/src/parquet_write/symbol.rs
Large refactor: introduces SymbolGlobalInfo, separates dictionary construction (build_symbol_dict_page), data-page-only writer, UTF-16→UTF-8 helpers, revised dict buffer building encoding only used keys, updated stats handling and public signatures.
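The "encoding only used keys" part of the refactor above can be sketched as follows (hypothetical names; the real code lives in `symbol.rs`): a QuestDB SYMBOL table can hold many more distinct values than one exported chunk uses, so the dictionary page should contain only the symbols that actually appear, with global keys remapped to dense local indices.

```python
def build_used_dict(symbol_table, used_keys):
    """symbol_table: global key -> string; used_keys: keys seen in the chunk.

    Returns (dict_values, remap): the dense dictionary page payload and a
    mapping from global key to its dense index.
    """
    remap = {}
    dict_values = []
    for key in sorted(set(used_keys)):
        remap[key] = len(dict_values)
        dict_values.append(symbol_table[key])
    return dict_values, remap

table = {0: "AAPL", 1: "MSFT", 2: "GOOG", 3: "AMZN"}
values, remap = build_used_dict(table, [3, 1, 3])
print(values)  # ['MSFT', 'AMZN']
print(remap)   # {1: 0, 3: 1}
```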
Java test expectations
core/src/test/java/io/questdb/test/cutlass/http/ExpParquetExportTest.java
Adjusts expected Parquet output sizes in export tests to match encoder/dictionary changes.
Workflow action version bumps
.github/workflows/pgwire_latest.yml, .github/workflows/pgwire_stable.yml
Bumps actions/setup-python usage from v2 to v5 in these workflows.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested labels

Bug, SQL

Suggested reviewers

  • puzpuzpuz
  • bluestreak01
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description check — ✅ Passed: The description clearly relates to the changeset, explaining the bug fix (single dictionary page, RLE/bitpack padding), linking to issue #6692, and listing the test additions (pyarrow, duckdb, pandas, polars, pyspark).
  • Linked Issues check — ✅ Passed: The PR addresses all core requirements from issue #6692: fixes dictionary encoding (single page per column chunk), corrects RLE/bitpack padding, and adds comprehensive Parquet compatibility tests with multiple backends.
  • Out of Scope Changes check — ✅ Passed: Minor workflow updates to setup-python actions (v2→v5) are slightly tangential but reasonable maintenance; all code changes directly address Parquet symbol encoding issues and their testing.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
  • Title check — ✅ Passed: The PR title 'fix(core): fix dictionary and bitpack encoding for symbol columns in parquet' clearly and concisely describes the main changes — fixing Parquet encoding issues for SYMBOL columns. It matches the primary objective of correcting dictionary page generation and RLE/bitpack encoding for multi-row-group SYMBOL exports.


@kafka1991 kafka1991 changed the title from "fix(parquet): 1. generate only one dict page for single rowGroup 2. correct bitpack encoding" to "fix(parquet): fix dictionary and bitpack encoding for symbol columns" Jan 28, 2026
@kafka1991 kafka1991 marked this pull request as ready for review January 28, 2026 07:19
@bluestreak01
Member

@CodeRabbit can you critically review this pr?

@coderabbitai

coderabbitai bot commented Jan 28, 2026

@bluestreak01, I'll conduct a critical review of this PR. However, I want to highlight some concerns upfront:

Critical Issues to Address:

  1. 🚨 Enterprise CI Failure: The enterprise test suite has failed on commit 360b5415. This is a blocker that must be investigated and resolved before merging. The failure could indicate that the Parquet encoding changes have broken compatibility with enterprise features or downstream consumers.

  2. Test Coverage Concerns:

    • The main file qdbr/src/parquet_write/file.rs has only 56.67% coverage (68/120 lines)
    • Overall PR coverage is 74%, which leaves significant code paths untested
    • Given this PR fixes data serialization correctness issues for strict readers (PyArrow, Trino), the test coverage should be higher to ensure the fix works correctly across different scenarios
  3. Data Correctness Risk: This PR modifies core Parquet encoding logic (dictionary pages, RLE/bitpack encoding). These are critical areas where bugs can lead to:

    • Silent data corruption
    • Incompatibility with readers
    • Loss of data integrity

Recommendations:

  • Investigate and fix the Enterprise CI failure first
  • Add comprehensive tests covering multi-row group scenarios with SYMBOL columns
  • Verify compatibility with PyArrow and Trino explicitly in tests
  • Consider adding integration tests that write and read back the data with external tools

Let me perform a detailed code review now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

GitHub Actions - Rebuild Native Libraries and others added 2 commits January 28, 2026 08:36
@kafka1991
Collaborator Author

kafka1991 commented Jan 28, 2026

The EntCI failure might be caused by the byte counts asserted in the Parquet tests; we may need @bluestreak01's help.

@kafka1991
Collaborator Author

@coderabbitai review! Find bugs and such. Please report back promptly.

@coderabbitai

coderabbitai bot commented Jan 28, 2026

@kafka1991, I'm conducting a critical review of the PR now, focusing on:

🔍 Review Focus Areas:

  • Encoding correctness: Dictionary page generation and RLE/bitpack encoding logic
  • Multi-row group scenarios: Ensuring SYMBOL columns work correctly across row group boundaries
  • Edge cases: Empty dictionaries, maximum symbol counts, padding calculations
  • Memory safety: Buffer handling and allocation patterns in the Rust code
  • Test coverage gaps: Identifying untested code paths, especially in file.rs (56.67% coverage)
  • EntCI failure root cause: Examining byte-level assertions that may have changed

I'll analyze the code changes in detail and report back with any bugs, potential issues, or concerns.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @.github/workflows/parquet_compat.yml:
- Around line 25-28: Update the GitHub Actions step named "Setup Python version"
to use the latest setup action by replacing uses: actions/setup-python@v2 with
uses: actions/setup-python@v5; ensure the step still supplies the python-version
input (python-version: 3.12) and adjust any v5-specific inputs or syntax if
required so the workflow remains compatible with the newer action.

In `@compat/src/test/python/runner_parquet.py`:
- Line 520: In TestSymbolColumnTop.run(), the unpacked variable num_row_groups
from the call total_rows, num_row_groups = read_parquet(parquet_path) is never
used; change the unpacking to use an intentionally discarded name with an
underscore prefix (e.g., total_rows, _num_row_groups =
read_parquet(parquet_path) or total_rows, _ = read_parquet(parquet_path)) so
static analysis no longer flags an unused variable while keeping intent clear.

In `@core/rust/qdbr/src/parquet_write/symbol.rs`:
- Around line 152-218: The function symbol_to_data_page_only computes
bits_per_key from global_max_key but does not validate individual local keys;
add a guard before pushing bits_per_key/calling encode_u32 to iterate local_keys
(or column_values) and ensure every non-null key <= global_max_key (and >= 0),
returning an Err(ParquetError::... via ParquetResult) if any key violates the
invariant so encoding cannot proceed with an out-of-range value; perform the
check after computing bits_per_key and before encode_u32 to fail fast and
reference symbols: symbol_to_data_page_only, global_max_key, bits_per_key,
local_keys, and encode_u32.
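The guard described in that comment could look like this Python sketch (names mirror the comment's symbols; the actual implementation would be Rust returning a `ParquetResult`): validate every non-null key against `global_max_key` before bit-packing, so an out-of-range value fails fast instead of corrupting the encoded page.

```python
def check_keys(local_keys, global_max_key, null_key=-1):
    """Raise ValueError if any non-null key falls outside [0, global_max_key]."""
    for key in local_keys:
        if key == null_key:
            continue  # nulls are handled separately, not via the dictionary
        if key < 0 or key > global_max_key:
            raise ValueError(
                f"symbol key {key} out of range [0, {global_max_key}]"
            )

check_keys([0, 2, -1, 1], global_max_key=2)  # passes silently
try:
    check_keys([0, 5], global_max_key=2)
except ValueError as e:
    print(e)  # symbol key 5 out of range [0, 2]
```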
🧹 Nitpick comments (4)
compat/src/test/python/requirements_parquet_duckdb.txt (1)

1-3: Consider pinning pytz version for reproducibility.

The dependencies are appropriate for DuckDB-based Parquet compatibility testing. For more reproducible test environments, consider adding a minimum version constraint to pytz (e.g., pytz>=2023.3), similar to the other packages.

.github/workflows/parquet_compat.yml (1)

29-70: Consider adding a job timeout.

The workflow doesn't specify a timeout, which could be problematic if QuestDB hangs or tests run indefinitely. Consider adding a timeout at the job level.

Suggested addition
 jobs:
   build:
     if: ${{ github.event.pull_request.head.repo.full_name == github.repository }}
     runs-on: ubuntu-latest
+    timeout-minutes: 30
     steps:
core/rust/qdbr/src/parquet_write/file.rs (2)

483-524: Minor inconsistency in partition_ranges construction order.

In the parallel path (lines 397-411), partition_ranges is built inside the symbol column check. In the non-parallel path (lines 485-499), it's built before the check. This is a minor inconsistency but doesn't affect correctness since partition_ranges is used by both the symbol and non-symbol paths in the non-parallel branch.


663-666: Implicit assumption: All partitions share the same symbol table.

The code assumes first_column.symbol_offsets and first_column.secondary_data are identical across all partitions. This is correct for QuestDB's symbol semantics (symbols are table-scoped), but adding a debug assertion or comment would make this assumption explicit.

Optional: Add debug assertion
     // All partitions share the same symbol table
     let first_column = partition_ranges[0].0;
     let offsets = first_column.symbol_offsets;
     let chars = first_column.secondary_data;
+    
+    #[cfg(debug_assertions)]
+    for (col, _, _) in partition_ranges.iter().skip(1) {
+        debug_assert_eq!(col.symbol_offsets.len(), offsets.len(), "Symbol tables must be identical across partitions");
+    }

@kafka1991
Collaborator Author

@coderabbitai I've addressed your comments, pls review again!

@coderabbitai

coderabbitai bot commented Jan 28, 2026

@kafka1991, thank you for addressing the feedback! I can see significant improvements:

✅ Coverage Improvements:

  • Overall PR coverage improved from 74% → 92.38% (376/407 lines)
  • qdbr/src/parquet_write/file.rs improved from 56.67% → 84.82% (95/112 lines)
  • Other core files showing 92.82%-100% coverage

This is excellent progress on the test coverage concerns I raised earlier!

⚠️ Remaining Blocker:
However, the Enterprise CI is still failing on commit 71536999206b2c0547622affb7c0e10267ae72b8. As you mentioned, this might be related to byte-count assertions in Parquet tests. Before this can be merged, you'll need to:

  1. Investigate the specific EntCI test failures
  2. Update any byte-count assertions that changed due to the encoding fixes
  3. Verify the byte count changes are expected and correct

I'll trigger a comprehensive review now to examine the latest changes in detail.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@core/rust/qdbr/src/parquet_write/symbol.rs`:
- Around line 249-299: The code reads qdb_utf16_len as an i32 then casts to
usize which can overflow for negative or huge values; update
get_symbol_utf16_bytes to validate the i32 before casting (ensure qdb_utf16_len
>= 0), use checked_mul and checked_add when computing required_len =
UTF16_LEN_SIZE + qdb_utf16_len * 2 and when slicing (avoid panics on overflow),
and return None on any invalid/overflowing length; keep callers
read_symbol_as_utf8 and read_symbol_to_utf8 unchanged (they already handle None)
but ensure get_symbol_utf16_bytes uses UTF16_LEN_SIZE, i32->usize conversion
only after validation, and checked arithmetic to prevent out-of-bounds or
negative-length issues.
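The validation asked for above can be sketched in Python (the real code is Rust, where `checked_mul`/`checked_add` replace Python's unbounded integers; names are taken from the comment): treat the stored length as signed, reject negatives, and bounds-check before slicing so a corrupt header can never produce an out-of-range read.

```python
UTF16_LEN_SIZE = 4  # the i32 length prefix described in the comment

def get_symbol_utf16_bytes(buf, offset):
    """Return the UTF-16 payload at `offset`, or None for any invalid length."""
    if offset < 0 or offset + UTF16_LEN_SIZE > len(buf):
        return None
    qdb_utf16_len = int.from_bytes(
        buf[offset:offset + UTF16_LEN_SIZE], "little", signed=True
    )
    if qdb_utf16_len < 0:  # a negative i32 must never be cast to usize
        return None
    required_len = UTF16_LEN_SIZE + qdb_utf16_len * 2
    if offset + required_len > len(buf):  # Rust would use checked_mul/checked_add here
        return None
    start = offset + UTF16_LEN_SIZE
    return buf[start:start + qdb_utf16_len * 2]

payload = "hi".encode("utf-16-le")
buf = (2).to_bytes(4, "little", signed=True) + payload
print(get_symbol_utf16_bytes(buf, 0))                  # b'h\x00i\x00'
print(get_symbol_utf16_bytes(b"\xff\xff\xff\xff", 0))  # None (negative length)
```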
🧹 Nitpick comments (1)
compat/src/test/python/runner_parquet.py (1)

271-279: Consider logging teardown exceptions for debugging.

Silently swallowing teardown exceptions can make debugging difficult if tests fail due to leftover state from previous runs. While acceptable for a test harness, logging would help diagnose intermittent failures.

♻️ Optional: Log teardown failures
         finally:
             try:
                 self.teardown()
-            except Exception:
-                pass
+            except Exception as e:
+                print(f"Warning: teardown failed: {e}")

@glasstiger
Contributor

[PR Coverage check]

😍 pass : 373 / 411 (90.75%)

file detail

path | covered lines | new lines | coverage
🔵 qdbr/src/lib.rs | 0 | 1 | 00.00%
🔵 qdbr/src/parquet_write/file.rs | 94 | 112 | 83.93%
🔵 qdbr/src/parquet_write/symbol.rs | 179 | 198 | 90.40%
🔵 qdbr/parquet2/src/encoding/hybrid_rle/encoder.rs | 7 | 7 | 100.00%
🔵 qdb-core/src/col_type.rs | 3 | 3 | 100.00%
🔵 qdbr/src/parquet_write/mod.rs | 90 | 90 | 100.00%

@bluestreak01 bluestreak01 changed the title from "fix(parquet): fix dictionary and bitpack encoding for symbol columns" to "fix(core): fix dictionary and bitpack encoding for symbol columns in parquet" Jan 28, 2026
@bluestreak01 bluestreak01 merged commit 9a63678 into master Jan 28, 2026
43 checks passed
@bluestreak01 bluestreak01 deleted the parquet_encoding_fix branch January 28, 2026 13:56
@kafka1991 kafka1991 mentioned this pull request Jan 29, 2026
19 tasks


Development

Successfully merging this pull request may close these issues.

exporting symbol columns to parquet can be incompatible with strict parquet readers

4 participants