
fix(core): fix dictionary and bitpack encoding for symbol columns in parquet#6708

Merged
bluestreak01 merged 22 commits into master from parquet_encoding_fix
Jan 28, 2026

Conversation

@kafka1991
Collaborator

@kafka1991 kafka1991 commented Jan 27, 2026

Fix Parquet export compatibility with strict readers (PyArrow, Trino) for SYMBOL columns spanning multiple row groups.

Fixes #6692

Changes:

  • Generate a single dictionary page per column chunk (was incorrectly writing multiple)
  • Fix RLE/bitpack encoding padding
  • Introduce Parquet export compatibility tests (with pyarrow, duckdb, pandas, polars, pyspark)

Duplicates #6715
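The dictionary fix in the first bullet can be sketched in Python (illustrative only — the actual writer is Rust, and `encode_symbol_chunk` is a made-up name): the whole column chunk gets one dictionary page, and each row group only gets a page of indices into it, since strict readers reject a column chunk containing more than one dictionary page.

```python
def encode_symbol_chunk(partitions):
    """partitions: list of lists of symbol strings (one list per row group).

    Returns (dictionary, index_pages): a single dictionary of distinct
    symbols for the entire column chunk, plus one page of dictionary
    indices per row group.
    """
    dictionary = []   # distinct symbols, in first-seen order
    positions = {}    # symbol -> index in the dictionary
    index_pages = []
    for rows in partitions:
        page = []
        for sym in rows:
            if sym not in positions:
                positions[sym] = len(dictionary)
                dictionary.append(sym)
            page.append(positions[sym])
        index_pages.append(page)
    return dictionary, index_pages

dictionary, pages = encode_symbol_chunk([["a", "b", "a"], ["b", "c"]])
print(dictionary)  # ['a', 'b', 'c']
print(pages)       # [[0, 1, 0], [1, 2]]
```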

@kafka1991 kafka1991 marked this pull request as draft January 27, 2026 05:52
@coderabbitai

coderabbitai bot commented Jan 27, 2026

Walkthrough

Adds a Parquet compatibility CI workflow and a Python test harness, introduces multiple Parquet-related test requirement files, and implements several Rust changes to symbol handling, dictionary/page construction, and RLE encoding in the Parquet writer, plus related tests and small Java test updates.

Changes

Cohort / File(s) Summary
CI Workflow
.github/workflows/parquet_compat.yml
New GitHub Actions workflow "Parquet Export Compatibility" that builds distribution, starts QuestDB, and runs isolated Python test suites across multiple Parquet reader variants; uploads logs on failure.
Python test requirements
compat/src/test/python/requirements_parquet_*.txt
`compat/src/test/python/requirements_parquet_pyarrow22.txt`, `...pyarrow23.txt`, `...duckdb.txt`, `...pandas.txt`, `...polars.txt`, `...pyspark.txt`
Adds six requirements files pinning/declaring dependencies for pyarrow22, pyarrow23, duckdb, pandas+fastparquet, polars, and pyspark test environments (all include requests>=2.28.0).
Parquet test runner
compat/src/test/python/runner_parquet.py
New Python test harness with HTTP utilities to execute SQL and export Parquet, multiple backend readers (PyArrow, DuckDB, Pandas, Polars, PySpark) with priority detection, and symbol-focused compatibility test cases and CLI entrypoint.
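The "priority detection" of reader backends described above could work roughly like this sketch (hypothetical function name, not the actual `runner_parquet.py` API): pick the first backend whose package is importable in the current environment.

```python
import importlib.util

def detect_backend(candidates):
    """Return the first importable top-level module name, else None."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None

# In the harness the list would be something like
# ["pyarrow", "duckdb", "pandas", "polars", "pyspark"]; stdlib names
# are used here to keep the sketch self-contained.
print(detect_backend(["definitely_missing_pkg", "json"]))  # json
```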
Column type helper
core/rust/qdb-core/src/col_type.rs
Adds public method is_symbol(&self) -> bool to ColumnType.
RLE encoder fix
core/rust/qdbr/parquet2/src/encoding/hybrid_rle/encoder.rs
Fixes size calculations for final packed segment, zero-fills remainder buffer, and adjusts final write slice; test expectations updated for trailing zero.
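The RLE encoder fix above concerns the final, partial group of the hybrid RLE/bit-packed encoding: indices are packed in groups of 8 values, so the last group must be zero-padded to a whole number of bytes rather than left with stale buffer contents. A pure-Python sketch of that detail (not the parquet2 code itself):

```python
def bitpack(values, bit_width):
    """Little-endian bit-pack `values`, zero-padding the last group of 8."""
    remainder = len(values) % 8
    if remainder:
        values = values + [0] * (8 - remainder)  # zero-fill, don't leave garbage
    out = bytearray()
    for i in range(0, len(values), 8):
        acc = 0
        for j, v in enumerate(values[i:i + 8]):
            acc |= v << (j * bit_width)
        # 8 values * bit_width bits per value = bit_width bytes per group
        out += acc.to_bytes(bit_width, "little")
    return bytes(out)

# 3 values at 2 bits each still occupy one full group of 8 -> 2 bytes,
# with the 5 padded slots contributing deterministic trailing zero bits.
print(bitpack([1, 2, 3], 2).hex())  # 3900
```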
Parquet writer: API + multi-partition handling
core/rust/qdbr/src/parquet_write/file.rs
Renames set_parallel -> with_parallel; adds multi-partition symbol-column handling producing global dictionary plus per-partition data pages, with parallel/non-parallel code paths and validation for primitive types.
Parquet writer: unit tests
core/rust/qdbr/src/parquet_write/mod.rs
Adds unit tests for writing row groups from multiple partitions with symbol columns (parallel and non-parallel).
Symbol encoding refactor
core/rust/qdbr/src/parquet_write/symbol.rs
Large refactor: introduces SymbolGlobalInfo, separates dictionary construction (build_symbol_dict_page), data-page-only writer, UTF-16→UTF-8 helpers, revised dict buffer building encoding only used keys, updated stats handling and public signatures.
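The "encoding only used keys" part of the refactor above can be sketched as follows (hypothetical names; the real code lives in `symbol.rs`): a QuestDB SYMBOL table can hold many more distinct values than one exported chunk uses, so the dictionary page should contain only the symbols that actually appear, with global keys remapped to dense local indices.

```python
def build_used_dict(symbol_table, used_keys):
    """symbol_table: global key -> string; used_keys: keys seen in the chunk.

    Returns (dict_values, remap): the dense dictionary page payload and a
    mapping from global key to its dense index.
    """
    remap = {}
    dict_values = []
    for key in sorted(set(used_keys)):
        remap[key] = len(dict_values)
        dict_values.append(symbol_table[key])
    return dict_values, remap

table = {0: "AAPL", 1: "MSFT", 2: "GOOG", 3: "AMZN"}
values, remap = build_used_dict(table, [3, 1, 3])
print(values)  # ['MSFT', 'AMZN']
print(remap)   # {1: 0, 3: 1}
```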
Java test expectations
core/src/test/java/io/questdb/test/cutlass/http/ExpParquetExportTest.java
Adjusts expected Parquet output sizes in export tests to match encoder/dictionary changes.
Workflow action version bumps
.github/workflows/pgwire_latest.yml, .github/workflows/pgwire_stable.yml
Bumps actions/setup-python usage from v2 to v5 in these workflows.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested labels

Bug, SQL

Suggested reviewers

  • puzpuzpuz
  • bluestreak01
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Description check — ✅ Passed: The description clearly relates to the changeset, explaining the bug fix (single dictionary page, RLE/bitpack padding), linking to issue #6692, and listing the test additions (pyarrow, duckdb, pandas, polars, pyspark).
  • Linked Issues check — ✅ Passed: The PR addresses all core requirements from issue #6692: fixes dictionary encoding (single page per column chunk), corrects RLE/bitpack padding, and adds comprehensive Parquet compatibility tests with multiple backends.
  • Out of Scope Changes check — ✅ Passed: Minor workflow updates to setup-python actions (v2→v5) are slightly tangential but reasonable maintenance; all code changes directly address Parquet symbol encoding issues and their testing.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
  • Title check — ✅ Passed: The PR title 'fix(core): fix dictionary and bitpack encoding for symbol columns in parquet' clearly and concisely describes the main changes — fixing Parquet encoding issues for SYMBOL columns. It matches the primary objective of correcting dictionary page generation and RLE/bitpack encoding for multi-row-group SYMBOL exports.


@kafka1991 kafka1991 changed the title from "fix(parquet): 1. generate only one dict page for single rowGroup 2. correct bitpack encoding" to "fix(parquet): fix dictionary and bitpack encoding for symbol columns" Jan 28, 2026
@kafka1991 kafka1991 marked this pull request as ready for review January 28, 2026 07:19
@bluestreak01
Member

@CodeRabbit can you critically review this pr?

@coderabbitai

coderabbitai bot commented Jan 28, 2026

@bluestreak01, I'll conduct a critical review of this PR. However, I want to highlight some concerns upfront:

Critical Issues to Address:

  1. 🚨 Enterprise CI Failure: The enterprise test suite has failed on commit 360b5415. This is a blocker that must be investigated and resolved before merging. The failure could indicate that the Parquet encoding changes have broken compatibility with enterprise features or downstream consumers.

  2. Test Coverage Concerns:

    • The main file qdbr/src/parquet_write/file.rs has only 56.67% coverage (68/120 lines)
    • Overall PR coverage is 74%, which leaves significant code paths untested
    • Given this PR fixes data serialization correctness issues for strict readers (PyArrow, Trino), the test coverage should be higher to ensure the fix works correctly across different scenarios
  3. Data Correctness Risk: This PR modifies core Parquet encoding logic (dictionary pages, RLE/bitpack encoding). These are critical areas where bugs can lead to:

    • Silent data corruption
    • Incompatibility with readers
    • Loss of data integrity

Recommendations:

  • Investigate and fix the Enterprise CI failure first
  • Add comprehensive tests covering multi-row group scenarios with SYMBOL columns
  • Verify compatibility with PyArrow and Trino explicitly in tests
  • Consider adding integration tests that write and read back the data with external tools

Let me perform a detailed code review now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

GitHub Actions - Rebuild Native Libraries and others added 2 commits January 28, 2026 08:36
@kafka1991
Collaborator Author

kafka1991 commented Jan 28, 2026

The EntCI failure might be caused by the byte counts asserted in the Parquet tests; we may need @bluestreak01's help.

@kafka1991
Collaborator Author

@coderabbitai review! Find bugs and such. Please report back promptly.

@coderabbitai

coderabbitai bot commented Jan 28, 2026

@kafka1991, I'm conducting a critical review of the PR now, focusing on:

🔍 Review Focus Areas:

  • Encoding correctness: Dictionary page generation and RLE/bitpack encoding logic
  • Multi-row group scenarios: Ensuring SYMBOL columns work correctly across row group boundaries
  • Edge cases: Empty dictionaries, maximum symbol counts, padding calculations
  • Memory safety: Buffer handling and allocation patterns in the Rust code
  • Test coverage gaps: Identifying untested code paths, especially in file.rs (56.67% coverage)
  • EntCI failure root cause: Examining byte-level assertions that may have changed

I'll analyze the code changes in detail and report back with any bugs, potential issues, or concerns.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Fix all issues with AI agents
In @.github/workflows/parquet_compat.yml:
- Around line 25-28: Update the GitHub Actions step named "Setup Python version"
to use the latest setup action by replacing uses: actions/setup-python@v2 with
uses: actions/setup-python@v5; ensure the step still supplies the python-version
input (python-version: 3.12) and adjust any v5-specific inputs or syntax if
required so the workflow remains compatible with the newer action.

In `@compat/src/test/python/runner_parquet.py`:
- Line 520: In TestSymbolColumnTop.run(), the unpacked variable num_row_groups
from the call total_rows, num_row_groups = read_parquet(parquet_path) is never
used; change the unpacking to use an intentionally discarded name with an
underscore prefix (e.g., total_rows, _num_row_groups =
read_parquet(parquet_path) or total_rows, _ = read_parquet(parquet_path)) so
static analysis no longer flags an unused variable while keeping intent clear.

In `@core/rust/qdbr/src/parquet_write/symbol.rs`:
- Around line 152-218: The function symbol_to_data_page_only computes
bits_per_key from global_max_key but does not validate individual local keys;
add a guard before pushing bits_per_key/calling encode_u32 to iterate local_keys
(or column_values) and ensure every non-null key <= global_max_key (and >= 0),
returning an Err(ParquetError::... via ParquetResult) if any key violates the
invariant so encoding cannot proceed with an out-of-range value; perform the
check after computing bits_per_key and before encode_u32 to fail fast and
reference symbols: symbol_to_data_page_only, global_max_key, bits_per_key,
local_keys, and encode_u32.
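The guard described in that comment could look like this Python sketch (names mirror the comment's symbols; the actual implementation would be Rust returning a `ParquetResult`): validate every non-null key against `global_max_key` before bit-packing, so an out-of-range value fails fast instead of corrupting the encoded page.

```python
def check_keys(local_keys, global_max_key, null_key=-1):
    """Raise ValueError if any non-null key falls outside [0, global_max_key]."""
    for key in local_keys:
        if key == null_key:
            continue  # nulls are handled separately, not via the dictionary
        if key < 0 or key > global_max_key:
            raise ValueError(
                f"symbol key {key} out of range [0, {global_max_key}]"
            )

check_keys([0, 2, -1, 1], global_max_key=2)  # passes silently
try:
    check_keys([0, 5], global_max_key=2)
except ValueError as e:
    print(e)  # symbol key 5 out of range [0, 2]
```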
🧹 Nitpick comments (4)
compat/src/test/python/requirements_parquet_duckdb.txt (1)

1-3: Consider pinning pytz version for reproducibility.

The dependencies are appropriate for DuckDB-based Parquet compatibility testing. For more reproducible test environments, consider adding a minimum version constraint to pytz (e.g., pytz>=2023.3), similar to the other packages.

.github/workflows/parquet_compat.yml (1)

29-70: Consider adding a job timeout.

The workflow doesn't specify a timeout, which could be problematic if QuestDB hangs or tests run indefinitely. Consider adding a timeout at the job level.

Suggested addition
 jobs:
   build:
     if: ${{ github.event.pull_request.head.repo.full_name == github.repository }}
     runs-on: ubuntu-latest
+    timeout-minutes: 30
     steps:
core/rust/qdbr/src/parquet_write/file.rs (2)

483-524: Minor inconsistency in partition_ranges construction order.

In the parallel path (lines 397-411), partition_ranges is built inside the symbol column check. In the non-parallel path (lines 485-499), it's built before the check. This is a minor inconsistency but doesn't affect correctness since partition_ranges is used by both the symbol and non-symbol paths in the non-parallel branch.


663-666: Implicit assumption: All partitions share the same symbol table.

The code assumes first_column.symbol_offsets and first_column.secondary_data are identical across all partitions. This is correct for QuestDB's symbol semantics (symbols are table-scoped), but adding a debug assertion or comment would make this assumption explicit.

Optional: Add debug assertion
     // All partitions share the same symbol table
     let first_column = partition_ranges[0].0;
     let offsets = first_column.symbol_offsets;
     let chars = first_column.secondary_data;
+    
+    #[cfg(debug_assertions)]
+    for (col, _, _) in partition_ranges.iter().skip(1) {
+        debug_assert_eq!(col.symbol_offsets.len(), offsets.len(), "Symbol tables must be identical across partitions");
+    }

@kafka1991
Collaborator Author

@coderabbitai I've addressed your comments, pls review again!

@coderabbitai

coderabbitai bot commented Jan 28, 2026

@kafka1991, thank you for addressing the feedback! I can see significant improvements:

✅ Coverage Improvements:

  • Overall PR coverage improved from 74% → 92.38% (376/407 lines)
  • qdbr/src/parquet_write/file.rs improved from 56.67% → 84.82% (95/112 lines)
  • Other core files showing 92.82%-100% coverage

This is excellent progress on the test coverage concerns I raised earlier!

⚠️ Remaining Blocker:
However, the Enterprise CI is still failing on commit 71536999206b2c0547622affb7c0e10267ae72b8. As you mentioned, this might be related to byte-count assertions in Parquet tests. Before this can be merged, you'll need to:

  1. Investigate the specific EntCI test failures
  2. Update any byte-count assertions that changed due to the encoding fixes
  3. Verify the byte count changes are expected and correct

I'll trigger a comprehensive review now to examine the latest changes in detail.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@core/rust/qdbr/src/parquet_write/symbol.rs`:
- Around line 249-299: The code reads qdb_utf16_len as an i32 then casts to
usize which can overflow for negative or huge values; update
get_symbol_utf16_bytes to validate the i32 before casting (ensure qdb_utf16_len
>= 0), use checked_mul and checked_add when computing required_len =
UTF16_LEN_SIZE + qdb_utf16_len * 2 and when slicing (avoid panics on overflow),
and return None on any invalid/overflowing length; keep callers
read_symbol_as_utf8 and read_symbol_to_utf8 unchanged (they already handle None)
but ensure get_symbol_utf16_bytes uses UTF16_LEN_SIZE, i32->usize conversion
only after validation, and checked arithmetic to prevent out-of-bounds or
negative-length issues.
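The validation asked for above can be sketched in Python (the real code is Rust, where `checked_mul`/`checked_add` replace Python's unbounded integers; names are taken from the comment): treat the stored length as signed, reject negatives, and bounds-check before slicing so a corrupt header can never produce an out-of-range read.

```python
UTF16_LEN_SIZE = 4  # the i32 length prefix described in the comment

def get_symbol_utf16_bytes(buf, offset):
    """Return the UTF-16 payload at `offset`, or None for any invalid length."""
    if offset < 0 or offset + UTF16_LEN_SIZE > len(buf):
        return None
    qdb_utf16_len = int.from_bytes(
        buf[offset:offset + UTF16_LEN_SIZE], "little", signed=True
    )
    if qdb_utf16_len < 0:  # a negative i32 must never be cast to usize
        return None
    required_len = UTF16_LEN_SIZE + qdb_utf16_len * 2
    if offset + required_len > len(buf):  # Rust would use checked_mul/checked_add here
        return None
    start = offset + UTF16_LEN_SIZE
    return buf[start:start + qdb_utf16_len * 2]

payload = "hi".encode("utf-16-le")
buf = (2).to_bytes(4, "little", signed=True) + payload
print(get_symbol_utf16_bytes(buf, 0))                  # b'h\x00i\x00'
print(get_symbol_utf16_bytes(b"\xff\xff\xff\xff", 0))  # None (negative length)
```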
🧹 Nitpick comments (1)
compat/src/test/python/runner_parquet.py (1)

271-279: Consider logging teardown exceptions for debugging.

Silently swallowing teardown exceptions can make debugging difficult if tests fail due to leftover state from previous runs. While acceptable for a test harness, logging would help diagnose intermittent failures.

♻️ Optional: Log teardown failures
         finally:
             try:
                 self.teardown()
-            except Exception:
-                pass
+            except Exception as e:
+                print(f"Warning: teardown failed: {e}")

@glasstiger
Contributor

[PR Coverage check]

😍 pass : 373 / 411 (90.75%)

file detail

path | covered lines | new lines | coverage
🔵 qdbr/src/lib.rs | 0 | 1 | 00.00%
🔵 qdbr/src/parquet_write/file.rs | 94 | 112 | 83.93%
🔵 qdbr/src/parquet_write/symbol.rs | 179 | 198 | 90.40%
🔵 qdbr/parquet2/src/encoding/hybrid_rle/encoder.rs | 7 | 7 | 100.00%
🔵 qdb-core/src/col_type.rs | 3 | 3 | 100.00%
🔵 qdbr/src/parquet_write/mod.rs | 90 | 90 | 100.00%

@bluestreak01 bluestreak01 changed the title from "fix(parquet): fix dictionary and bitpack encoding for symbol columns" to "fix(core): fix dictionary and bitpack encoding for symbol columns in parquet" Jan 28, 2026
@bluestreak01 bluestreak01 merged commit 9a63678 into master Jan 28, 2026
43 checks passed
@bluestreak01 bluestreak01 deleted the parquet_encoding_fix branch January 28, 2026 13:56
@kafka1991 kafka1991 mentioned this pull request Jan 29, 2026
19 tasks


Development

Successfully merging this pull request may close these issues.

exporting symbol columns to parquet can be incompatible with strict parquet readers

4 participants