
fix(parquet): fix parquet export bug resulting in "more than one dictionary" error #6715

Closed
ideoma wants to merge 5 commits into master from fix-parquet-export-symbols

Conversation

@ideoma
Collaborator

@ideoma ideoma commented Jan 27, 2026

Fixes #6692

Summary

  • Fix Parquet export producing files that fail to read in PyArrow with "Column cannot have more than one dictionary" error
  • Write one dictionary per row group instead of one per partition, ensuring dictionary indices are contiguous from 0 to max_used_key
  • Fix hybrid RLE encoder to properly pad remainder values in bitpacked encoding
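The dictionary-per-row-group change described above can be sketched roughly as follows. This is a minimal illustration with hypothetical names, not QuestDB's actual API: scan every partition in the row group for the maximum symbol key actually used, then build one dictionary covering keys `0..=max_key` that all partitions' index pages resolve against.

```rust
// Hypothetical sketch: one dictionary per row group spanning multiple partitions.
// Function names are illustrative, not QuestDB's real internals.

/// Scan symbol keys from all partitions in the row group for the max key used.
/// Negative keys are treated as nulls and skipped (an assumption for this sketch).
fn max_key_used(partitions: &[Vec<i32>]) -> Option<i32> {
    partitions.iter().flatten().copied().filter(|&k| k >= 0).max()
}

/// Build a single dictionary covering keys 0..=max_key, so the indices written by
/// every partition in the row group resolve against the same dictionary page.
fn build_row_group_dict(symbol_table: &[&str], max_key: i32) -> Vec<String> {
    symbol_table[..=max_key as usize]
        .iter()
        .map(|s| s.to_string())
        .collect()
}

fn main() {
    // Two partitions merged into one row group, sharing one symbol table.
    let partitions = vec![vec![0, 2], vec![1, 2, 3]];
    let table = ["AAPL", "MSFT", "GOOG", "AMZN", "NVDA"];

    let max_key = max_key_used(&partitions).unwrap();
    let dict = build_row_group_dict(&table, max_key);

    // One dictionary of 4 entries (keys 0..=3) is written once for the row group;
    // unused trailing symbols ("NVDA") are not included.
    assert_eq!(max_key, 3);
    assert_eq!(dict, vec!["AAPL", "MSFT", "GOOG", "AMZN"]);
}
```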

Details

Problem: When exporting tables with symbol columns spanning multiple row groups (or multiple partitions merged into one row group), QuestDB was writing multiple dictionary pages per column chunk. This violates the Parquet specification, which allows at most one dictionary page per column chunk, and causes PyArrow to fail with "Column cannot have more than one dictionary."

Root cause: The previous implementation wrote a dictionary page for each partition's data within a row group, rather than a single unified dictionary for the entire row group.

Solution:

  1. Compute a single dictionary per row group containing symbols from index 0 to the maximum key actually used
  2. Write the dictionary page only once (for the first partition in multi-partition row groups)
  3. Use consistent bit width based on max_key across all partitions in the row group
  4. Fix hybrid RLE encoder to zero-pad unused buffer slots when encoding remainder values
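Step 4 addresses the class of bug where a partial final run in bit-packed encoding reads stale buffer slots. A minimal sketch of the correct behavior (assuming a bit width of at most 8 so eight values fit in a `u64`; this is a simplified stand-in, not the parquet2 encoder itself):

```rust
// Hypothetical sketch of bit-packing with a zero-padded remainder, the class of
// bug the hybrid RLE fix addresses. Assumes 0 < width <= 8.
fn bitpack_run(values: &[u32], width: usize, out: &mut Vec<u8>) {
    assert!(width > 0 && width <= 8);
    for chunk in values.chunks(8) {
        // Zero the scratch buffer so a partial (remainder) chunk packs zeros,
        // not whatever values were left over from a previous run.
        let mut buf = [0u32; 8];
        buf[..chunk.len()].copy_from_slice(chunk);

        let mut acc: u64 = 0;
        for (i, &v) in buf.iter().enumerate() {
            acc |= (u64::from(v) & ((1u64 << width) - 1)) << (i * width);
        }
        // Eight width-bit values occupy exactly `width` bytes.
        out.extend((0..width).map(|i| (acc >> (8 * i)) as u8));
    }
}

fn main() {
    let mut out = Vec::new();
    // Remainder of 3 values; the other 5 slots must pack as zeros.
    bitpack_run(&[1, 2, 3], 2, &mut out);
    // 1 | 2<<2 | 3<<4 = 57; the padded slots contribute only zero bits.
    assert_eq!(out, vec![57u8, 0]);
}
```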

@coderabbitai

coderabbitai bot commented Jan 27, 2026

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Comment @coderabbitai help to get the list of available commands and usage tips.

GitHub Actions - Rebuild Native Libraries and others added 2 commits January 27, 2026 18:11
@glasstiger
Contributor

[PR Coverage check]

😍 pass : 231 / 372 (62.10%)

File detail:

| path | covered lines | new lines | coverage |
| --- | --- | --- | --- |
| 🔵 io/questdb/std/Os.java | 0 | 1 | 0.00% |
| 🔵 qdbr/src/lib.rs | 0 | 1 | 0.00% |
| 🔵 qdbr/src/parquet_write/file.rs | 154 | 288 | 53.47% |
| 🔵 qdbr/src/parquet_write/symbol.rs | 74 | 79 | 93.67% |
| 🔵 qdbr/parquet2/src/encoding/hybrid_rle/encoder.rs | 3 | 3 | 100.00% |

@questdb-butler

⚠️ Enterprise CI Failed

The enterprise test suite failed for this PR.

Build: View Details
Tested Commit: b8ae724abff6458af0ecb0d253380ce37b785085

Please investigate the failure before merging.

@bluestreak01
Member

@CodeRabbit pls review this PR critically

@coderabbitai

coderabbitai bot commented Jan 28, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

@bluestreak01, I'll perform a critical review of this PR. Let me examine the code changes in detail.


🏁 Script executed:

#!/bin/bash
# Get the PR diff to understand the changes
gh pr diff 6715 --repo questdb/questdb > pr_diff.txt
head -2000 pr_diff.txt

Length of output: 47849


🏁 Script executed:

#!/bin/bash
# Get the list of changed files
gh pr view 6715 --repo questdb/questdb --json files --jq '.files[].path'

Length of output: 760


🏁 Script executed:

#!/bin/bash
# Check for test files related to parquet
fd -e rs -e java test | rg -i parquet

Length of output: 771


🏁 Script executed:

#!/bin/bash
# Look at the Enterprise CI failure details
gh pr checks 6715 --repo questdb/questdb

Length of output: 8692


🏁 Script executed:

#!/bin/bash
# Get more details from the PR diff
tail -2000 pr_diff.txt

Length of output: 47849


@bluestreak01, I've performed a critical review of this PR. Here are my findings:

🚨 Critical Issues

1. Debug Statement in Production Code

Line 1273 in file.rs contains a println! that should be removed:

println!(
    "column_chunk_to_pages_with_dict_state: column={}, chunk_offset={}, include_dict={}",
    column.name, chunk_offset, include_dict
);

This will pollute production logs and should either be removed or converted to proper logging using a logging framework.

2. CI Failures Need Investigation

  • Enterprise CI failed - This is blocking and needs investigation
  • Rust Test and Lint failed on linux-jdk17 (8m20s) - Check what tests are failing

3. Low Test Coverage on Critical Code

  • file.rs: 53.47% coverage (154/288 lines) - This file contains the core fix
  • Os.java: 0% coverage (debug print line)
  • lib.rs: 0% coverage (trivial refactor)

The new functions like compute_max_symbol_keys, build_symbol_dict_for_max_key, and compute_row_group_dicts_for_slice need better test coverage.

⚠️ Design & Performance Concerns

4. Memory Impact: Dictionary Cloning

In symbol.rs line 336:

let dict_page = Page::Dict(DictPage::new(
    info.dict_buffer.clone(),  // <-- Full dictionary buffer clone
    info.num_values,
    false,
));

For high-cardinality symbol columns, this could allocate significant memory. Consider whether this clone is necessary or if you can work with references.
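One way to avoid the byte copy is to share the dictionary buffer behind an `Arc`, so cloning copies a pointer rather than the data. Whether the parquet2 page types can accept a shared buffer here is a separate question; this is just the general pattern, with an illustrative struct:

```rust
use std::sync::Arc;

// Hypothetical sketch: share one dictionary buffer across pages via Arc instead
// of cloning the bytes for every page that references it. `DictInfo` is an
// illustrative stand-in, not QuestDB's SymbolDictInfo.
struct DictInfo {
    dict_buffer: Arc<Vec<u8>>, // cloning the Arc copies a pointer, not the bytes
}

fn main() {
    let info = DictInfo {
        dict_buffer: Arc::new(vec![0u8; 1024]),
    };

    // Cheap handle for the dictionary page: no 1 KiB copy takes place.
    let page_buffer = Arc::clone(&info.dict_buffer);

    assert!(Arc::ptr_eq(&info.dict_buffer, &page_buffer)); // same allocation
    assert_eq!(Arc::strong_count(&info.dict_buffer), 2);   // two owners, one buffer
}
```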

5. Complex State Management

The solution introduces:

  • SymbolDictInfo struct with Arc wrapping
  • MultiPartitionColumnIteratorWithDict (renamed from MultiPartitionColumnIterator)
  • Multiple new helper functions with _with_dict_state suffixes

This complexity increases the risk of bugs in edge cases. Have you considered:

  • What happens with empty partitions?
  • Symbol columns with no values (all nulls)?
  • Symbol columns where max_key is at boundary conditions?

6. Error Path Coverage

build_symbol_dict_for_max_key has multiple error conditions:

"Symbol key {} has invalid UTF-16 length"
"Symbol key {} has offset {} outside chars buffer"
"Symbol key {} is outside offsets buffer"

Are these error paths tested? What happens in production if corrupted data triggers these?

7. Unsafe Code in Hot Path

Line 1152 in file.rs:

let k = unsafe { *keys.get_unchecked(i) };

While this may be an optimization, ensure bounds are guaranteed by the loop.
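A safe alternative that typically compiles to the same code is to iterate the slice directly, letting the compiler elide the bounds checks. A minimal sketch (illustrative function, not the actual `file.rs` loop):

```rust
// Hypothetical sketch: iterating the slice avoids per-element bounds checks on
// `keys` without reaching for `get_unchecked`, and `get_mut` turns an
// out-of-range key into a skip instead of undefined behavior.
fn mark_used_keys(keys: &[i32], used: &mut [bool]) {
    for &k in keys {
        if let Some(slot) = used.get_mut(k as usize) {
            *slot = true;
        }
    }
}

fn main() {
    let keys = [0, 2, 2, 5]; // 5 is out of range for a 4-entry table
    let mut used = [false; 4];
    mark_used_keys(&keys, &mut used);
    assert_eq!(used, [true, false, true, false]);
}
```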

📝 Code Quality Issues

8. Unnecessary Change

In lib.rs, the refactor from:

let result = a + b;
result

to:

a + b

is style-only and adds noise to the PR.

9. Debug Print in Os.java

Line 394:

System.err.println("Loading DEV Rust library: " + devRustLib);

Should this be permanent or was this for debugging?

🧪 Testing Gaps

10. Edge Cases to Verify

The new tests cover basic scenarios, but consider adding tests for:

  • Empty dictionaries (all NULL symbols)
  • Single-value dictionaries (edge case for bit width calculation)
  • Dictionary with gaps (missing symbol keys in the middle)
  • Maximum key at boundary (e.g., exactly at power of 2)
  • Very high cardinality (1M+ unique symbols)
  • Multi-partition row groups with different symbol distributions
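The power-of-two boundary case above matters because the index bit width is derived from the maximum key. A sketch of the usual calculation (an assumption about the approach, not the PR's exact code):

```rust
// Hypothetical sketch of the bit-width calculation for dictionary indices,
// showing the power-of-two boundary cases a test suite should exercise.
fn bit_width(max_key: u32) -> u32 {
    32 - max_key.leading_zeros() // bits needed to represent max_key
}

fn main() {
    assert_eq!(bit_width(0), 0);   // a single symbol at key 0 needs no index bits
    assert_eq!(bit_width(1), 1);
    assert_eq!(bit_width(255), 8); // just below a power of two
    assert_eq!(bit_width(256), 9); // crossing the boundary adds a bit
}
```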

Questions

  1. What is causing the Enterprise CI failure? Can you share the specific test failures?
  2. Why is the file size changing in existing tests? (e.g., 51882 → 51540 bytes) - Is this expected compression improvement?
  3. Have you validated with PyArrow specifically? The bug report mentions PyArrow fails - do you have a reproduction that now passes?
  4. Memory profiling: Have you tested memory usage with high-cardinality symbol columns (e.g., 100K unique symbols)?

Recommendations

  1. ✅ Remove the debug println! statement
  2. ✅ Remove or properly guard the debug System.err.println
  3. ✅ Investigate and fix CI failures before merge
  4. ✅ Add tests for edge cases mentioned above
  5. ✅ Consider adding benchmarks for memory usage with high-cardinality symbols
  6. ⚠️ Document the new dictionary management strategy in code comments
  7. ⚠️ Consider adding assertions to verify Parquet spec compliance (one dict per column chunk)
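Recommendation 7 could take the shape of a debug assertion over the pages emitted for each column chunk. A minimal sketch with an illustrative page enum (not parquet2's actual `Page` type):

```rust
// Hypothetical sketch of a spec-compliance check: at most one dictionary page
// per column chunk, and if present it must precede all data pages.
#[derive(PartialEq)]
enum PageKind {
    Dict,
    Data,
}

fn assert_one_dict_per_chunk(chunk_pages: &[PageKind]) {
    let dicts = chunk_pages.iter().filter(|p| **p == PageKind::Dict).count();
    assert!(dicts <= 1, "column chunk has {dicts} dictionary pages");
    if dicts == 1 {
        assert!(
            chunk_pages[0] == PageKind::Dict,
            "dictionary page must precede data pages"
        );
    }
}

fn main() {
    assert_one_dict_per_chunk(&[PageKind::Dict, PageKind::Data, PageKind::Data]); // ok
    assert_one_dict_per_chunk(&[PageKind::Data]); // ok: dictionary is optional
}
```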

The core fix appears sound, but these issues should be addressed before merging to production.

@ideoma ideoma closed this Jan 28, 2026


Development

Successfully merging this pull request may close these issues.

exporting symbol columns to parquet can be incompatible with strict parquet readers