
fix(parquet): fix parquet export bug resulting in "more than one dictionary" error #6715

Closed
ideoma wants to merge 5 commits into master from fix-parquet-export-symbols

Conversation

@ideoma
Collaborator

@ideoma ideoma commented Jan 27, 2026

Fixes #6692

Summary

  • Fix Parquet export producing files that fail to read in PyArrow with "Column cannot have more than one dictionary" error
  • Write one dictionary per row group instead of one per partition, ensuring dictionary indices are contiguous from 0 to max_used_key
  • Fix hybrid RLE encoder to properly pad remainder values in bitpacked encoding
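The dictionary-per-row-group change described above can be sketched roughly as follows. This is a minimal illustration with hypothetical names, not QuestDB's actual API: scan every partition in the row group for the maximum symbol key actually used, then build one dictionary covering keys `0..=max_key` that all partitions' index pages resolve against.

```rust
// Hypothetical sketch: one dictionary per row group spanning multiple partitions.
// Function names are illustrative, not QuestDB's real internals.

/// Scan symbol keys from all partitions in the row group for the max key used.
/// Negative keys are treated as nulls and skipped (an assumption for this sketch).
fn max_key_used(partitions: &[Vec<i32>]) -> Option<i32> {
    partitions.iter().flatten().copied().filter(|&k| k >= 0).max()
}

/// Build a single dictionary covering keys 0..=max_key, so the indices written by
/// every partition in the row group resolve against the same dictionary page.
fn build_row_group_dict(symbol_table: &[&str], max_key: i32) -> Vec<String> {
    symbol_table[..=max_key as usize]
        .iter()
        .map(|s| s.to_string())
        .collect()
}

fn main() {
    // Two partitions merged into one row group, sharing one symbol table.
    let partitions = vec![vec![0, 2], vec![1, 2, 3]];
    let table = ["AAPL", "MSFT", "GOOG", "AMZN", "NVDA"];

    let max_key = max_key_used(&partitions).unwrap();
    let dict = build_row_group_dict(&table, max_key);

    // One dictionary of 4 entries (keys 0..=3) is written once for the row group;
    // unused trailing symbols ("NVDA") are not included.
    assert_eq!(max_key, 3);
    assert_eq!(dict, vec!["AAPL", "MSFT", "GOOG", "AMZN"]);
}
```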

Details

Problem: When exporting tables with symbol columns spanning multiple row groups (or multiple partitions merged into one row group), QuestDB was writing multiple dictionary pages per column chunk. This violates the Parquet specification, which allows at most one dictionary page per column chunk, and causes PyArrow to fail with "Column cannot have more than one dictionary."

Root cause: The previous implementation wrote a dictionary page for each partition's data within a row group, rather than a single unified dictionary for the entire row group.

Solution:

  1. Compute a single dictionary per row group containing symbols from index 0 to the maximum key actually used
  2. Write the dictionary page only once (for the first partition in multi-partition row groups)
  3. Use consistent bit width based on max_key across all partitions in the row group
  4. Fix hybrid RLE encoder to zero-pad unused buffer slots when encoding remainder values
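Step 4 addresses the class of bug where a partial final run in bit-packed encoding reads stale buffer slots. A minimal sketch of the correct behavior (assuming a bit width of at most 8 so eight values fit in a `u64`; this is a simplified stand-in, not the parquet2 encoder itself):

```rust
// Hypothetical sketch of bit-packing with a zero-padded remainder, the class of
// bug the hybrid RLE fix addresses. Assumes 0 < width <= 8.
fn bitpack_run(values: &[u32], width: usize, out: &mut Vec<u8>) {
    assert!(width > 0 && width <= 8);
    for chunk in values.chunks(8) {
        // Zero the scratch buffer so a partial (remainder) chunk packs zeros,
        // not whatever values were left over from a previous run.
        let mut buf = [0u32; 8];
        buf[..chunk.len()].copy_from_slice(chunk);

        let mut acc: u64 = 0;
        for (i, &v) in buf.iter().enumerate() {
            acc |= (u64::from(v) & ((1u64 << width) - 1)) << (i * width);
        }
        // Eight width-bit values occupy exactly `width` bytes.
        out.extend((0..width).map(|i| (acc >> (8 * i)) as u8));
    }
}

fn main() {
    let mut out = Vec::new();
    // Remainder of 3 values; the other 5 slots must pack as zeros.
    bitpack_run(&[1, 2, 3], 2, &mut out);
    // 1 | 2<<2 | 3<<4 = 57; the padded slots contribute only zero bits.
    assert_eq!(out, vec![57u8, 0]);
}
```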

@coderabbitai

coderabbitai bot commented Jan 27, 2026

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Comment @coderabbitai help to get the list of available commands and usage tips.

GitHub Actions - Rebuild Native Libraries and others added 2 commits January 27, 2026 18:11
@glasstiger
Contributor

[PR Coverage check]

😍 pass : 231 / 372 (62.10%)

File detail:

| path | covered lines | new lines | coverage |
| --- | --- | --- | --- |
| 🔵 io/questdb/std/Os.java | 0 | 1 | 0.00% |
| 🔵 qdbr/src/lib.rs | 0 | 1 | 0.00% |
| 🔵 qdbr/src/parquet_write/file.rs | 154 | 288 | 53.47% |
| 🔵 qdbr/src/parquet_write/symbol.rs | 74 | 79 | 93.67% |
| 🔵 qdbr/parquet2/src/encoding/hybrid_rle/encoder.rs | 3 | 3 | 100.00% |

@questdb-butler

⚠️ Enterprise CI Failed

The enterprise test suite failed for this PR.

Build: View Details
Tested Commit: b8ae724abff6458af0ecb0d253380ce37b785085

Please investigate the failure before merging.

@bluestreak01
Member

@CodeRabbit pls review this PR critically

@coderabbitai

coderabbitai bot commented Jan 28, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

@bluestreak01, I'll perform a critical review of this PR. Let me examine the code changes in detail.


🏁 Script executed:

#!/bin/bash
# Get the PR diff to understand the changes
gh pr diff 6715 --repo questdb/questdb > pr_diff.txt
head -2000 pr_diff.txt

Length of output: 47849


🏁 Script executed:

#!/bin/bash
# Get the list of changed files
gh pr view 6715 --repo questdb/questdb --json files --jq '.files[].path'

Length of output: 760


🏁 Script executed:

#!/bin/bash
# Check for test files related to parquet
fd -e rs -e java test | rg -i parquet

Length of output: 771


🏁 Script executed:

#!/bin/bash
# Look at the Enterprise CI failure details
gh pr checks 6715 --repo questdb/questdb

Length of output: 8692


🏁 Script executed:

#!/bin/bash
# Get more details from the PR diff
tail -2000 pr_diff.txt

Length of output: 47849


@bluestreak01, I've performed a critical review of this PR. Here are my findings:

🚨 Critical Issues

1. Debug Statement in Production Code

Line 1273 in file.rs contains a println! that should be removed:

println!(
    "column_chunk_to_pages_with_dict_state: column={}, chunk_offset={}, include_dict={}",
    column.name, chunk_offset, include_dict
);

This will pollute production logs and should either be removed or converted to proper logging using a logging framework.

2. CI Failures Need Investigation

  • Enterprise CI failed - This is blocking and needs investigation
  • Rust Test and Lint failed on linux-jdk17 (8m20s) - Check what tests are failing

3. Low Test Coverage on Critical Code

  • file.rs: 53.47% coverage (154/288 lines) - This file contains the core fix
  • Os.java: 0% coverage (debug print line)
  • lib.rs: 0% coverage (trivial refactor)

The new functions like compute_max_symbol_keys, build_symbol_dict_for_max_key, and compute_row_group_dicts_for_slice need better test coverage.

⚠️ Design & Performance Concerns

4. Memory Impact: Dictionary Cloning

In symbol.rs line 336:

let dict_page = Page::Dict(DictPage::new(
    info.dict_buffer.clone(),  // <-- Full dictionary buffer clone
    info.num_values,
    false,
));

For high-cardinality symbol columns, this could allocate significant memory. Consider whether this clone is necessary or if you can work with references.
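One way to avoid the byte copy is to share the dictionary buffer behind an `Arc`, so cloning copies a pointer rather than the data. Whether the parquet2 page types can accept a shared buffer here is a separate question; this is just the general pattern, with an illustrative struct:

```rust
use std::sync::Arc;

// Hypothetical sketch: share one dictionary buffer across pages via Arc instead
// of cloning the bytes for every page that references it. `DictInfo` is an
// illustrative stand-in, not QuestDB's SymbolDictInfo.
struct DictInfo {
    dict_buffer: Arc<Vec<u8>>, // cloning the Arc copies a pointer, not the bytes
}

fn main() {
    let info = DictInfo {
        dict_buffer: Arc::new(vec![0u8; 1024]),
    };

    // Cheap handle for the dictionary page: no 1 KiB copy takes place.
    let page_buffer = Arc::clone(&info.dict_buffer);

    assert!(Arc::ptr_eq(&info.dict_buffer, &page_buffer)); // same allocation
    assert_eq!(Arc::strong_count(&info.dict_buffer), 2);   // two owners, one buffer
}
```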

5. Complex State Management

The solution introduces:

  • SymbolDictInfo struct with Arc wrapping
  • MultiPartitionColumnIteratorWithDict (renamed from MultiPartitionColumnIterator)
  • Multiple new helper functions with _with_dict_state suffixes

This complexity increases the risk of bugs in edge cases. Have you considered:

  • What happens with empty partitions?
  • Symbol columns with no values (all nulls)?
  • Symbol columns where max_key is at boundary conditions?

6. Error Path Coverage

build_symbol_dict_for_max_key has multiple error conditions:

"Symbol key {} has invalid UTF-16 length"
"Symbol key {} has offset {} outside chars buffer"
"Symbol key {} is outside offsets buffer"

Are these error paths tested? What happens in production if corrupted data triggers these?

7. Unsafe Code in Hot Path

Line 1152 in file.rs:

let k = unsafe { *keys.get_unchecked(i) };

While this may be an optimization, ensure bounds are guaranteed by the loop.
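A safe alternative that typically compiles to the same code is to iterate the slice directly, letting the compiler elide the bounds checks. A minimal sketch (illustrative function, not the actual `file.rs` loop):

```rust
// Hypothetical sketch: iterating the slice avoids per-element bounds checks on
// `keys` without reaching for `get_unchecked`, and `get_mut` turns an
// out-of-range key into a skip instead of undefined behavior.
fn mark_used_keys(keys: &[i32], used: &mut [bool]) {
    for &k in keys {
        if let Some(slot) = used.get_mut(k as usize) {
            *slot = true;
        }
    }
}

fn main() {
    let keys = [0, 2, 2, 5]; // 5 is out of range for a 4-entry table
    let mut used = [false; 4];
    mark_used_keys(&keys, &mut used);
    assert_eq!(used, [true, false, true, false]);
}
```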

📝 Code Quality Issues

8. Unnecessary Change

In lib.rs, the refactor from:

let result = a + b;
result

to:

a + b

is style-only and adds noise to the PR.

9. Debug Print in Os.java

Line 394:

System.err.println("Loading DEV Rust library: " + devRustLib);

Should this be permanent or was this for debugging?

🧪 Testing Gaps

10. Edge Cases to Verify

The new tests cover basic scenarios, but consider adding tests for:

  • Empty dictionaries (all NULL symbols)
  • Single-value dictionaries (edge case for bit width calculation)
  • Dictionary with gaps (missing symbol keys in the middle)
  • Maximum key at boundary (e.g., exactly at power of 2)
  • Very high cardinality (1M+ unique symbols)
  • Multi-partition row groups with different symbol distributions
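The power-of-two boundary case above matters because the index bit width is derived from the maximum key. A sketch of the usual calculation (an assumption about the approach, not the PR's exact code):

```rust
// Hypothetical sketch of the bit-width calculation for dictionary indices,
// showing the power-of-two boundary cases a test suite should exercise.
fn bit_width(max_key: u32) -> u32 {
    32 - max_key.leading_zeros() // bits needed to represent max_key
}

fn main() {
    assert_eq!(bit_width(0), 0);   // a single symbol at key 0 needs no index bits
    assert_eq!(bit_width(1), 1);
    assert_eq!(bit_width(255), 8); // just below a power of two
    assert_eq!(bit_width(256), 9); // crossing the boundary adds a bit
}
```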

Questions

  1. What is causing the Enterprise CI failure? Can you share the specific test failures?
  2. Why is the file size changing in existing tests? (e.g., 51882 → 51540 bytes) - Is this expected compression improvement?
  3. Have you validated with PyArrow specifically? The bug report mentions PyArrow fails - do you have a reproduction that now passes?
  4. Memory profiling: Have you tested memory usage with high-cardinality symbol columns (e.g., 100K unique symbols)?

Recommendations

  1. ✅ Remove the debug println! statement
  2. ✅ Remove or properly guard the debug System.err.println
  3. ✅ Investigate and fix CI failures before merge
  4. ✅ Add tests for edge cases mentioned above
  5. ✅ Consider adding benchmarks for memory usage with high-cardinality symbols
  6. ⚠️ Document the new dictionary management strategy in code comments
  7. ⚠️ Consider adding assertions to verify Parquet spec compliance (one dict per column chunk)
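Recommendation 7 could take the shape of a debug assertion over the pages emitted for each column chunk. A minimal sketch with an illustrative page enum (not parquet2's actual `Page` type):

```rust
// Hypothetical sketch of a spec-compliance check: at most one dictionary page
// per column chunk, and if present it must precede all data pages.
#[derive(PartialEq)]
enum PageKind {
    Dict,
    Data,
}

fn assert_one_dict_per_chunk(chunk_pages: &[PageKind]) {
    let dicts = chunk_pages.iter().filter(|p| **p == PageKind::Dict).count();
    assert!(dicts <= 1, "column chunk has {dicts} dictionary pages");
    if dicts == 1 {
        assert!(
            chunk_pages[0] == PageKind::Dict,
            "dictionary page must precede data pages"
        );
    }
}

fn main() {
    assert_one_dict_per_chunk(&[PageKind::Dict, PageKind::Data, PageKind::Data]); // ok
    assert_one_dict_per_chunk(&[PageKind::Data]); // ok: dictionary is optional
}
```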

The core fix appears sound, but these issues should be addressed before merging to production.

@ideoma ideoma closed this Jan 28, 2026


Development

Successfully merging this pull request may close these issues.

exporting symbol columns to parquet can be incompatible with strict parquet readers