
fix(sql): fix read_parquet on Parquet files with stale QDB metadata#6885

Merged
bluestreak01 merged 12 commits into master from fix-stale-qdb-parquet-metadata
Mar 25, 2026

Conversation

@nwoolmer
Contributor

@nwoolmer nwoolmer commented Mar 18, 2026

This originated from Parquet files that were exported from QuestDB and later hive-partitioned. A follow-up is in progress to recover the original (missing) columns present in the metadata from the hive paths; it overlaps with this stale PR: #6369

Summary

  • read_parquet failed with "encoding not supported, physical type: Double, encoding RleDictionary, column type ColumnType(26/Varchar)" when reading Parquet files whose embedded QDB metadata schema had a different number of columns than the actual Parquet data
  • This happens when an external tool (DuckDB, Spark, PyArrow) rewrites a QuestDB-exported Parquet file — e.g. dropping partition columns for hive-style directory layouts — but preserves the original key-value metadata
  • The decoder now compares the QDB metadata schema length against the Parquet column count and discards the metadata when they differ, falling back to physical type inference

Test plan

  • SQL logic test stale_qdb_metadata.test reads a 2-column Parquet file (stale_qdb_meta.parquet) whose footer contains stale 3-column QDB metadata, and verifies read_parquet discards the mismatched metadata, infers types from Parquet physical types, and returns correct data
  • Reverting the one-line fix and rebuilding the Rust binary reproduces the original error: "encoding not supported, physical type: Double, encoding RleDictionary, column type ColumnType(40/VarcharSlice)"
  • All existing parquet SQL logic tests pass (46 tests, parallel and sequential)
  • All existing Rust unit and integration tests pass

🤖 Generated with Claude Code

@nwoolmer nwoolmer added the Bug (Incorrect or unexpected behavior), SQL (Issues or changes relating to SQL execution), and rust (Pull requests that update rust code) labels Mar 18, 2026
@coderabbitai

coderabbitai bot commented Mar 18, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a01ccb10-82ab-438c-90d5-5b719dde7123

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


When an external tool (DuckDB, Spark, PyArrow) rewrites a
QuestDB-exported Parquet file — e.g. dropping partition columns for
hive-style directory layouts — it may preserve the original key-value
metadata while changing the actual column set. The embedded QDB
metadata schema then has more entries than Parquet columns, causing
positional index lookups to return wrong types.

This produced errors like:
  "encoding not supported, physical type: Double, encoding
  RleDictionary, ... column type ColumnType(26/Varchar)"

The fix compares the QDB metadata schema length against the Parquet
column count and discards the metadata when they differ, falling
back to physical type inference.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
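The positional-lookup failure the commit message describes can be sketched as follows (a toy illustration; the `ColType` enum and values here are stand-ins, not QuestDB's real type codes):

```rust
// Illustration of the positional-lookup bug: stale 3-column QDB metadata
// applied positionally to a 2-column file maps the physical Double column
// (index 1) to Varchar, producing the decode error quoted above.
#[derive(Debug, PartialEq)]
enum ColType {
    Timestamp,
    Varchar,
    Double,
}

fn main() {
    // Stale metadata still lists the dropped partition column.
    let stale_meta = [ColType::Timestamp, ColType::Varchar, ColType::Double];
    // The rewritten Parquet file has only two physical columns.
    let physical = [ColType::Timestamp, ColType::Double];

    // Positional lookup at index 1: physically Double, but the stale
    // metadata claims Varchar -- the mismatch behind the decode error.
    assert_eq!(physical[1], ColType::Double);
    assert_eq!(stale_meta[1], ColType::Varchar);
    println!("ok");
}
```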
@nwoolmer nwoolmer force-pushed the fix-stale-qdb-parquet-metadata branch from 335eba0 to 691543b Compare March 18, 2026 19:21
@bluestreak01
Member

@nwoolmer

Code Review

Critical

1. Test never actually injects stale metadata — passes vacuously
compat/src/test/java/io/questdb/compat/ParquetStaleMetadataTest.java:243-259

The 3-column stale metadata JSON is always longer than the 2-column original JSON:

  • Original (2 cols): {"version":1,"schema":[{"column_type":775,"column_top":0},{"column_type":10,"column_top":0}]} (~93 chars)
  • Stale (3 cols): {"version":1,"schema":[{"column_type":775,"column_top":0},{"column_type":26,"column_top":0},{"column_type":10,"column_top":0}]} (~127 chars)

The padding loop at line 249 only runs when staleMetaJson.length() < originalJson.length() (stale is shorter), which is the opposite of this case. The stale string is ~34 chars longer, so the loop body never executes. Then line 254 fires: staleMetaJson.length() > originalJson.length() is true, and the method returns at line 258 without injecting anything. The test then reads an unmodified parquet file, which read_parquet handles correctly regardless of the fix.

Additionally, if the HTTP /exp?fmt=parquet path does not embed QDB metadata (the BaseParquetExporter/HTTPSerialParquetExporter classes show no reference to QDB_META_KEY or QDB metadata), the first early-return at line 210-214 fires even before reaching the length comparison.

The test passes on CI today but never exercises the bug scenario. It would pass identically with or without the Rust fix.

Suggested fix: Reverse the injection direction — start with the 3-column file (which has longer QDB metadata) and replace it with a 2-column stale schema (shorter, so padding with spaces works). Or use QuestDB's PartitionEncoder directly to produce a file with controlled metadata, or write a Rust-level unit test in meta.rs that constructs a FileMetaData with a mismatched QDB schema.
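The reversed injection suggested above can be sketched like this (an illustrative helper, not the actual test code; it assumes the metadata parser tolerates trailing spaces after the closing JSON brace):

```rust
// Sketch of the reversed injection: replace the longer original QDB
// metadata JSON with a shorter stale schema, space-padded to the same
// byte length so the Parquet footer offsets remain valid.
// All names here are illustrative, not the actual test helpers.
fn build_padded_replacement(original: &str, stale: &str) -> Option<String> {
    if stale.len() > original.len() {
        // Cannot inject without shifting footer offsets; the caller
        // should fail the test loudly rather than silently return.
        return None;
    }
    let mut padded = String::with_capacity(original.len());
    padded.push_str(stale);
    // Pad with spaces so the replacement occupies exactly the same bytes.
    padded.extend(std::iter::repeat(' ').take(original.len() - stale.len()));
    Some(padded)
}

fn main() {
    let original = r#"{"version":1,"schema":[{"a":1},{"b":2},{"c":3}]}"#;
    let stale = r#"{"version":1,"schema":[{"a":1},{"b":2}]}"#;
    let replacement = build_padded_replacement(original, stale).unwrap();
    assert_eq!(replacement.len(), original.len());
    assert!(replacement.starts_with(stale));
    println!("ok");
}
```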

2. Test assertions are too weak — only checks non-null
compat/src/test/java/io/questdb/compat/ParquetStaleMetadataTest.java:168

Assert.assertNotNull("read_parquet should return data", result);

Even if injection succeeded, this only verifies the HTTP response is non-null. It does not verify:

  • Column count is 2 (not 3)
  • Column types are correct (TIMESTAMP, DOUBLE — not TIMESTAMP, VARCHAR, DOUBLE)
  • Actual data values

The entire point of the fix is that stale 3-column metadata causes Double data to be decoded as VARCHAR. The assertion must verify the schema and data are correct.

Moderate

3. Silent early returns should be test failures
compat/src/test/java/io/questdb/compat/ParquetStaleMetadataTest.java:213,257

Both early-return paths (qdbIdx < 0 at line 213, and metadata too long at line 257) log a message but silently return success. These should be Assert.fail("...") to ensure the test fails loudly if the metadata injection setup is broken, rather than producing false green CI.

4. Missing test for the reverse mismatch direction

The Rust fix discards QDB metadata whenever m.schema.len() != col_len. This covers both "more metadata columns than data columns" and "fewer metadata columns than data columns". Only the former is tested (attempted). The latter scenario (external tool adds columns to the Parquet data but preserves original narrower metadata) should also be tested.

5. No assertMemoryLeak() wrapper

The test allocates native memory through PartitionEncoder, PartitionDescriptor, TableReader, and Path. Per coding standards, tests that allocate native memory should use assertMemoryLeak(). This test would also be more robust as a unit test in ReadParquetFunctionTest where assertMemoryLeak() and assertQueryNoLeakCheck() are standard, and the binary-patching approach becomes unnecessary.

Minor

6. SQL statements use string concatenation instead of text blocks (""")

7. PR title is 71 characters (convention says under 70). Could shorten to fix(sql): fix read_parquet on Parquet files with stale QDB metadata (68 chars).

Production code verdict

The Rust fix itself is correct, safe, and minimal:

let qdb_meta = qdb_meta.filter(|m| m.schema.len() == col_len);
  • All downstream code handles None gracefully, falling back to Parquet-native type inference
  • No panic risk, no resource leak, no concurrency issue
  • No false positives on valid QDB-written files (schema length always matches for QDB-authored Parquet)
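The guard can be demonstrated in isolation with `Option::filter` (a minimal sketch; `QdbMeta`, `validate`, and the numeric type codes are stand-ins for the real definitions in meta.rs):

```rust
// Minimal stand-in for the QDB key-value metadata parsed from the
// Parquet footer; the real struct lives in qdbr/src/parquet_read/meta.rs.
struct QdbMeta {
    schema: Vec<u32>, // one entry per column in the QDB schema
}

// Keep the QDB metadata only when its schema length matches the number
// of columns actually present in the Parquet file; otherwise return
// None, which downstream code treats as "infer from physical types".
fn validate(qdb_meta: Option<QdbMeta>, col_len: usize) -> Option<QdbMeta> {
    qdb_meta.filter(|m| m.schema.len() == col_len)
}

fn main() {
    // Stale case: 3-column metadata over a 2-column file is discarded.
    let stale = QdbMeta { schema: vec![131080, 26, 10] };
    assert!(validate(Some(stale), 2).is_none());

    // Valid case: lengths match, metadata is kept.
    let fresh = QdbMeta { schema: vec![131080, 10] };
    assert!(validate(Some(fresh), 2).is_some());

    println!("ok");
}
```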

Summary

Verdict: Request changes — the production fix is correct, but the test does not exercise it. The test needs to be rewritten to guarantee metadata injection succeeds and to assert on the actual query results (column count, types, values).

nwoolmer and others added 2 commits March 18, 2026 20:04
The original test was a no-op: the stale 3-column JSON was always
longer than the original 2-column JSON, so the binary replacement
silently returned without modifying the file.

Flip the approach: export a 3-column file (3-column QDB metadata),
then replace the metadata with a shorter 2-column version padded to
the same byte length. The schema length mismatch (2 != 3) triggers
the fix, which discards the stale metadata and falls back to
physical type inference.

Also fix:
- Wrong column type constant (775 -> ColumnType.TIMESTAMP | 1<<17)
- Replace silent early returns with Assert.fail()
- Add real assertions on column names, data values, and column count
- Add comment documenting the leaf-count assumption in meta.rs

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@nwoolmer nwoolmer changed the title from "fix(sql): fix read_parquet failure on Parquet files with stale QDB metadata" to "fix(sql): fix read_parquet on Parquet files with stale QDB metadata" Mar 18, 2026
@nwoolmer
Contributor Author

@bluestreak01 All findings addressed in f3e99a01 and b1415caa9:

Critical fixes:

  1. Test no longer a no-op — flipped the injection direction. The test now exports a 3-column file (longer metadata), then replaces the 3-column QDB metadata with a shorter 2-column version padded to the same byte length. The injection succeeds reliably and the schema length mismatch (2 != 3) exercises the Rust fix.
  2. Column type constant fixed: 775 replaced with ColumnType.TIMESTAMP | (1 << 17) = 131080.
  3. Assertions strengthened — now verifies column names (ts, category, value), data value (1.5), row count ("count":10), and column count (exactly 3).
  4. Silent early returns eliminated — all guard conditions use Assert.assertTrue / Assert.assertEquals.

Moderate fixes:
  5. SQL uses multiline strings (""") per CLAUDE.md conventions.
  6. PR title shortened to 68 chars (was 71).

Downgraded findings (false positives from initial review):

  • "Missing assertMemoryLeak()" — compat tests extend AbstractTest (not AbstractCairoTest), which has no leak-checking infrastructure. No compat test uses assertMemoryLeak(). The test interacts via HTTP.
  • "Discarding metadata loses LocalKeyIsGlobal for symbols" — the scenario only triggers on externally-rewritten files where symbols would have been re-encoded as plain strings.
  • "Leaf vs field count could diverge for structs" — documented with a comment in meta.rs. No struct types exist today, and the fix degrades gracefully (falls back to physical type inference).

@ideoma ideoma self-requested a review March 24, 2026 18:10
@ideoma ideoma marked this pull request as draft March 25, 2026 13:11
@ideoma ideoma marked this pull request as ready for review March 25, 2026 13:14
@glasstiger
Contributor

[PR Coverage check]

😍 pass : 1 / 1 (100.00%)

File detail:

  path                             | covered lines | new lines | coverage
  🔵 qdbr/src/parquet_read/meta.rs | 1             | 1         | 100.00%

@bluestreak01 bluestreak01 merged commit 28d6705 into master Mar 25, 2026
45 checks passed
@bluestreak01 bluestreak01 deleted the fix-stale-qdb-parquet-metadata branch March 25, 2026 14:51