
fix(sql): fix read_parquet on Parquet files with stale QDB metadata#6885

Merged
bluestreak01 merged 12 commits into master from fix-stale-qdb-parquet-metadata
Mar 25, 2026

Conversation

@nwoolmer
Contributor

@nwoolmer nwoolmer commented Mar 18, 2026

This originated from Parquet files that were exported from QuestDB and later hive-partitioned. A follow-up is in progress to recover the original (missing) columns present in the metadata from the hive paths; it overlaps with this stale PR: #6369

Summary

  • read_parquet failed with "encoding not supported, physical type: Double, encoding RleDictionary, column type ColumnType(26/Varchar)" when reading Parquet files whose embedded QDB metadata schema had a different number of columns than the actual Parquet data
  • This happens when an external tool (DuckDB, Spark, PyArrow) rewrites a QuestDB-exported Parquet file — e.g. dropping partition columns for hive-style directory layouts — but preserves the original key-value metadata
  • The decoder now compares the QDB metadata schema length against the Parquet column count and discards the metadata when they differ, falling back to physical type inference

Test plan

  • SQL logic test stale_qdb_metadata.test reads a 2-column Parquet file (stale_qdb_meta.parquet) whose footer contains stale 3-column QDB metadata, and verifies read_parquet discards the mismatched metadata, infers types from Parquet physical types, and returns correct data
  • Reverting the one-line fix and rebuilding the Rust binary reproduces the original error: "encoding not supported, physical type: Double, encoding RleDictionary, column type ColumnType(40/VarcharSlice)"
  • All existing parquet SQL logic tests pass (46 tests, parallel and sequential)
  • All existing Rust unit and integration tests pass

🤖 Generated with Claude Code

@nwoolmer nwoolmer added the Bug (Incorrect or unexpected behavior), SQL (Issues or changes relating to SQL execution), and rust (Pull requests that update rust code) labels Mar 18, 2026
@coderabbitai

coderabbitai bot commented Mar 18, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a01ccb10-82ab-438c-90d5-5b719dde7123

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


When an external tool (DuckDB, Spark, PyArrow) rewrites a
QuestDB-exported Parquet file — e.g. dropping partition columns for
hive-style directory layouts — it may preserve the original key-value
metadata while changing the actual column set. The embedded QDB
metadata schema then has more entries than Parquet columns, causing
positional index lookups to return wrong types.

This produced errors like:
  "encoding not supported, physical type: Double, encoding
  RleDictionary, ... column type ColumnType(26/Varchar)"

The fix compares the QDB metadata schema length against the Parquet
column count and discards the metadata when they differ, falling
back to physical type inference.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
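The positional-lookup failure the commit message describes can be sketched as follows (a toy illustration; the `ColType` enum and values here are stand-ins, not QuestDB's real type codes):

```rust
// Illustration of the positional-lookup bug: stale 3-column QDB metadata
// applied positionally to a 2-column file maps the physical Double column
// (index 1) to Varchar, producing the decode error quoted above.
#[derive(Debug, PartialEq)]
enum ColType {
    Timestamp,
    Varchar,
    Double,
}

fn main() {
    // Stale metadata still lists the dropped partition column.
    let stale_meta = [ColType::Timestamp, ColType::Varchar, ColType::Double];
    // The rewritten Parquet file has only two physical columns.
    let physical = [ColType::Timestamp, ColType::Double];

    // Positional lookup at index 1: physically Double, but the stale
    // metadata claims Varchar -- the mismatch behind the decode error.
    assert_eq!(physical[1], ColType::Double);
    assert_eq!(stale_meta[1], ColType::Varchar);
    println!("ok");
}
```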
@nwoolmer nwoolmer force-pushed the fix-stale-qdb-parquet-metadata branch from 335eba0 to 691543b Compare March 18, 2026 19:21
@bluestreak01
Member

@nwoolmer

Code Review

Critical

1. Test never actually injects stale metadata — passes vacuously
compat/src/test/java/io/questdb/compat/ParquetStaleMetadataTest.java:243-259

The 3-column stale metadata JSON is always longer than the 2-column original JSON:

  • Original (2 cols): {"version":1,"schema":[{"column_type":775,"column_top":0},{"column_type":10,"column_top":0}]} (~93 chars)
  • Stale (3 cols): {"version":1,"schema":[{"column_type":775,"column_top":0},{"column_type":26,"column_top":0},{"column_type":10,"column_top":0}]} (~127 chars)

The padding loop at line 249 only runs when staleMetaJson.length() < originalJson.length() (stale is shorter), which is the opposite of this case. The stale string is ~34 chars longer, so the loop body never executes. Then line 254 fires: staleMetaJson.length() > originalJson.length() is true, and the method returns at line 258 without injecting anything. The test then reads an unmodified parquet file, which read_parquet handles correctly regardless of the fix.

Additionally, if the HTTP /exp?fmt=parquet path does not embed QDB metadata (the BaseParquetExporter/HTTPSerialParquetExporter classes show no reference to QDB_META_KEY or QDB metadata), the first early-return at line 210-214 fires even before reaching the length comparison.

The test passes on CI today but never exercises the bug scenario. It would pass identically with or without the Rust fix.

Suggested fix: Reverse the injection direction — start with the 3-column file (which has longer QDB metadata) and replace it with a 2-column stale schema (shorter, so padding with spaces works). Or use QuestDB's PartitionEncoder directly to produce a file with controlled metadata, or write a Rust-level unit test in meta.rs that constructs a FileMetaData with a mismatched QDB schema.
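The reversed injection suggested above can be sketched like this (an illustrative helper, not the actual test code; it assumes the metadata parser tolerates trailing spaces after the closing JSON brace):

```rust
// Sketch of the reversed injection: replace the longer original QDB
// metadata JSON with a shorter stale schema, space-padded to the same
// byte length so the Parquet footer offsets remain valid.
// All names here are illustrative, not the actual test helpers.
fn build_padded_replacement(original: &str, stale: &str) -> Option<String> {
    if stale.len() > original.len() {
        // Cannot inject without shifting footer offsets; the caller
        // should fail the test loudly rather than silently return.
        return None;
    }
    let mut padded = String::with_capacity(original.len());
    padded.push_str(stale);
    // Pad with spaces so the replacement occupies exactly the same bytes.
    padded.extend(std::iter::repeat(' ').take(original.len() - stale.len()));
    Some(padded)
}

fn main() {
    let original = r#"{"version":1,"schema":[{"a":1},{"b":2},{"c":3}]}"#;
    let stale = r#"{"version":1,"schema":[{"a":1},{"b":2}]}"#;
    let replacement = build_padded_replacement(original, stale).unwrap();
    assert_eq!(replacement.len(), original.len());
    assert!(replacement.starts_with(stale));
    println!("ok");
}
```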

2. Test assertions are too weak — only checks non-null
compat/src/test/java/io/questdb/compat/ParquetStaleMetadataTest.java:168

Assert.assertNotNull("read_parquet should return data", result);

Even if injection succeeded, this only verifies the HTTP response is non-null. It does not verify:

  • Column count is 2 (not 3)
  • Column types are correct (TIMESTAMP, DOUBLE — not TIMESTAMP, VARCHAR, DOUBLE)
  • Actual data values

The entire point of the fix is that stale 3-column metadata causes Double data to be decoded as VARCHAR. The assertion must verify the schema and data are correct.

Moderate

3. Silent early returns should be test failures
compat/src/test/java/io/questdb/compat/ParquetStaleMetadataTest.java:213,257

Both early-return paths (qdbIdx < 0 at line 213, and metadata too long at line 257) log a message but silently return success. These should be Assert.fail("...") to ensure the test fails loudly if the metadata injection setup is broken, rather than producing false green CI.

4. Missing test for the reverse mismatch direction

The Rust fix discards QDB metadata whenever m.schema.len() != col_len. This covers both "more metadata columns than data columns" and "fewer metadata columns than data columns". Only the former is tested (attempted). The latter scenario (external tool adds columns to the Parquet data but preserves original narrower metadata) should also be tested.

5. No assertMemoryLeak() wrapper

The test allocates native memory through PartitionEncoder, PartitionDescriptor, TableReader, and Path. Per coding standards, tests that allocate native memory should use assertMemoryLeak(). This test would also be more robust as a unit test in ReadParquetFunctionTest where assertMemoryLeak() and assertQueryNoLeakCheck() are standard, and the binary-patching approach becomes unnecessary.

Minor

6. SQL statements use string concatenation instead of text blocks (""")

7. PR title is 71 characters (convention says under 70). Could shorten to fix(sql): fix read_parquet on Parquet files with stale QDB metadata (68 chars).

Production code verdict

The Rust fix itself is correct, safe, and minimal:

let qdb_meta = qdb_meta.filter(|m| m.schema.len() == col_len);
  • All downstream code handles None gracefully, falling back to Parquet-native type inference
  • No panic risk, no resource leak, no concurrency issue
  • No false positives on valid QDB-written files (schema length always matches for QDB-authored Parquet)
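The guard can be demonstrated in isolation with `Option::filter` (a minimal sketch; `QdbMeta`, `validate`, and the numeric type codes are stand-ins for the real definitions in meta.rs):

```rust
// Minimal stand-in for the QDB key-value metadata parsed from the
// Parquet footer; the real struct lives in qdbr/src/parquet_read/meta.rs.
struct QdbMeta {
    schema: Vec<u32>, // one entry per column in the QDB schema
}

// Keep the QDB metadata only when its schema length matches the number
// of columns actually present in the Parquet file; otherwise return
// None, which downstream code treats as "infer from physical types".
fn validate(qdb_meta: Option<QdbMeta>, col_len: usize) -> Option<QdbMeta> {
    qdb_meta.filter(|m| m.schema.len() == col_len)
}

fn main() {
    // Stale case: 3-column metadata over a 2-column file is discarded.
    let stale = QdbMeta { schema: vec![131080, 26, 10] };
    assert!(validate(Some(stale), 2).is_none());

    // Valid case: lengths match, metadata is kept.
    let fresh = QdbMeta { schema: vec![131080, 10] };
    assert!(validate(Some(fresh), 2).is_some());

    println!("ok");
}
```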

Summary

Verdict: Request changes — the production fix is correct, but the test does not exercise it. The test needs to be rewritten to guarantee metadata injection succeeds and to assert on the actual query results (column count, types, values).

nwoolmer and others added 2 commits March 18, 2026 20:04
The original test was a no-op: the stale 3-column JSON was always
longer than the original 2-column JSON, so the binary replacement
silently returned without modifying the file.

Flip the approach: export a 3-column file (3-column QDB metadata),
then replace the metadata with a shorter 2-column version padded to
the same byte length. The schema length mismatch (2 != 3) triggers
the fix, which discards the stale metadata and falls back to
physical type inference.

Also fix:
- Wrong column type constant (775 -> ColumnType.TIMESTAMP | 1<<17)
- Replace silent early returns with Assert.fail()
- Add real assertions on column names, data values, and column count
- Add comment documenting the leaf-count assumption in meta.rs

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@nwoolmer nwoolmer changed the title from "fix(sql): fix read_parquet failure on Parquet files with stale QDB metadata" to "fix(sql): fix read_parquet on Parquet files with stale QDB metadata" Mar 18, 2026
@nwoolmer
Contributor Author

@bluestreak01 All findings addressed in f3e99a01 and b1415caa9:

Critical fixes:

  1. Test no longer a no-op — flipped the injection direction. The test now exports a 3-column file (longer metadata), then replaces the 3-column QDB metadata with a shorter 2-column version padded to the same byte length. The injection succeeds reliably and the schema length mismatch (2 != 3) exercises the Rust fix.
  2. Column type constant fixed: 775 replaced with ColumnType.TIMESTAMP | (1 << 17) = 131080.
  3. Assertions strengthened — now verifies column names (ts, category, value), data value (1.5), row count ("count":10), and column count (exactly 3).
  4. Silent early returns eliminated — all guard conditions use Assert.assertTrue / Assert.assertEquals.

Moderate fixes:
  5. SQL uses multiline strings (""") per CLAUDE.md conventions.
  6. PR title shortened to 68 chars (was 71).

Downgraded findings (false positives from initial review):

  • "Missing assertMemoryLeak()" — compat tests extend AbstractTest (not AbstractCairoTest), which has no leak-checking infrastructure. No compat test uses assertMemoryLeak(). The test interacts via HTTP.
  • "Discarding metadata loses LocalKeyIsGlobal for symbols" — the scenario only triggers on externally-rewritten files where symbols would have been re-encoded as plain strings.
  • "Leaf vs field count could diverge for structs" — documented with a comment in meta.rs. No struct types exist today, and the fix degrades gracefully (falls back to physical type inference).

@ideoma ideoma self-requested a review March 24, 2026 18:10
@ideoma ideoma marked this pull request as draft March 25, 2026 13:11
@ideoma ideoma marked this pull request as ready for review March 25, 2026 13:14
@glasstiger
Contributor

[PR Coverage check]

😍 pass : 1 / 1 (100.00%)

File detail:

  path                             | covered lines | new lines | coverage
  🔵 qdbr/src/parquet_read/meta.rs | 1             | 1         | 100.00%

@bluestreak01 bluestreak01 merged commit 28d6705 into master Mar 25, 2026
45 checks passed
@bluestreak01 bluestreak01 deleted the fix-stale-qdb-parquet-metadata branch March 25, 2026 14:51