fix(sql): fix read_parquet on Parquet files with stale QDB metadata #6885

bluestreak01 merged 12 commits into master from …
Conversation
When an external tool (DuckDB, Spark, PyArrow) rewrites a QuestDB-exported Parquet file — e.g. dropping partition columns for hive-style directory layouts — it may preserve the original key-value metadata while changing the actual column set. The embedded QDB metadata schema then has more entries than Parquet columns, causing positional index lookups to return wrong types. This produced errors like:

"encoding not supported, physical type: Double, encoding RleDictionary, ... column type ColumnType(26/Varchar)"

The fix compares the QDB metadata schema length against the Parquet column count and discards the metadata when they differ, falling back to physical type inference.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
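A minimal Python model of the check described above. The function and type names here are illustrative only, not QuestDB's actual API; the point is the length comparison and the fallback path:

```python
# Hypothetical model of the stale-metadata check: the embedded QDB schema
# is trusted only when its entry count matches the Parquet column count.

def resolve_column_types(qdb_schema, parquet_columns):
    """Return per-column types, ignoring stale QDB metadata.

    qdb_schema: list of QDB type names from the embedded key-value
                metadata, or None when no metadata is present.
    parquet_columns: list of (name, physical_type) pairs from the footer.
    """
    if qdb_schema is not None and len(qdb_schema) == len(parquet_columns):
        # Counts agree: positional lookup into the QDB schema is safe.
        return list(qdb_schema)
    # Stale or absent metadata: fall back to physical type inference
    # (mapping shown here is a simplified stand-in).
    fallback = {"DOUBLE": "DOUBLE", "INT64": "LONG", "BYTE_ARRAY": "VARCHAR"}
    return [fallback.get(phys, "VARCHAR") for _, phys in parquet_columns]

# A stale 3-entry schema against a 2-column file is discarded:
cols = [("price", "DOUBLE"), ("ts", "INT64")]
stale = ["VARCHAR", "DOUBLE", "TIMESTAMP"]
print(resolve_column_types(stale, cols))  # ['DOUBLE', 'LONG']
```

Without the length guard, the stale schema would be indexed positionally and `price` (physically a Double) would be decoded as VARCHAR — the exact error quoted above.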
Force-pushed: 335eba0 to 691543b
Code Review

Critical

1. Test never actually injects stale metadata — passes vacuously

The 3-column stale metadata JSON is always longer than the 2-column original JSON: […]
The padding loop at line 249 only runs when […]. Additionally, if the HTTP […]. The test passes on CI today but never exercises the bug scenario. It would pass identically with or without the Rust fix.

Suggested fix: Reverse the injection direction — start with the 3-column file (which has longer QDB metadata) and replace it with a 2-column stale schema (shorter, so padding with spaces works). Or use QuestDB's […].

2. Test assertions are too weak — only checks non-null

Assert.assertNotNull("read_parquet should return data", result);

Even if injection succeeded, this only verifies the HTTP response is non-null. It does not verify: […]
The entire point of the fix is that stale 3-column metadata causes Double data to be decoded as VARCHAR. The assertion must verify the schema and data are correct.

Moderate

3. Silent early returns should be test failures

Both early-return paths ([…]).

4. Missing test for the reverse mismatch direction

The Rust fix discards QDB metadata whenever […].

5. No […]

The test allocates native memory through […].

Minor

6. SQL statements use string concatenation instead of text blocks ([…]).

7. PR title is 71 characters (convention says under 70). Could shorten to […].

Production code verdict

The Rust fix itself is correct, safe, and minimal:

let qdb_meta = qdb_meta.filter(|m| m.schema.len() == col_len);
Summary

Verdict: Request changes — the production fix is correct, but the test does not exercise it. The test needs to be rewritten to guarantee metadata injection succeeds and to assert on the actual query results (column count, types, values).
The original test was a no-op: the stale 3-column JSON was always longer than the original 2-column JSON, so the binary replacement silently returned without modifying the file.

Flip the approach: export a 3-column file (3-column QDB metadata), then replace the metadata with a shorter 2-column version padded to the same byte length. The schema length mismatch (2 != 3) triggers the fix, which discards the stale metadata and falls back to physical type inference.

Also fix:
- Wrong column type constant (775 -> ColumnType.TIMESTAMP | 1<<17)
- Replace silent early returns with Assert.fail()
- Add real assertions on column names, data values, and column count
- Add comment documenting the leaf-count assumption in meta.rs

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
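The same-byte-length replacement trick the commit describes can be sketched in Python. The blob layout and JSON shapes here are illustrative stand-ins, not the real Parquet footer encoding:

```python
def replace_metadata_in_place(blob: bytes, old_meta: bytes,
                              new_meta: bytes) -> bytes:
    """Overwrite old_meta with new_meta padded to the same byte length.

    Padding with trailing spaces keeps every footer offset valid, so the
    file stays structurally intact; only the JSON payload changes. The
    replacement must shrink, never grow: the original test hit the
    opposite case (replacement longer than original) and silently left
    the file unchanged, so this sketch fails loudly instead.
    """
    if len(new_meta) > len(old_meta):
        raise ValueError("replacement metadata longer than original")
    padded = new_meta + b" " * (len(old_meta) - len(new_meta))
    idx = blob.find(old_meta)
    if idx < 0:
        raise ValueError("original metadata not found")
    return blob[:idx] + padded + blob[idx + len(old_meta):]

# 3-column metadata replaced by a shorter, padded 2-column version;
# the overall byte length (and hence all offsets) is preserved.
blob = b'HEADER{"schema":["a","b","c"]}FOOTER'
out = replace_metadata_in_place(blob, b'{"schema":["a","b","c"]}',
                                b'{"schema":["a","b"]}')
assert len(out) == len(blob)
```

Most JSON parsers ignore trailing whitespace, which is what makes space-padding a safe way to shorten an embedded JSON payload without shifting any byte offsets.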
@bluestreak01 All findings addressed in […].

Critical fixes: […]
Moderate fixes: […]

Downgraded findings (false positives from initial review): […]
[PR Coverage check] 😍 pass: 1 / 1 (100.00%)
This originated from Parquet files being exported from QuestDB and later hive-partitioned. A follow-up is in progress to recover the original (missing) columns, which are still listed in the metadata, from the hive paths; it overlaps with this stale-metadata PR: #6369
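The follow-up idea (recovering dropped partition columns from hive-style paths) amounts to parsing key=value directory segments. A rough Python sketch, with the function name and behavior assumed for illustration rather than taken from the follow-up PR:

```python
import urllib.parse

def partition_columns_from_path(path: str) -> dict:
    """Extract hive-style partition columns (key=value dirs) from a path.

    In a hive layout like trades/year=2024/month=05/part-0.parquet, the
    year and month columns exist only in the directory names, so a reader
    can reconstruct them per file from the path.
    """
    cols = {}
    for seg in path.split("/"):
        # Skip the data file itself; only directory segments carry columns.
        if "=" in seg and not seg.endswith(".parquet"):
            key, _, value = seg.partition("=")
            cols[key] = urllib.parse.unquote(value)  # values are URL-encoded
    return cols

print(partition_columns_from_path(
    "exports/trades/year=2024/month=05/part-0.parquet"))
# {'year': '2024', 'month': '05'}
```

This is exactly the scenario that produced the stale metadata in the first place: the partition columns were dropped from the Parquet data but remained in the embedded QDB schema.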
Summary
- read_parquet failed with "encoding not supported, physical type: Double, encoding RleDictionary, column type ColumnType(26/Varchar)" when reading Parquet files whose embedded QDB metadata schema had a different number of columns than the actual Parquet data

Test plan
- The stale_qdb_metadata test reads a 2-column Parquet file (stale_qdb_meta.parquet) whose footer contains stale 3-column QDB metadata, and verifies read_parquet discards the mismatched metadata, infers types from Parquet physical types, and returns correct data

🤖 Generated with Claude Code