fix(core): fix read_parquet() crash on SYMBOL columns from native parquet files #6865
bluestreak01 merged 6 commits into master
Conversation
canProjectMetadata() passed the parquet file's actual column type (SYMBOL) to the Rust decoder instead of the expected type (VARCHAR) when converting SYMBOL to VARCHAR. The Rust decoder then used the SYMBOL decode path, which writes INT32 symbol keys into the output buffer. The Java side expected VARCHAR data (UTF-8 string pointers), causing a segfault.

Pass expectedType (VARCHAR) for SYMBOL-to-VARCHAR conversions so the Rust decoder selects the correct BaseVarDictDecoder path, which resolves dictionary entries to UTF-8 strings.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
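As a rough illustration of the failure mode (the class, method, and enum names below are invented for this sketch, not QuestDB's actual API): a dictionary-backed decoder picks its output format from the type code it receives, so passing the file's stored type (SYMBOL) makes it emit raw INT32 dictionary keys where the caller expects resolved UTF-8 strings.

```java
import java.util.List;

// Illustrative sketch only: all names here are hypothetical, not QuestDB's real API.
public class DecodeDispatch {
    enum ColumnType { SYMBOL, VARCHAR }

    // A dictionary-encoded column stores INT32 keys plus a dictionary of strings.
    // The output format depends entirely on the type code passed in: SYMBOL
    // returns the raw keys, VARCHAR resolves each key through the dictionary.
    static Object decode(ColumnType requestedType, int[] keys, List<String> dict) {
        if (requestedType == ColumnType.SYMBOL) {
            return keys; // INT32 keys: treating these as string pointers segfaults
        }
        String[] out = new String[keys.length];
        for (int i = 0; i < keys.length; i++) {
            out[i] = dict.get(keys[i]); // resolved UTF-8 strings
        }
        return out;
    }
}
```

The fix described above amounts to handing the decoder the expected type (VARCHAR) rather than the stored type (SYMBOL) at this dispatch point.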
Important: Review skipped — auto reviews are disabled on this repository; check the settings in the CodeRabbit UI or the ⚙️ Run configuration.
Walkthrough: Adjusts Parquet metadata/type handling for SYMBOL→VARCHAR promotion, fixes 64-bit masking when composing column metadata for exports, and adds a test validating native SYMBOL column decoding via read_parquet().
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
@coderabbitai review

✅ Actions performed: Review triggered.
🧹 Nitpick comments (1)
core/src/test/java/io/questdb/test/griffin/engine/table/parquet/ReadParquetFunctionTest.java (1)
74-77: Consolidate INSERT statements into a single multi-row INSERT. As per coding guidelines, use a single INSERT statement to insert multiple rows in tests.
♻️ Proposed fix

```diff
- execute("CREATE TABLE x (id SYMBOL, val INT, ts TIMESTAMP) TIMESTAMP(ts) PARTITION BY DAY");
- execute("INSERT INTO x VALUES ('AAA', 1, '2024-01-01T00:00:00.000000Z')");
- execute("INSERT INTO x VALUES ('BBB', 2, '2024-01-01T01:00:00.000000Z')");
- execute("INSERT INTO x VALUES ('AAA', 3, '2024-01-01T02:00:00.000000Z')");
+ execute("CREATE TABLE x (id SYMBOL, val INT, ts TIMESTAMP) TIMESTAMP(ts) PARTITION BY DAY");
+ execute(
+     """
+     INSERT INTO x VALUES
+         ('AAA', 1, '2024-01-01T00:00:00.000000Z'),
+         ('BBB', 2, '2024-01-01T01:00:00.000000Z'),
+         ('AAA', 3, '2024-01-01T02:00:00.000000Z')
+     """
+ );
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@core/src/test/java/io/questdb/test/griffin/engine/table/parquet/ReadParquetFunctionTest.java` around lines 74 - 77, Combine the three separate execute("INSERT INTO x VALUES ...") calls into a single multi-row INSERT by replacing the three execute calls that insert ('AAA', 1, ...), ('BBB', 2, ...), ('AAA', 3, ...) with one execute("INSERT INTO x VALUES ('AAA', 1, '2024-01-01T00:00:00.000000Z'), ('BBB', 2, '2024-01-01T01:00:00.000000Z'), ('AAA', 3, '2024-01-01T02:00:00.000000Z')"); keep the existing execute method and the surrounding CREATE TABLE call intact so only the INSERTs are consolidated in ReadParquetFunctionTest.
ℹ️ Review info

⚙️ Run configuration
- Configuration used: Path: .coderabbit.yaml
- Review profile: CHILL
- Plan: Pro
- Run ID: 15a4d520-7528-48cc-87ae-2e02faed6eeb

📒 Files selected for processing (2)
- core/src/main/java/io/questdb/griffin/engine/functions/table/ReadParquetRecordCursor.java
- core/src/test/java/io/questdb/test/griffin/engine/table/parquet/ReadParquetFunctionTest.java
@coderabbitai review

✅ Actions performed: Review triggered.
StreamingParquetBenchmarkTest has the same bit-packing pattern as CopyExportRequestTask: when symbolColumnType has bit 31 set (no-null flag), Java sign-extends it to 64 bits before the OR, clobbering the writerIdx in the upper 32 bits. Apply the same 0xFFFFFFFFL mask. Co-Authored-By: Claude Opus 4.6 <[email protected]>
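A minimal sketch of the sign-extension hazard described in this commit (method and constant names are hypothetical, not QuestDB's code): when an int with bit 31 set is OR-ed into a long, Java sign-extends it first, filling the upper 32 bits with 1s and clobbering whatever was packed there.

```java
// Hypothetical names; demonstrates the masking fix, not QuestDB's actual code.
public class ColumnMetaMask {
    static final int NO_NULL_FLAG = 1 << 31; // bit 31 set => negative as an int

    // Buggy: symbolColumnType is sign-extended to 64 bits before the OR,
    // so the upper 32 bits become all 1s and overwrite writerIdx.
    static long composeBuggy(long writerIdx, int symbolColumnType) {
        return (writerIdx << 32) | symbolColumnType;
    }

    // Fixed: mask to the low 32 bits first, as the commit does with 0xFFFFFFFFL.
    static long composeFixed(long writerIdx, int symbolColumnType) {
        return (writerIdx << 32) | (symbolColumnType & 0xFFFFFFFFL);
    }
}
```

With the mask applied, the upper 32 bits hold writerIdx intact while the lower 32 bits carry the full type word, flag bit included.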
core/src/main/java/io/questdb/griffin/engine/functions/table/ReadParquetRecordCursor.java (review thread resolved)
[PR Coverage check] 😍 pass: 3 / 3 (100.00%)
Summary
- read_parquet() crashes with SIGSEGV when reading parquet files containing SYMBOL columns encoded by QuestDB's PartitionEncoder
- canProjectMetadata() passed the actual column type (SYMBOL) to the Rust decoder instead of the expected type (VARCHAR), causing the Rust decoder to write INT32 symbol keys that Java then read as VARCHAR pointers

Test plan
- Adds testNativeSymbolColumnReadBack, which creates a table with SYMBOL columns, encodes it to parquet via PartitionEncoder, and reads it back with read_parquet()

🤖 Generated with Claude Code