feat(sql): support array column type in parquet partitions#5925

Merged
bluestreak01 merged 87 commits into master from puzpuzpuz_arrays_in_parquet
Aug 15, 2025

@puzpuzpuz
Contributor

@puzpuzpuz puzpuzpuz commented Jul 9, 2025

Adds array column type support for table partitions in Apache Parquet format. This means that tables with array columns can now be converted to and from Parquet format.

CREATE TABLE x ( arr DOUBLE[], ts TIMESTAMP ) TIMESTAMP(ts) PARTITION BY DAY;

INSERT INTO x VALUES (ARRAY[1, 2, 3], '2000-01-01T00:00');
-- create a new latest partition (this partition won't be converted to Parquet)
INSERT INTO x VALUES (ARRAY[1, 2, 3], '2025-01-01T00:00');

-- convert the older partition to Parquet
ALTER TABLE x CONVERT PARTITION TO PARQUET where ts in '2000';

-- data from all partitions can be queried
SELECT * FROM x WHERE arr[1] = 1;

By default, arrays are exported as Parquet lists of double values, and the field layout follows the Parquet requirements for lists. As a more lightweight alternative that is less compatible with third-party software, arrays can be exported in the native binary format, i.e. as byte arrays. To enable this, set the cairo.partition.encoder.parquet.raw.array.encoding.enabled=true configuration property.
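
To make the default list export more concrete, here is a minimal Python sketch of how a column of 1D arrays could be shredded into Parquet repetition and definition levels. It assumes the standard three-level list layout with an optional outer group and optional elements. The nullability that QuestDB's writer actually uses is not stated in this description, and `shred_1d` is a hypothetical helper, not part of QuestDB.

```python
from typing import List, Optional, Tuple

# One row of the column: NULL, an empty array, or an array of doubles.
Row = Optional[List[Optional[float]]]


def shred_1d(rows: List[Row]) -> Tuple[List[int], List[int], List[float]]:
    """Shred rows into (repetition levels, definition levels, values).

    Assumed schema (illustrative only):
        optional group arr (LIST) {
          repeated group list {
            optional double element;
          }
        }
    Definition levels: 0 = NULL array, 1 = empty array,
                       2 = NULL element, 3 = present element.
    Repetition levels: 0 = first entry of a row, 1 = next element of the same array.
    """
    rep_levels: List[int] = []
    def_levels: List[int] = []
    values: List[float] = []
    for row in rows:
        if row is None:
            rep_levels.append(0)
            def_levels.append(0)
        elif len(row) == 0:
            rep_levels.append(0)
            def_levels.append(1)
        else:
            for i, value in enumerate(row):
                rep_levels.append(0 if i == 0 else 1)
                if value is None:
                    def_levels.append(2)
                else:
                    def_levels.append(3)
                    values.append(value)
    return rep_levels, def_levels, values


rep, defs, vals = shred_1d([[1.0, 2.0, 3.0], None, [], [4.0]])
```

Only present elements end up in the values buffer; NULL and empty arrays are represented purely by their levels. The raw encoding skips this shredding and stores each array as a single opaque byte array value.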

This PR also includes the following:

  • Tests to verify that the read_parquet() SQL function can read DuckDB-generated arrays (lists).
  • The designated timestamp column is now exported with required repetition. The sorting column index is also fixed to be Parquet file-local (previously, the table writer index was used).
  • Decoded Parquet metadata now includes QDB column indexes when they are present. Before this fix, Parquet file-local indexes were always returned as column ids.
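
The difference between the two kinds of column index can be illustrated with a small Python sketch. It assumes that a Parquet file stores only a subset of the table's columns (for example, because some were dropped), so a table writer index and a file-local position can diverge. `file_local_index` and the sample indexes are hypothetical and only illustrate the idea; they are not taken from the QuestDB code.

```python
from typing import Sequence


def file_local_index(writer_indexes_in_file: Sequence[int], writer_index: int) -> int:
    """Return a column's position inside the Parquet file, given its table writer index."""
    try:
        return list(writer_indexes_in_file).index(writer_index)
    except ValueError:
        raise KeyError(
            f"column with writer index {writer_index} is not in the file"
        ) from None


# Hypothetical table: the file holds the columns with writer indexes 0, 2 and 5.
columns_in_file = [0, 2, 5]
ts_writer_index = 5  # designated timestamp, as seen by the table writer

# The sorting column recorded in the file must use the file-local position.
sorting_column = file_local_index(columns_in_file, ts_writer_index)
```

In this example the writer index 5 would point past the end of a three-column file, while the file-local index 2 correctly identifies the timestamp column.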

@puzpuzpuz puzpuzpuz self-assigned this Jul 9, 2025
@puzpuzpuz puzpuzpuz added the SQL Issues or changes relating to SQL execution label Jul 9, 2025
@questdb questdb deleted a comment from coderabbitai bot Jul 31, 2025
@questdb questdb deleted a comment from coderabbitai bot Jul 31, 2025
@coderabbitai

coderabbitai bot commented Aug 1, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Adds Parquet array support end-to-end, including raw array encoding. Introduces schema/type encoding for arrays, read/write paths for arrays, JNI plumbing, configuration flag, and extensive tests. Updates Parquet writer/reader utilities, schema generation, encoding APIs (hybrid RLE), and metadata handling. Bumps “CreatedBy” to version 9.0 and adjusts toolchain and minor formatting.

Changes

Cohort / File(s) Summary
Java config: Parquet raw array encoding flag
core/src/main/java/io/questdb/PropServerConfiguration.java, core/src/main/java/io/questdb/PropertyKey.java, core/src/main/java/io/questdb/cairo/CairoConfiguration.java, core/src/main/java/io/questdb/cairo/CairoConfigurationWrapper.java, core/src/main/java/io/questdb/cairo/DefaultCairoConfiguration.java
Adds boolean property CAIRO_PARTITION_ENCODER_PARQUET_RAW_ARRAY_ENCODING_ENABLED, plumbs through CairoConfiguration API and wrappers.
Java Parquet writer/encoder/JNI plumbing
core/src/main/java/io/questdb/cairo/O3PartitionJob.java, core/src/main/java/io/questdb/cairo/TableWriter.java, core/src/main/java/io/questdb/griffin/engine/table/parquet/PartitionEncoder.java, core/src/main/java/io/questdb/griffin/engine/table/parquet/PartitionUpdater.java, core/rust/qdbr/src/parquet_write/jni.rs
Threads rawArrayEncoding flag and timestamp index through Java and JNI to native writer; updates encodeWithOptions and native signatures; determines designated timestamp; passes flag into ParquetWriter.
Java Parquet reader: arrays
core/src/main/java/io/questdb/griffin/engine/functions/table/ReadParquetRecordCursor.java
Implements getArray via BorrowedArray buffers; adds resource cleanup and adjusts binary access offsets.
Java SQL/compiler adjustment
core/src/main/java/io/questdb/griffin/SqlCompilerImpl.java
Removes restriction blocking Parquet conversion for array columns.
Java tests (Parquet/arrays and API updates)
core/src/test/java/io/questdb/test/griffin/ParquetTest.java, core/src/test/java/io/questdb/test/cairo/ArrayTest.java, core/src/test/java/io/questdb/test/griffin/AlterTableConvertPartitionTest.java, core/src/test/java/io/questdb/test/griffin/engine/table/parquet/PartitionEncoderTest.java, core/src/test/java/io/questdb/test/griffin/engine/table/parquet/PartitionUpdaterTest.java, core/src/test/java/io/questdb/test/griffin/ParallelFilterTest.java, core/src/test/java/io/questdb/test/griffin/engine/table/parquet/ReadParquetFunctionTest.java, core/src/test/java/io/questdb/test/ServerMainTest.java, core/src/test/resources/sqllogictest/test/parquet/array_duckdb.test
Adds array tests and raw-array variants; updates expected outputs; adjusts calls to new encodeWithOptions signature; adds duckdb array sqllogictest; updates server parameter expectations. Formatting tweaks in some tests.
Compat tests
compat/src/test/java/io/questdb/compat/ParquetTest.java
Adds array tests (1D/2D) for V1/V2; introduces raw array encoding test variants; updates schema/metadata expectations to include array columns; updates CreatedBy to version 9.0.
Rust writer: arrays, schema, options
core/rust/qdbr/src/parquet_write/array.rs, .../file.rs, .../schema.rs, .../util.rs, .../primitive.rs, .../binary.rs, .../fixed_len_bytes.rs, .../string.rs, .../symbol.rs, .../boolean.rs, .../update.rs, .../mod.rs
Adds array page builders (nested and raw), array stats, level encoders (primitive/group) with explicit lengths, raw_array_encoding option in WriteOptions/ParquetWriter/Updater, schema support for arrays (nested or raw) and designated_timestamp; migrates stats to BinaryMaxMinStats; replaces encode_bool_iter with encode_primitive_def_levels; adapts bit-width/encoders to lengthed APIs.
Rust reader: arrays and decoding refactor
core/rust/qdbr/src/parquet_read/decode.rs, .../meta.rs, .../column_sink/var.rs, .../slicer/rle.rs, .../util.rs, core/rust/qdbr/src/parquet/mod.rs
Adds RawArrayColumnSink and array decode path (plain/delta length); computes page row counts; removes explicit Version parameter; integrates LevelsIterator; adds ARRAY_NDIMS_LIMIT and align8b; refactors RLE repeat iterators.
Rust core types
core/rust/qdb-core/src/col_type.rs, core/rust/qdb-core/src/col_driver/array.rs
Adds array encoding to ColumnType with dimensionality/element accessors and encode_array_type; exposes ArrayAuxEntry and its accessors.
Rust parquet2 API adjustments
core/rust/qdbr/parquet2/src/deserialize/native.rs, .../deserialize/utils.rs, .../encoding/hybrid_rle/bitmap.rs, .../encoding/hybrid_rle/encoder.rs, .../tests/it/write/binary.rs, .../tests/it/write/primitive.rs
Introduces explicit lifetimes for decoders; encode_bool/encode_u32 now require explicit length; updates call sites/tests to pass lengths.
Rust misc updates
core/rust/qdbr/src/parquet/error.rs, .../qdb_metadata.rs, .../allocator.rs, .../lib.rs, core/rust/qdbr/rust-toolchain.toml
Formatting changes for Display; removes backtrace display wrapper; minor error text updates; thread name formatting; toolchain bump to nightly-2025-02-07.

Sequence Diagram(s)

sequenceDiagram
  participant Java as Java caller
  participant PE as PartitionEncoder
  participant JNI as Native (JNI)
  participant PW as ParquetWriter
  participant S as Schema
  participant AW as Array Writer

  Java->>PE: encodeWithOptions(..., statistics, rawArrayEncoding, ...)
  PE->>JNI: encodePartition(..., statistics, rawArrayEncoding, ...)
  JNI->>PW: ParquetWriter::new(...).with_raw_array_encoding(rawArrayEncoding)
  PW->>S: to_parquet_schema(partition, rawArrayEncoding)
  alt rawArrayEncoding = true
    S-->>PW: ByteArray schema for arrays (raw)
    PW->>AW: array_to_raw_page(aux,data,...)
  else rawArrayEncoding = false
    S-->>PW: Nested LIST schema for arrays
    PW->>AW: array_to_page(primitive,dim,levels,...)
  end
  PW-->>JNI: pages written
  JNI-->>PE: success

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested labels

storage

@bluestreak01
Member

hey @puzpuzpuz the review is of the changes made in the PR:

image

@puzpuzpuz
Contributor Author

puzpuzpuz commented Aug 14, 2025

> hey @puzpuzpuz the review is of the changes made in the PR:

My bad, I was looking at the "Outside diff range comments (12)" section.

@bluestreak01 bluestreak01 changed the title from "chore(sql): support array column type in parquet partitions" to "feat(sql): support array column type in parquet partitions" Aug 14, 2025
@glasstiger
Contributor

[PR Coverage check]

😍 pass : 1402 / 1570 (89.30%)

file detail

path, covered lines, new lines, coverage
🔵 lib.rs 0 1 00.00%
🔵 parquet/qdb_metadata.rs 0 2 00.00%
🔵 parquet_write/binary.rs 3 9 33.33%
🔵 parquet_write/varchar.rs 3 9 33.33%
🔵 parquet_write/string.rs 3 8 37.50%
🔵 allocator.rs 1 2 50.00%
🔵 parquet/error.rs 3 5 60.00%
🔵 parquet_read/column_sink/var.rs 25 37 67.57%
🔵 parquet_read/decode.rs 199 231 86.15%
🔵 parquet_write/boolean.rs 7 8 87.50%
🔵 parquet_write/update.rs 7 8 87.50%
🔵 parquet_read/meta.rs 64 71 90.14%
🔵 parquet_write/array.rs 674 741 90.96%
🔵 parquet_write/file.rs 187 205 91.22%
🔵 parquet_write/schema.rs 77 82 93.90%
🔵 io/questdb/griffin/engine/functions/table/ReadParquetRecordCursor.java 23 24 95.83%
🔵 parquet_write/jni.rs 20 21 95.24%
🔵 parquet_read/slicer/rle.rs 5 5 100.00%
🔵 parquet_write/fixed_len_bytes.rs 4 4 100.00%
🔵 io/questdb/PropServerConfiguration.java 2 2 100.00%
🔵 parquet_write/util.rs 74 74 100.00%
🔵 io/questdb/cairo/ColumnType.java 1 1 100.00%
🔵 parquet_write/symbol.rs 5 5 100.00%
🔵 io/questdb/cairo/DefaultCairoConfiguration.java 1 1 100.00%
🔵 io/questdb/griffin/engine/table/parquet/PartitionUpdater.java 1 1 100.00%
🔵 io/questdb/cairo/O3PartitionJob.java 1 1 100.00%
🔵 io/questdb/PropertyKey.java 1 1 100.00%
🔵 parquet_write/mod.rs 5 5 100.00%
🔵 io/questdb/cairo/TableWriter.java 1 1 100.00%
🔵 io/questdb/cairo/CairoConfigurationWrapper.java 1 1 100.00%
🔵 parquet/util.rs 3 3 100.00%
🔵 parquet_write/primitive.rs 1 1 100.00%

Labels

Enhancement Enhance existing functionality SQL Issues or changes relating to SQL execution

6 participants