feat(sql): support array column type in parquet partitions #5925

bluestreak01 merged 87 commits into master from
Conversation
Walkthrough

Adds Parquet array support end-to-end, including raw array encoding. Introduces schema/type encoding for arrays, read/write paths for arrays, JNI plumbing, a configuration flag, and extensive tests. Updates Parquet writer/reader utilities, schema generation, encoding APIs (hybrid RLE), and metadata handling. Bumps "CreatedBy" to version 9.0 and adjusts the toolchain and minor formatting.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Java as Java caller
    participant PE as PartitionEncoder
    participant JNI as Native (JNI)
    participant PW as ParquetWriter
    participant S as Schema
    participant AW as Array Writer
    Java->>PE: encodeWithOptions(..., statistics, rawArrayEncoding, ...)
    PE->>JNI: encodePartition(..., statistics, rawArrayEncoding, ...)
    JNI->>PW: ParquetWriter::new(...).with_raw_array_encoding(rawArrayEncoding)
    PW->>S: to_parquet_schema(partition, rawArrayEncoding)
    alt rawArrayEncoding = true
        S-->>PW: ByteArray schema for arrays (raw)
        PW->>AW: array_to_raw_page(aux, data, ...)
    else rawArrayEncoding = false
        S-->>PW: Nested LIST schema for arrays
        PW->>AW: array_to_page(primitive, dim, levels, ...)
    end
    PW-->>JNI: pages written
    JNI-->>PE: success
```
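When `rawArrayEncoding` is off, the writer shreds arrays into the standard Parquet LIST layout, which requires definition and repetition levels per the Dremel model. The sketch below (not QuestDB's actual code; function name and simplifications are illustrative) computes the level pairs that a writer like `array_to_page(primitive, dim, levels, ...)` would need for an optional 1-D double-array column with required elements (max definition level 2, max repetition level 1):

```python
def levels_for_double_list_column(rows):
    """Compute (definition, repetition) level pairs for an optional
    LIST<double> column with required elements, following the
    Parquet/Dremel shredding rules.

    Levels for this schema:
      def=0 -> the list itself is null
      def=1 -> the list is present but empty
      def=2 -> an element is present
      rep=0 -> value starts a new row; rep=1 -> continues the same list
    """
    pairs = []
    for row in rows:
        if row is None:            # null list
            pairs.append((0, 0))
        elif len(row) == 0:        # empty list
            pairs.append((1, 0))
        else:
            for i, _ in enumerate(row):
                rep = 0 if i == 0 else 1
                pairs.append((2, rep))
    return pairs

pairs = levels_for_double_list_column([[1.0, 2.0], None, [], [3.0]])
print(pairs)
# -> [(2, 0), (2, 1), (0, 0), (1, 0), (2, 0)]
```

Note how nulls and empty lists consume a level pair but no value in the data page, which is what makes the LIST layout readable by third-party tools.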
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Hey @puzpuzpuz, the review is of the changes made in the PR:
My bad, I was looking into
Removed the unused `encode_data_plain` function that was marked as dead code and kept only for compatibility. The streaming version `encode_data_plain_streaming` is now used instead. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]>
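For context, Parquet's PLAIN encoding for DOUBLE columns is just the little-endian IEEE 754 representation of each value, concatenated. A minimal sketch of the streaming idea (names and chunking are illustrative; this is not QuestDB's Rust implementation) is to emit fixed-size chunks instead of buffering the whole column:

```python
import struct

def encode_plain_streaming(values, chunk_size=2):
    """Yield PLAIN-encoded (little-endian IEEE 754) byte chunks for a
    stream of DOUBLE values, so the whole column never has to sit in
    one buffer. chunk_size counts values per emitted chunk."""
    buf = []
    for v in values:
        buf.append(struct.pack("<d", v))  # PLAIN DOUBLE = 8 LE bytes
        if len(buf) == chunk_size:
            yield b"".join(buf)
            buf.clear()
    if buf:                               # flush the final partial chunk
        yield b"".join(buf)

chunks = list(encode_plain_streaming([1.0, 2.0, 3.0]))
# concatenating the chunks reproduces the plain encoding of all values
```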
PR Coverage check: 😍 pass: 1402 / 1570 (89.30%)

Adds array column type support for table partitions in Apache Parquet format. This means that tables with array columns can now be converted to/from Parquet format.

By default, arrays are exported as lists of double values; the Parquet field layout implements the spec's requirements for lists. As a more lightweight alternative that is less compatible with third-party software, arrays can be exported in the native binary format, i.e. as byte arrays. To do that, the `cairo.partition.encoder.parquet.raw.array.encoding.enabled=true` config prop should be specified.

Other than that, the PR includes the following:

- the `read_parquet()` SQL function is able to read DuckDB-generated arrays (lists)
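For reference, the raw encoding would be enabled via the server configuration; a minimal sketch (the property name is taken from the PR description, while the `server.conf` location and surrounding keys are assumptions about a standard QuestDB setup):

```
# server.conf — opt in to raw (byte-array) Parquet encoding for array columns.
# Default is false: arrays are written as standard Parquet LIST structures,
# which third-party readers (e.g. DuckDB) understand.
cairo.partition.encoder.parquet.raw.array.encoding.enabled=true
```

The trade-off stated above applies: raw encoding is lighter to write, but the resulting byte-array columns are opaque to third-party Parquet consumers.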