Extension type support #2444

@Renkai

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

My friends wrote a data format, lance, that is interconvertible with Parquet, and I want to write another implementation of it in Rust.
However, it uses extension types, which do not appear to be implemented in arrow-rs yet.

Describe the solution you'd like

Allow the reader to convert a Parquet file that contains extension types to Arrow.

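As a rough illustration of what such support could look like: the Arrow columnar format describes an extension type as its storage type plus two field metadata keys, `ARROW:extension:name` and `ARROW:extension:metadata`. The sketch below uses plain Python stand-ins, not pyarrow or arrow-rs objects. The helper `resolve_field_type`, the `registry` dict, and the extension name `example.uuid` are all hypothetical and only show one possible fallback policy: use the storage type when the extension is unknown.

```python
# Metadata keys defined by the Arrow columnar format for extension types.
EXT_NAME_KEY = b"ARROW:extension:name"
EXT_META_KEY = b"ARROW:extension:metadata"


def resolve_field_type(storage_type, field_metadata, registry):
    """Return the type a reader could expose for one field.

    All three arguments are plain Python stand-ins (hypothetical):
    `storage_type` is a string, `field_metadata` maps bytes to bytes,
    and `registry` maps an extension name to a factory callable.
    """
    name = field_metadata.get(EXT_NAME_KEY)
    if name is None:
        # Ordinary field: nothing to resolve.
        return storage_type
    factory = registry.get(name)
    if factory is None:
        # Unknown extension: fall back to the storage type instead of failing.
        return storage_type
    # Known extension: let the registered factory interpret the opaque bytes.
    return factory(storage_type, field_metadata.get(EXT_META_KEY, b""))


# "example.uuid" is a made-up extension name used only for this sketch.
metadata = {EXT_NAME_KEY: b"example.uuid", EXT_META_KEY: b""}
registry = {b"example.uuid": lambda storage, meta: ("uuid", storage)}

print(resolve_field_type("fixed_size_binary[16]", metadata, registry))
print(resolve_field_type("fixed_size_binary[16]", metadata, {}))
```

With the fallback, a reader that does not know the extension would still return the 16-byte binary column, which would already be enough for the use case described here.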
Describe alternatives you've considered

Additional context

Python code I used to generate such a file.

# with pyarrow-9.0.0
import uuid

import pyarrow as pa
import pyarrow.parquet as pq


class UuidType(pa.PyExtensionType):
    """Extension type that stores a UUID as 16 raw bytes."""

    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.binary(16))

    def __reduce__(self):
        return UuidType, ()


if __name__ == '__main__':
    uuid_type = UuidType()
    print(uuid_type.extension_name)
    print(uuid_type.storage_type)

    # Build a fixed-size binary storage array and wrap it in the extension type.
    storage_array = pa.array([uuid.uuid4().bytes for _ in range(4)], pa.binary(16))
    arr = pa.ExtensionArray.from_storage(uuid_type, storage_array)
    print(arr)
    table = pa.Table.from_arrays([arr], names=["uuid"])

    pq.write_table(table, "extension_example.parquet")

    # pyarrow reads the file back and prints it successfully
    parquet_table = pq.read_table('extension_example.parquet')
    print("schema", parquet_table.schema)
    print("table", parquet_table)

Rust code that fails to read the file

        let input_file_name = "extension_example.parquet";
        //https://docs.rs/parquet/19.0.0/parquet/arrow/index.html
        use arrow::record_batch::RecordBatchReader;
        use parquet::arrow::{ParquetFileArrowReader, ArrowReader, ProjectionMask};
        use std::fs::File;

        let file = File::open(input_file_name).unwrap();

        let mut arrow_reader = ParquetFileArrowReader::try_new(file).unwrap();
        let mask = ProjectionMask::leaves(arrow_reader.parquet_schema(), [0]);
        println!("parquet schema is: {:?}", arrow_reader.parquet_schema());
        println!("Converted arrow schema is: {}", arrow_reader.get_schema().unwrap());

Error log

thread 'tests::test_convert' panicked at 'called `Result::unwrap()` on an `Err` value: ArrowError("Unable to get root as message stored in ARROW:schema: Utf8Error { error: Utf8Error { valid_up_to: 0, error_len: Some(1) }, range: 216..255, error_trace: ErrorTrace([TableField { field_name: \"value\", position: 208 }, VectorElement { index: 0, position: 116 }, TableField { field_name: \"custom_metadata\", position: 92 }, VectorElement { index: 0, position: 48 }, TableField { field_name: \"fields\", position: 40 }, UnionVariant { variant: \"MessageHeader::Schema\", position: 24 }, TableField { field_name: \"header\", position: 24 }]) }")', src/lib.rs:39:77
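
One possible reading of this error, offered as an assumption rather than a confirmed diagnosis: the `Utf8Error` with `valid_up_to: 0` is reported for a `custom_metadata` value, and `pa.PyExtensionType` is believed to serialize its extension metadata with `pickle`. Binary pickles start with the byte `0x80`, which is not a valid UTF-8 start byte. The standard-library sketch below needs no pyarrow. `first_utf8_error` is a hypothetical helper, and the pickled tuple is only a stand-in for the real metadata. It shows that such a payload fails UTF-8 decoding at offset 0.

```python
import pickle


def first_utf8_error(data: bytes):
    """Return (offset, reason) of the first UTF-8 error, or None if valid."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, exc.reason
    return None


# Stand-in for the serialized extension metadata (assumption: a binary pickle).
payload = pickle.dumps(("UuidType", ()))

print(hex(payload[0]))            # first byte of the pickle stream
print(first_utf8_error(payload))  # offset and reason of the decoding failure
print(first_utf8_error(b"uuid"))  # plain ASCII metadata decodes without error
```

If this reading is right, a reader that treats metadata values as strings would reject the schema before it even looks at the extension name.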


Labels: enhancement (any new improvement worthy of an entry in the changelog), parquet (changes to the parquet crate)
