Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I'm trying to read Arrow IPC data produced by a Go source that uses delta-encoded dictionaries. It appears that arrow-rs cannot parse and correctly resolve them.
+----+----------+-------------------------+-----------------------------+-------------------------+----------------------------------+------------------+------------------+-------------------+------+--------+
| id | resource | scope | start_time_unix_nano | duration_time_unix_nano | trace_id | span_id | parent_span_id | name | kind | status |
+----+----------+-------------------------+-----------------------------+-------------------------+----------------------------------+------------------+------------------+-------------------+------+--------+
| 0 | {id: 0} | {id: 0, name: __main__} | 2024-11-24T02:32:12.701141Z | PT161220S | 4333035c6fc2b2575203f7eb736d1a61 | 4f13686daef62ace | f1a5af7c58d5b29c | child_operation_0 | 1 | {} |
| 1 | {id: 0} | {id: 0, name: __main__} | 2024-11-24T02:32:14.147818Z | PT126185S | 456cdc4641073bee3599f703649a04f5 | 9704fb28a544a987 | 1a30f9660037c951 | child_operation_0 | 1 | {} |
| 1 | {id: 0} | {id: 0, name: __main__} | 2024-11-24T02:32:15.666419Z | PT132589S | 930a2ef4cefd8ce990a388bfd6f72ee6 | 735d38465f4d7739 | c6ab1841b8cd7d99 | child_operation_0 | 1 | {} |
| 1 | {id: 0} | {id: 0, name: __main__} | 2024-11-24T02:32:12.863028Z | PT108051S | 4333035c6fc2b2575203f7eb736d1a61 | b63c63b739fa2268 | f1a5af7c58d5b29c | child_operation_1 | 1 | {} |
Field { name: "id", data_type: UInt16, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"encoding": "delta"} },
Field { name: "resource", data_type: Struct([Field { name: "id", data_type: UInt16, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"encoding": "delta"} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },
Field { name: "scope", data_type: Struct([Field { name: "id", data_type: UInt16, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"encoding": "delta"} }, Field { name: "name", data_type: Dictionary(UInt8, Utf8), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"dictId": "1"} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },
Field { name: "start_time_unix_nano", data_type: Timestamp(Nanosecond, Some("UTC")), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} },
Field { name: "duration_time_unix_nano", data_type: Dictionary(UInt8, Duration(Millisecond)), nullable: false, dict_id: 1, dict_is_ordered: false, metadata: {"dictId": "4"} },
Field { name: "trace_id", data_type: FixedSizeBinary(16), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} },
Field { name: "span_id", data_type: FixedSizeBinary(8), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} },
Field { name: "parent_span_id", data_type: FixedSizeBinary(8), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },
Field { name: "name", data_type: Dictionary(UInt8, Utf8), nullable: false, dict_id: 2, dict_is_ordered: false, metadata: {"dictId": "6"} },
Field { name: "kind", data_type: Dictionary(UInt8, Int32), nullable: true, dict_id: 3, dict_is_ordered: false, metadata: {"dictId": "7"} },
Field { name: "status", data_type: Struct([]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }
Because delta encoding is not supported, the id column is not incrementing: in the output above it reads 0, 1, 1, 1 rather than an increasing sequence.
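To illustrate the symptom, here is a minimal sketch of what decoding such a column could look like. It assumes that the "encoding": "delta" metadata means the first value is stored as-is and every later value is the difference from its predecessor; that interpretation is an assumption, and `decode_delta` is a hypothetical helper, not an arrow-rs function. Nulls are ignored for brevity.

```rust
/// Decode a delta-encoded column by taking a running sum.
///
/// Assumption: the first stored value is the absolute value and each later
/// stored value is the difference from the previous decoded value.
/// Hypothetical helper for illustration only; not part of arrow-rs.
fn decode_delta(deltas: &[u16]) -> Vec<u16> {
    let mut running: u16 = 0;
    deltas
        .iter()
        .map(|&delta| {
            // wrapping_add avoids a panic on overflow in debug builds
            running = running.wrapping_add(delta);
            running
        })
        .collect()
}
```

Under that assumption, the raw id values shown in the table (0, 1, 1, 1) would decode to 0, 1, 2, 3.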
Describe the solution you'd like
arrow-rs should support delta-encoded dictionaries.
Describe alternatives you've considered
Additional context
Prior work:
https://github.com/apache/arrow-go/blob/e112ad0de5ca66133256f2db71bda4dd2d1731f5/arrow/internal/dictutils/dict.go
https://github.com/apache/arrow-go/blob/e112ad0de5ca66133256f2db71bda4dd2d1731f5/arrow/ipc/file_reader.go#L763
#5488 for a similar implementation for Arrow Flight
I've started working on this, mimicking the Go implementation by modifying the private function read_dictionary_impl.
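For reference, a rough sketch of the behaviour a reader needs. In the Arrow IPC format a dictionary batch carries an isDelta flag: when it is set, the batch's values are appended to the existing dictionary with the same id, otherwise the batch replaces it. The types below (`DictBatch`, `DictTracker`) are hypothetical stand-ins using plain strings; they are not arrow-rs types and do not reflect how read_dictionary_impl is actually structured. Rejecting a delta for an unknown id is a design choice of this sketch.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a decoded IPC dictionary batch (not an arrow-rs type).
struct DictBatch {
    id: i64,
    is_delta: bool,
    values: Vec<String>,
}

/// Hypothetical per-reader dictionary state, keyed by dictionary id.
#[derive(Default)]
struct DictTracker {
    dicts: HashMap<i64, Vec<String>>,
}

impl DictTracker {
    /// Apply one dictionary batch: append on delta, replace otherwise.
    fn apply(&mut self, batch: DictBatch) -> Result<(), String> {
        if batch.is_delta {
            // A delta extends a dictionary that must already exist
            // (assumption of this sketch: an unknown id is an error).
            let existing = self
                .dicts
                .get_mut(&batch.id)
                .ok_or_else(|| format!("delta for unknown dictionary id {}", batch.id))?;
            existing.extend(batch.values);
        } else {
            // A non-delta batch replaces any previous dictionary with this id.
            self.dicts.insert(batch.id, batch.values);
        }
        Ok(())
    }

    /// Resolve a dictionary key to its value, if both id and key exist.
    fn lookup(&self, id: i64, key: usize) -> Option<&str> {
        self.dicts.get(&id)?.get(key).map(String::as_str)
    }
}
```

Because a delta only appends, keys issued against the earlier dictionary stay valid; record batches read after the delta can additionally reference the new trailing entries.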