Support delta-encoded dictionaries in the Arrow IPC format #6783

@abhiaagarwal

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I'm trying to read Arrow IPC data from a Go source that uses delta-encoded dictionaries. arrow-rs does not seem to be able to parse and properly resolve them. Below is the output I get when reading the data with arrow-rs, followed by the schema it reports.

+----+----------+-------------------------+-----------------------------+-------------------------+----------------------------------+------------------+------------------+-------------------+------+--------+
| id | resource | scope                   | start_time_unix_nano        | duration_time_unix_nano | trace_id                         | span_id          | parent_span_id   | name              | kind | status |
+----+----------+-------------------------+-----------------------------+-------------------------+----------------------------------+------------------+------------------+-------------------+------+--------+
| 0  | {id: 0}  | {id: 0, name: __main__} | 2024-11-24T02:32:12.701141Z | PT161220S               | 4333035c6fc2b2575203f7eb736d1a61 | 4f13686daef62ace | f1a5af7c58d5b29c | child_operation_0 | 1    | {}     |
| 1  | {id: 0}  | {id: 0, name: __main__} | 2024-11-24T02:32:14.147818Z | PT126185S               | 456cdc4641073bee3599f703649a04f5 | 9704fb28a544a987 | 1a30f9660037c951 | child_operation_0 | 1    | {}     |
| 1  | {id: 0}  | {id: 0, name: __main__} | 2024-11-24T02:32:15.666419Z | PT132589S               | 930a2ef4cefd8ce990a388bfd6f72ee6 | 735d38465f4d7739 | c6ab1841b8cd7d99 | child_operation_0 | 1    | {}     |
| 1  | {id: 0}  | {id: 0, name: __main__} | 2024-11-24T02:32:12.863028Z | PT108051S               | 4333035c6fc2b2575203f7eb736d1a61 | b63c63b739fa2268 | f1a5af7c58d5b29c | child_operation_1 | 1    | {}     |
+----+----------+-------------------------+-----------------------------+-------------------------+----------------------------------+------------------+------------------+-------------------+------+--------+

Field { name: "id", data_type: UInt16, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"encoding": "delta"} },
Field { name: "resource", data_type: Struct([Field { name: "id", data_type: UInt16, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"encoding": "delta"} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },
Field { name: "scope", data_type: Struct([Field { name: "id", data_type: UInt16, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"encoding": "delta"} }, Field { name: "name", data_type: Dictionary(UInt8, Utf8), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {"dictId": "1"} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },
Field { name: "start_time_unix_nano", data_type: Timestamp(Nanosecond, Some("UTC")), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} },
Field { name: "duration_time_unix_nano", data_type: Dictionary(UInt8, Duration(Millisecond)), nullable: false, dict_id: 1, dict_is_ordered: false, metadata: {"dictId": "4"} },
Field { name: "trace_id", data_type: FixedSizeBinary(16), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} },
Field { name: "span_id", data_type: FixedSizeBinary(8), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} },
Field { name: "parent_span_id", data_type: FixedSizeBinary(8), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },
Field { name: "name", data_type: Dictionary(UInt8, Utf8), nullable: false, dict_id: 2, dict_is_ordered: false, metadata: {"dictId": "6"} },
Field { name: "kind", data_type: Dictionary(UInt8, Int32), nullable: true, dict_id: 3, dict_is_ordered: false, metadata: {"dictId": "7"} },
Field { name: "status", data_type: Struct([]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }

Because delta encoding is not supported, the id column does not increment as expected: the rows above read 0, 1, 1, 1 rather than increasing from row to row.
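To illustrate the symptom, here is a minimal sketch of how the id values would decode under one reading of the schema. It assumes that the "encoding": "delta" field metadata means each stored value is the difference from the previous row. That is an assumption about the producer, not something arrow-rs defines, and decode_delta is a hypothetical helper, not an existing API.

```rust
/// Decode a delta-coded column by taking a running sum.
/// ASSUMPTION: each stored value is the difference from the previous
/// decoded value, and the running sum starts at zero.
fn decode_delta(stored: &[u16]) -> Vec<u16> {
    let mut acc: u16 = 0;
    stored
        .iter()
        .map(|&delta| {
            // wrapping_add avoids a panic on overflow in debug builds;
            // a real reader would more likely report an error instead.
            acc = acc.wrapping_add(delta);
            acc
        })
        .collect()
}
```

Under that assumption, the raw ids shown in the table (0, 1, 1, 1) would decode to 0, 1, 2, 3.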

Describe the solution you'd like

arrow-rs should support delta-encoded dictionaries when reading the Arrow IPC format.

Describe alternatives you've considered

Additional context

Prior work:
https://github.com/apache/arrow-go/blob/e112ad0de5ca66133256f2db71bda4dd2d1731f5/arrow/internal/dictutils/dict.go
https://github.com/apache/arrow-go/blob/e112ad0de5ca66133256f2db71bda4dd2d1731f5/arrow/ipc/file_reader.go#L763
#5488 for a similar implementation for Arrow Flight

I've started working on this by modifying the private function read_dictionary_impl to mirror the Go implementation.
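For reference, the behavior a reader needs can be sketched roughly as follows. In the Arrow IPC format, a dictionary batch carries an isDelta flag: when it is set, the batch's values are appended to the existing dictionary with the same id, otherwise they replace it. The sketch below uses plain strings instead of Arrow arrays. DictionaryBatch, DictionaryTracker, apply and resolve are hypothetical names for illustration only, not the arrow-rs or arrow-go API, and treating a delta for an unknown id as an error is an assumption.

```rust
use std::collections::HashMap;

/// Simplified stand-in for an IPC dictionary batch (hypothetical type).
struct DictionaryBatch {
    id: i64,
    is_delta: bool,
    values: Vec<String>,
}

/// Dictionaries seen so far on the stream, keyed by dictionary id.
#[derive(Default)]
struct DictionaryTracker {
    by_id: HashMap<i64, Vec<String>>,
}

impl DictionaryTracker {
    /// Apply one dictionary batch: a delta batch appends to the existing
    /// dictionary, a non-delta batch replaces it.
    fn apply(&mut self, batch: DictionaryBatch) -> Result<(), String> {
        if batch.is_delta {
            match self.by_id.get_mut(&batch.id) {
                Some(existing) => {
                    existing.extend(batch.values);
                    Ok(())
                }
                // ASSUMPTION: a delta for an unknown id is reported as an error.
                None => Err(format!(
                    "delta batch for unknown dictionary id {}",
                    batch.id
                )),
            }
        } else {
            self.by_id.insert(batch.id, batch.values);
            Ok(())
        }
    }

    /// Look up each key in the dictionary with the given id.
    /// Returns None if the id is unknown or any key is out of range.
    fn resolve(&self, id: i64, keys: &[u8]) -> Option<Vec<&str>> {
        let dict = self.by_id.get(&id)?;
        keys.iter()
            .map(|&k| dict.get(k as usize).map(String::as_str))
            .collect()
    }
}
```

Because a delta only appends, keys from earlier record batches keep pointing at the same values, which is why the existing dictionary is extended in place rather than rebuilt.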

Labels: arrow, enhancement