Implement `RecordBatch` <--> `FlightData` encode/decode + tests by alamb · Pull Request #3391 · apache/arrow-rs

alamb · 2022-12-25T22:29:02Z

Which issue does this PR close?

(note a large amount of this PR is test code)

Rationale for this change

The details of encoding / sending / receiving / reconstructing RecordBatches over Flight are common and somewhat duplicated across every implementation of Flight. The mid level flight client aims to automate much of this common work.

What changes are included in this PR?

Add encode.rs logic in FlightDataEncoderBuilder (based on the code from IOx in https://github.com/influxdata/influxdb_iox/pull/6460)
Move decode logic out of client and into decode.rs
Make the decoders more generic
Tests: round trip tests of encoding/decoding
Tests for FlightClient::do_get

Are there any user-facing changes?

yes, encoders and decoders

Next planned PRs:

Rewrite FlightSQL client in terms of the mid level client.

alamb · 2022-12-25T22:30:06Z

arrow-flight/src/client.rs

-/// calling [`Self::into_inner`] and using the [`FlightDataStream`]
-/// directly.
-#[derive(Debug)]
-pub struct FlightRecordBatchStream {


This is moved / renamed / tested in decode.rs

alamb · 2022-12-25T22:31:11Z

arrow-flight/src/encode.rs

+// specific language governing permissions and limitations
+// under the License.
+
+use std::{collections::VecDeque, fmt::Debug, pin::Pin, sync::Arc, task::Poll};


This code is based on https://github.com/influxdata/influxdb_iox/pull/6460.

It handles encoding RecordBatches into FlightData and the details of dictionaries, etc.

alamb · 2022-12-25T22:31:47Z

arrow-flight/src/encode.rs

+    /// Ipc writer options
+    options: IpcWriteOptions,
+    /// Metadata to add to the schema message
+    app_metadata: Bytes,


I hope eventually there can be options related to dictionary encoding here -- like "try and match dictionaries" for example

alamb · 2022-12-25T22:33:07Z

arrow-flight/tests/encode_decode.rs

+// specific language governing permissions and limitations
+// under the License.
+
+//! Tests for round trip encoding / decoding


Here are the round trip tests for encode/decode (which ensure that RecordBatches sent via Flight get through correctly).

I expect that we can use this framework as we sort out how to properly encode dictionaries (e.g. #3389)

alamb · 2022-12-25T22:33:21Z

arrow-ipc/src/writer.rs

 use crate::CONTINUATION_MARKER;

-/// IPC write options used to control the behaviour of the writer
+/// IPC write options used to control the behaviour of the [`IpcDataGenerator`]


drive by doc fixes

alamb · 2022-12-25T22:34:40Z

arrow-flight/tests/client.rs


-// TODO test for do_get
+#[tokio::test]
+async fn test_do_get() {


here are the tests for do_get promised in #3378

alamb · 2022-12-25T22:36:19Z

cc @avantgardnerio @nevi-me and @tustvold

viirya · 2022-12-26T00:03:49Z

arrow-flight/src/decode.rs

+                    DecodedPayload::Schema(_) => {
+                        self.got_schema = true;
+                        // Need next message, poll inner again
+                    }
+                    DecodedPayload::RecordBatch(batch) => {
+                        return Poll::Ready(Some(Ok(batch)));
+                    }


Do we need to verify the schema of record batch against received schema?

I actually would like to avoid verifying the schema at the lower level decoder so that clients can send other schema messages if they so wish (we actually do this in IOx).

The higher level record batch decoder, however, should verify the schema, I think.

I added tests to illustrate these behaviors in test_chained_streams_batch_decoder and test_chained_streams_data_decoder

I also added test_mismatched_schema_message to test sending an incorrect schema message. ~~Currently it panics but we can probably turn that into a useful message at some point~~ I improved the message as well
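To make the split between the two decoder levels concrete, here is a minimal std-only sketch. `Schema`, `Batch` and `BatchDecoder` are simplified stand-ins invented for illustration, not the arrow-flight types, and the check shown (each batch must match the most recently received schema) is only one possible reading of "verify"; the PR's decoder may behave differently, for example when a second schema message arrives.

```rust
use std::collections::VecDeque;

/// Simplified stand-ins for the real Arrow types (assumptions for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct Schema(pub Vec<String>);

#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub schema: Schema,
    pub rows: usize,
}

/// What the low level decoder yields: it reports messages without validating them.
#[derive(Debug, Clone, PartialEq)]
pub enum DecodedPayload {
    Schema(Schema),
    RecordBatch(Batch),
}

/// Higher level decoder: yields only batches and checks each one against the
/// most recently received schema.
pub struct BatchDecoder {
    inner: VecDeque<DecodedPayload>,
    schema: Option<Schema>,
}

impl BatchDecoder {
    pub fn new(messages: Vec<DecodedPayload>) -> Self {
        Self { inner: messages.into(), schema: None }
    }
}

impl Iterator for BatchDecoder {
    type Item = Result<Batch, String>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            match self.inner.pop_front()? {
                DecodedPayload::Schema(schema) => {
                    // Remember the schema, then keep reading for the next message
                    self.schema = Some(schema);
                }
                DecodedPayload::RecordBatch(batch) => {
                    return Some(match &self.schema {
                        Some(s) if *s == batch.schema => Ok(batch),
                        Some(_) => Err("batch does not match received schema".to_string()),
                        None => Err("received batch before schema".to_string()),
                    });
                }
            }
        }
    }
}
```

A caller that wants to send extra schema messages would read `DecodedPayload` values directly, while a caller that only wants validated batches would iterate over `BatchDecoder`.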


alamb · 2022-12-26T15:12:47Z

arrow-ipc/src/reader.rs

            make_array(data)
        }
        _ => {
+            if nodes.len() <= node_index {


The code panics at nodes.get(node_index) below, so this change just returns an error rather than panicking, which is a slightly better user experience.
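The pattern is a bounds check before the access. A rough sketch, using a plain slice and a hypothetical `FieldNode` struct in place of the real arrow-ipc flatbuffer types (the error text is also made up):

```rust
/// Hypothetical stand-in for an IPC field node (not the real arrow-ipc type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct FieldNode {
    pub length: usize,
    pub null_count: usize,
}

/// Returns the node at `node_index`, or an error describing the problem.
/// Indexing with `nodes[node_index]` alone would panic on malformed input.
pub fn node_at(nodes: &[FieldNode], node_index: usize) -> Result<FieldNode, String> {
    if nodes.len() <= node_index {
        return Err(format!(
            "Invalid data: expected node index {node_index} but only {} nodes are present",
            nodes.len()
        ));
    }
    Ok(nodes[node_index])
}
```

With this shape, malformed input from a remote peer surfaces as an `Err` the caller can report, instead of aborting the task.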

alamb · 2022-12-27T15:52:02Z

@tustvold if you have a chance, I would appreciate a review of this PR.

I am particularly interested in your opinion of how we should handle sending RecordBatch'es and retain the dictionary encoding.

One way might be to have the encode stream match / assign dictionary ids prior to sending. I realize this would be less efficient than if the creators of the dictionaries were to set the ids, but it would likely be more efficient than hydrating the dictionaries into actual data.

tustvold · 2022-12-28T11:10:36Z

I am particularly interested in your opinion of how we should handle sending RecordBatch'es and retain the dictionary encoding.

I think we could just assign a unique ID to each dictionary encoded field, and then send a new DictionaryBatch for each field for each RecordBatch. We could then potentially retain the previous dictionary and use ArrayData::ptr_eq to elide sending the exact same dictionary again.
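A minimal sketch of the "retain the previous dictionary and skip identical ones" idea, under loud assumptions: `Dictionary` is an `Arc<Vec<String>>` stand-in for Arrow array data, `Arc::ptr_eq` stands in for `ArrayData::ptr_eq`, and `DictionaryTracker` / `needs_send` are hypothetical names, not an arrow-rs API.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for a dictionary's values (real code would hold Arrow array data).
pub type Dictionary = Arc<Vec<String>>;

/// Remembers the last dictionary sent for each dictionary id.
#[derive(Default)]
pub struct DictionaryTracker {
    last_sent: HashMap<i64, Dictionary>,
}

impl DictionaryTracker {
    /// Returns true if a dictionary message must be sent for `dict_id`,
    /// and records `dict` as the latest one sent for that id.
    pub fn needs_send(&mut self, dict_id: i64, dict: &Dictionary) -> bool {
        // Pointer equality is cheap: same allocation means same contents.
        // Equal contents in a different allocation are still re-sent.
        let unchanged = self
            .last_sent
            .get(&dict_id)
            .map_or(false, |prev| Arc::ptr_eq(prev, dict));
        if !unchanged {
            self.last_sent.insert(dict_id, Arc::clone(dict));
        }
        !unchanged
    }
}
```

The trade-off is that pointer comparison never inspects the values, so it can miss dictionaries that are equal but separately allocated; it only avoids the most obvious repeated sends.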

tustvold

Mostly just some minor nits, I personally would avoid upstreaming the hydration of dictionaries so that we aren't committed to supporting it long term, I think it should be relatively straightforward to properly support dictionaries in a somewhat sane manner

tustvold · 2022-12-28T11:23:08Z

arrow-flight/tests/encode_decode.rs

+            if i == i / 2 {
+                None
+            } else {
+                // repeat some values for low cardinality
+                let v = i / 3;
+                Some(format!("value{v}"))
+            }


Suggested change

-            if i == i / 2 {
-                None
-            } else {
-                // repeat some values for low cardinality
-                let v = i / 3;
-                Some(format!("value{v}"))
-            }
+            // repeat some values for low cardinality
+            (i != i / 2).then(|| format!("value{}", i / 3))

I also wonder if i == i / 2 is what you meant to write, as that will only be true for 0

you are right -- I meant num_rows / 2 -- thank you

As above, I think the then formulation is harder to read so I would like to leave the current less concise formulation
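The difference between the two conditions is easy to check with integer division. The helper names below are made up for this illustration:

```rust
/// Indices in `0..num_rows` that the original condition `i == i / 2` turns into nulls.
pub fn nulls_original(num_rows: usize) -> Vec<usize> {
    (0..num_rows).filter(|&i| i == i / 2).collect()
}

/// Indices that the intended condition `i == num_rows / 2` turns into nulls.
pub fn nulls_intended(num_rows: usize) -> Vec<usize> {
    (0..num_rows).filter(|&i| i == num_rows / 2).collect()
}
```

For any non-empty range the original condition selects only index 0, whereas the intended one selects the middle row, so the null lands in the middle of the batch rather than at the start.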

tustvold · 2022-12-28T11:24:05Z

arrow-flight/tests/encode_decode.rs

+fn make_primative_batch(num_rows: usize) -> RecordBatch {
+    let i: UInt8Array = (0..num_rows)
+        .map(|i| {
+            if i == num_rows / 2 {


This could be rewritten with bool::then as below, to make it significantly more concise

It is true -- this would be more concise. However after trying it (see below) I think the current formulation is easier to read, so I plan to leave it as is.

Thank you for the suggestion

     let i: UInt8Array = (0..num_rows)
-        .map(|i| {
-            if i == num_rows / 2 {
-                None
-            } else {
-                Some(i.try_into().unwrap())
-            }
-        })
+        .map(|i| (i != num_rows / 2).then(|| i.try_into().unwrap()))
         .collect();
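For reference, the two formulations are equivalent; the method involved is `bool::then`, since the receiver is a boolean. The sketch below collects into `Vec<Option<u8>>` rather than `UInt8Array` so that it needs only the standard library, and the function names are invented for the comparison:

```rust
// Needed on editions before 2021; redundant but harmless on 2021.
use std::convert::TryInto;

/// The if/else formulation kept in the PR.
pub fn with_if_else(num_rows: usize) -> Vec<Option<u8>> {
    (0..num_rows)
        .map(|i| {
            if i == num_rows / 2 {
                None
            } else {
                Some(i.try_into().unwrap())
            }
        })
        .collect()
}

/// The more concise formulation using `bool::then`.
pub fn with_then(num_rows: usize) -> Vec<Option<u8>> {
    (0..num_rows)
        .map(|i| (i != num_rows / 2).then(|| i.try_into().unwrap()))
        .collect()
}
```

Note that the condition has to be inverted (`!=` rather than `==`) in the `then` version, which is part of why it can be harder to read at a glance.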

tustvold · 2022-12-28T11:28:36Z

arrow-flight/src/encode.rs

+/// 1. Hydrates any dictionaries to its underlying type. See
+/// hydrate_dictionary for more information.
+///
+pub fn prepare_batch_for_flight(


This method is stateless which means it necessarily will never be able to properly handle dictionaries, given we just deprecated a similar method in the IPC reader, perhaps we should avoid doing this in favour of the stateful encoder above?

I am not quite sure what you are suggesting here. I removed the pub in 2267d13 so that we can have more flexibility with non breaking API changes in the future.

tustvold · 2022-12-28T11:30:00Z

arrow-flight/src/encode.rs

+/// * <https://github.com/apache/arrow-rs/issues/1206>
+///
+/// For now we just hydrate the dictionaries to their underlying type
+fn hydrate_dictionary(array: &ArrayRef) -> Result<ArrayRef> {


I wonder if we should keep this as a workaround in IOx, and only upstream the proper dictionary support once it is ready? I just wonder if this is a mode we really want to support long-term, especially if we make it the default behaviour?

See #3391 (comment)

I will attempt to get rid of this workaround prior to the arrow 31 release

alamb · 2022-12-28T18:03:49Z

Mostly just some minor nits, I personally would avoid upstreaming the hydration of dictionaries so that we aren't committed to supporting it long term, I think it should be relatively straightforward to properly support dictionaries in a somewhat sane manner

Thank you -- I will investigate how much effort it will take to do so. I am almost out of time today so I will likely not get a chance to work on this until later in the week

viirya · 2022-12-29T00:18:12Z

arrow-flight/src/encode.rs

+                None => {
+                    // inner is done
+                    self.done = true;
+                }


Once reaching here, I guess the queue is guaranteed to be empty? If so, seems we can just return Poll::Ready(None); here?

Yes that is correct. This case is also handled on the next loop iteration, but I think making it explicit is good too. I did so in 4f30ab7
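The control flow under discussion can be sketched with a plain `Iterator` instead of an async `Stream`. Everything here (`Encoder`, `Vec<u32>` batches, `u32` messages) is a hypothetical simplification and not the code in encode.rs; the point is only the queue-then-inner ordering and the explicit early return.

```rust
use std::collections::VecDeque;

/// Turns each input item into zero or more output messages, buffered in a queue.
/// `Encoder` and its fields are hypothetical names for this sketch.
pub struct Encoder<I: Iterator<Item = Vec<u32>>> {
    inner: I,
    queue: VecDeque<u32>,
    done: bool,
}

impl<I: Iterator<Item = Vec<u32>>> Encoder<I> {
    pub fn new(inner: I) -> Self {
        Self { inner, queue: VecDeque::new(), done: false }
    }
}

impl<I: Iterator<Item = Vec<u32>>> Iterator for Encoder<I> {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        loop {
            if self.done && self.queue.is_empty() {
                return None;
            }
            // Drain anything already encoded before asking for more input
            if let Some(msg) = self.queue.pop_front() {
                return Some(msg);
            }
            match self.inner.next() {
                Some(batch) => self.queue.extend(batch),
                None => {
                    // inner is done and the queue was just found empty, so
                    // returning here is equivalent to looping once more
                    self.done = true;
                    return None;
                }
            }
        }
    }
}
```

Because the queue is always drained before `inner` is consulted, reaching the `None` arm implies the queue is empty, which is what makes the early return safe.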

alamb · 2022-12-29T14:32:42Z

Mostly just some minor nits, I personally would avoid upstreaming the hydration of dictionaries so that we aren't committed to supporting it long term

I have pondered this for a while and my plan is to proceed with merging this PR (with the temporary dictionary hydration) and work on proper dictionary support as a follow on PR for the following reasons:

I want to make sure that the Rust dictionary handling is correct, and for that I would like to use the existing flight integration tests, but they also need work to update them to the new client
I have one PR (and about to be more than one) queued up waiting on this PR

In order to minimize the API churn, however, I plan to hold this PR until after we have released 30.0.0 (#3336) so we can minimize the chance of releasing the dictionary hydration code

ursabot · 2022-12-31T13:01:51Z

Benchmark runs are scheduled for baseline = 9398af6 and contender = dc09b0b. dc09b0b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

alamb · 2022-12-31T13:20:05Z

The follow on PR is ready: #3402

Implement RecordBatch <--> FlightData encode/decode + tests

da3fca6

github-actions bot added arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate labels Dec 25, 2022

fix comment

ad3c1fe

viirya reviewed Dec 26, 2022

alamb and others added 7 commits December 26, 2022 09:09

Update arrow-flight/src/encode.rs

feeedf0

Co-authored-by: Liang-Chi Hsieh <[email protected]>

Add test encoding error

4671454

Add test for chained streams

69f32a9

Merge branch 'alamb/flight_data_transfer' of github.com:alamb/arrow-r…

3ba9de6

…s into alamb/flight_data_transfer

Add mismatched schema and data test

2866191

Add new test

a6e61b1

more tests

2249871

alamb requested a review from tustvold December 27, 2022 15:50

This was referenced Dec 27, 2022

Rewrite FlightSQL client in terms of the mid level client (WIP) #3401

Closed

Complete mid-level FlightClient #3402

Merged

tustvold approved these changes Dec 28, 2022

viirya reviewed Dec 29, 2022

viirya approved these changes Dec 29, 2022

alamb and others added 4 commits December 29, 2022 09:37

Apply suggestions from code review

4797c83

Co-authored-by: Liang-Chi Hsieh <[email protected]> Co-authored-by: Raphael Taylor-Davies <[email protected]>

Add From ArrowError impl for FlightError

c5402ff

Correct make_dictionary_batch and add tests

694265d

do not take

ed1f85c

Make dictionary massaging non pub

2267d13

alamb mentioned this pull request Dec 29, 2022

ArrayDataget_slice_memory_size or similar #3407

Closed

alamb added 4 commits December 29, 2022 09:13

Add comment about memory size and make split function non pub

a6ba713

explicitly return early from encode stream

4f30ab7

fix doc link

ba5e698

Merge remote-tracking branch 'apache/master' into alamb/flight_data_t…

77ac622

…ransfer

alamb merged commit dc09b0b into apache:master Dec 31, 2022

alamb deleted the alamb/flight_data_transfer branch December 31, 2022 12:58
