Add hooks to json encoder to override default encoding or add support for unsupported types #7015
Conversation
TODO: add a test showing how to change the encoding of binary arrays.
cc @tustvold
Done
arrow-json/src/writer/encoder.rs
Should this be Option<&'a NullBuffer>?
It probably could be for consistency, but it isn't likely to have any material performance impact
tustvold left a comment
I had a review, and I think it is probably fine to expose Encoder; it is a relatively straightforward trait that I don't see changing. It is the reader side that I'd be much more hesitant about altering, as that is quite coupled with the parsing machinery.
That being said there is one slight wrinkle, as documented here
The implementation of Encoder for ArrayFormatter assumes it does not produce
characters that would need to be escaped within a JSON string, e.g. '"'.
If support for user-provided format specifications is added, this assumption
may need to be revisited.
I think we need to do the following:
- Remove the impl of Encoder for ArrayFormatter in favour of a private newtype wrapper
- Clearly document on Encoder the expectation that it generates valid escaped JSON
On a related note, the return type of (Box<dyn Encoder + 'a>, Option<NullBuffer>) is a bit of a mouthful, and honestly makes for a bit of a funky API, where the semantics of an Encoder depend on a different data structure returned at construction time.
I wonder if a better API might be something like:

pub trait Encoder {
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>);
    fn nulls(&self) -> Option<NullBuffer>;
}
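For illustration, a minimal sketch of what an implementation of that shape could look like for a fixed primitive type (the Int64Encoder name and layout here are hypothetical, not code from the PR):

use arrow_buffer::NullBuffer;

// Hypothetical encoder, for illustration of the proposed trait only.
struct Int64Encoder<'a> {
    values: &'a [i64],
    nulls: Option<NullBuffer>,
}

impl Encoder for Int64Encoder<'_> {
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
        // JSON numbers need no escaping, so the digits can be written directly.
        out.extend_from_slice(self.values[idx].to_string().as_bytes());
    }

    fn nulls(&self) -> Option<NullBuffer> {
        self.nulls.clone()
    }
}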
I also left a comment on potentially simplifying the factory down to a single method.
arrow-json/src/writer/encoder.rs
Why do we pass both DataType and Array?
We don't need to; it just simplifies the implementation of the EncoderFactory a bit, because almost surely all of them are going to call array.data_type(). I can remove it if you'd prefer.
arrow-json/src/writer/mod.rs
?
upsies!
arrow-json/src/writer/mod.rs
Can we use AsArray instead of manual downcasts? It is much easier to read.
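For readers unfamiliar with it, a small sketch of the difference (the function is illustrative; AsArray is arrow's cast helper trait):

use arrow_array::{cast::AsArray, types::Int32Type, Array, Int32Array};

fn first_value(array: &dyn Array) -> Option<i32> {
    // Manual downcast: verbose, and the target type is easy to get wrong.
    let _manual = array.as_any().downcast_ref::<Int32Array>().unwrap();

    // AsArray: the same cast, but shorter and clearer at the call site.
    let typed = array.as_primitive::<Int32Type>();
    typed.iter().next().flatten()
}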
arrow-json/src/writer/encoder.rs
Why do we need this? AFAICT we could remove this with no real loss of functionality; if anything it would be more resilient, as adding a new default encoder wouldn't break existing overrides.
I'll drop it in favor of a single method; I think it should work for our use case.
Thank you for the review. I've pushed a (broken) version with most comments addressed, aside from the part about …
Okay, I think I understand now. Looking at the implementation, it seems to me that it wouldn't be that much extra work to add formatters for Date, Timestamp, Time and Decimal; then we could ditch ArrayFormatter for JSON encoding altogether. I can do that as a prequel PR. Wdyt?
A private newtype wrapper would avoid a non-trivial amount of code duplication whilst resolving the issue, and would be my preferred approach.
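A rough sketch of that direction (the RawArrayFormatter name is hypothetical): the blanket impl is replaced by a private wrapper, so the no-escaping assumption becomes an internal invariant rather than part of the public contract:

use std::io::Write;

// Hypothetical private newtype: only constructed internally, and only for
// types whose formatted output never contains characters needing JSON escaping.
struct RawArrayFormatter<'a>(ArrayFormatter<'a>);

impl Encoder for RawArrayFormatter<'_> {
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
        // Numbers, dates, times etc. produce no '"' or '\', so the value
        // can be written without escaping.
        let _ = write!(out, "{}", self.0.value(idx));
    }

    fn nulls(&self) -> Option<NullBuffer> {
        // Nulls are handled by the writer; the formatter itself exposes none.
        None
    }
}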
Makes sense, done! There is quite a bit of code churn from changing the …
arrow-json/src/writer/encoder.rs
I'd be interested in using the changes in this PR to write GeoJSON from geoarrow-rs.
However, this API would not be sufficient for me, because it assumes that the physical Array is enough to know how to encode the data. That is not true for geospatial data (at least for Arrow data following the GeoArrow specification), because the same physical layout can describe multiple logical types.
E.g. an array of LineString and an array of MultiPoint would both be stored as an Arrow List[FixedSizeList[2, Float64]], but the extension metadata on the Field would be necessary to know whether to write "type": "MultiPoint" or "type": "LineString" in each JSON object.
Given that the existing JSON Writer API writes a RecordBatch, it should be possible to access the Field and pass that through here, instead of just using the Array?
Great point, yes, I think that should be possible 😄
@kylebarron will this work for you?
I believe so! As long as we can access the field, that should be enough!
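To make that concrete, here is a hedged sketch of a factory hook keyed on the field's extension metadata (the GeoJsonEncoderFactory type and make_encoder signature are assumptions drawn from this discussion, not necessarily the merged API):

use std::sync::Arc;
use arrow_array::Array;
use arrow_schema::{ArrowError, Field};

struct GeoJsonEncoderFactory;

impl GeoJsonEncoderFactory {
    // Hypothetical hook: Some(encoder) overrides the default encoding,
    // None falls through to the built-in encoders.
    fn make_encoder<'a>(
        &self,
        field: &'a Arc<Field>,
        _array: &'a dyn Array,
    ) -> Result<Option<Box<dyn Encoder + 'a>>, ArrowError> {
        // LineString and MultiPoint share the physical layout
        // List[FixedSizeList[2, Float64]]; only the extension name differs.
        match field.metadata().get("ARROW:extension:name").map(String::as_str) {
            Some("geoarrow.linestring") => todo!("build a LineString encoder"),
            Some("geoarrow.multipoint") => todo!("build a MultiPoint encoder"),
            _ => Ok(None),
        }
    }
}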
Hi @tustvold, sorry to ping you. Given the use cases beyond our JSON serialization, would you mind taking another look at this PR when you have some time? Thanks!
I'm a bit swamped at the moment, but I'll try to take a look this weekend.
0f032fe to 19ed4ab (force-push)
I've started trying to replace our vendored version with this + a hook to handle the JSON union, and have a couple of notes: …
19ed4ab to 6276df6 (force-push)
arrow-json/src/writer/encoder.rs
I suspect this change will have major performance implications, as you're now adding a per-value dyn dispatch in order to check nullability
Makes sense. I guess it does need to be dynamic, but I take it you'd rather have a method that returns a NullBuffer the length of the array in one go?
I guess I'm not seeing why it needs to be changed, this wasn't necessary before...
Hard to remember now, but I think the primary motivation was to encapsulate the (Encoder, NullBuffer) pair into the Encoder to make the whole API easier to work with.
I've now collapsed these into a single method that returns an Option<NullBuffer>, which can be hoisted out of loops. I think that should be equivalent to the old implementation, but with a better API for implementers.
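A simplified sketch of the hoisting, assuming the trait shape proposed earlier in this thread (delimiters between values omitted for brevity):

fn write_column(encoder: &mut dyn Encoder, len: usize, out: &mut Vec<u8>) {
    // One dynamic call up front, then a tight per-row loop.
    let nulls = encoder.nulls();
    for idx in 0..len {
        match &nulls {
            Some(n) if n.is_null(idx) => out.extend_from_slice(b"null"),
            _ => encoder.encode(idx, out),
        }
    }
}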
tustvold left a comment
Sorry for the delay, I have been (and still am) really swamped. I think this is getting closer to something we can merge, although the changes to null handling concern me, and I'd prefer to revert these if possible.
Thanks for reviewing! I'll try to work out how to do the null handling differently.
arrow-json/src/writer/encoder.rs
We're still doing this dynamic dispatch to get null information on every value write... Unless we can show with benchmarks that it doesn't regress performance, I'm lukewarm on this.
To be clear though, it currently does do dynamic dispatch for every non-null value write, right?
Let's just run cargo bench --bench json_writer and see what the numbers say?
I had run it, and there was a measurable (~30%) slowdown. This makes sense: it was doing 2 dynamic dispatch calls for every row instead of just 1.
I've now refactored this to avoid the dynamic dispatch altogether by encapsulating the dynamic encoder + null buffer in a helper struct. So it's basically the same as before, except that things get passed around as a struct instead of a tuple of unrelated values. I don't love the name EncoderWithNullBuffer; open to suggestions.
Benchmarks now show no difference vs. main.
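A hedged sketch of that helper shape (simplified; the field names are guesses from the description, not the PR's exact code):

use arrow_buffer::NullBuffer;

// Pairs the boxed encoder with its null buffer so callers check nullability
// directly on the struct, with no extra dynamic dispatch per value.
struct EncoderWithNullBuffer<'a> {
    encoder: Box<dyn Encoder + 'a>,
    nulls: Option<NullBuffer>,
}

impl<'a> EncoderWithNullBuffer<'a> {
    #[inline]
    fn is_null(&self, idx: usize) -> bool {
        self.nulls.as_ref().is_some_and(|n| n.is_null(idx))
    }

    fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
        self.encoder.encode(idx, out);
    }
}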
❯ git checkout main && cargo bench --bench json_writer && git checkout arrow-union && cargo bench --bench json_writer
Switched to branch 'main'
Your branch is ahead of 'origin/main' by 6 commits.
(use "git push" to publish your local commits)
Compiling arrow-json v54.2.1 (/Users/adriangb/GitHub/arrow-rs/arrow-json)
Compiling arrow v54.2.1 (/Users/adriangb/GitHub/arrow-rs/arrow)
Finished `bench` profile [optimized] target(s) in 3.03s
Running benches/json_writer.rs (target/release/deps/json_writer-8d99cf5d0d08efba)
Gnuplot not found, using plotters backend
bench_integer time: [2.9239 ms 2.9488 ms 2.9810 ms]
change: [-1.4155% -0.3843% +0.7943%] (p = 0.52 > 0.05)
No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
bench_float time: [3.4054 ms 3.4216 ms 3.4394 ms]
change: [-3.7264% -2.9504% -2.1609%] (p = 0.00 < 0.05)
Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
8 (8.00%) high mild
7 (7.00%) high severe
bench_string time: [8.8852 ms 8.9466 ms 9.0106 ms]
change: [-1.3959% -0.5660% +0.3531%] (p = 0.21 > 0.05)
No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
bench_mixed time: [11.975 ms 12.030 ms 12.087 ms]
change: [-3.7701% -3.0819% -2.4763%] (p = 0.00 < 0.05)
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
5 (5.00%) high mild
bench_dict_array time: [4.6221 ms 4.6771 ms 4.7345 ms]
change: [-2.4806% -0.5873% +1.2096%] (p = 0.53 > 0.05)
No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
bench_struct time: [19.521 ms 19.623 ms 19.731 ms]
change: [-0.9078% +0.1000% +0.9733%] (p = 0.84 > 0.05)
No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
bench_nullable_struct time: [9.4059 ms 9.4565 ms 9.5092 ms]
change: [-3.5737% -2.7898% -1.9920%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
bench_list time: [27.854 ms 27.976 ms 28.118 ms]
change: [-0.8182% -0.3309% +0.1897%] (p = 0.22 > 0.05)
No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
7 (7.00%) high mild
3 (3.00%) high severe
bench_nullable_list time: [5.9283 ms 5.9849 ms 6.0429 ms]
change: [-8.7315% -6.8046% -5.0511%] (p = 0.00 < 0.05)
Performance has improved.
bench_struct_list time: [2.9264 ms 2.9694 ms 3.0130 ms]
change: [-5.8590% -3.8633% -1.8735%] (p = 0.00 < 0.05)
Performance has improved.
Switched to branch 'arrow-union'
Your branch is up to date with 'origin/arrow-union'.
Compiling arrow-json v54.2.1 (/Users/adriangb/GitHub/arrow-rs/arrow-json)
Compiling arrow v54.2.1 (/Users/adriangb/GitHub/arrow-rs/arrow)
Finished `bench` profile [optimized] target(s) in 3.03s
Running benches/json_writer.rs (target/release/deps/json_writer-8d99cf5d0d08efba)
Gnuplot not found, using plotters backend
bench_integer time: [2.9105 ms 2.9313 ms 2.9542 ms]
change: [-1.8789% -0.5924% +0.5939%] (p = 0.36 > 0.05)
No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
1 (1.00%) high mild
2 (2.00%) high severe
bench_float time: [3.3695 ms 3.3854 ms 3.4041 ms]
change: [-1.7458% -1.0557% -0.3581%] (p = 0.00 < 0.05)
Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
5 (5.00%) high mild
1 (1.00%) high severe
bench_string time: [8.7954 ms 8.8473 ms 8.9044 ms]
change: [-2.0603% -1.1094% -0.2183%] (p = 0.02 < 0.05)
Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe
bench_mixed time: [12.034 ms 12.099 ms 12.170 ms]
change: [-0.1335% +0.5726% +1.3337%] (p = 0.13 > 0.05)
No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
bench_dict_array time: [4.4275 ms 4.4639 ms 4.5020 ms]
change: [-5.9236% -4.5580% -3.1699%] (p = 0.00 < 0.05)
Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
bench_struct time: [19.450 ms 19.538 ms 19.640 ms]
change: [-1.1211% -0.4330% +0.3004%] (p = 0.25 > 0.05)
No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) high mild
3 (3.00%) high severe
bench_nullable_struct time: [9.4349 ms 9.4862 ms 9.5417 ms]
change: [-0.4570% +0.3147% +1.0725%] (p = 0.43 > 0.05)
No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
bench_list time: [27.799 ms 27.901 ms 28.015 ms]
change: [-0.8950% -0.2687% +0.2907%] (p = 0.40 > 0.05)
No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
bench_nullable_list time: [5.9567 ms 6.0222 ms 6.0886 ms]
change: [-0.9060% +0.6229% +2.1762%] (p = 0.41 > 0.05)
No change in performance detected.
bench_struct_list time: [2.8755 ms 2.9589 ms 3.0925 ms]
change: [-3.6152% -0.3533% +4.3395%] (p = 0.89 > 0.05)
No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
tustvold left a comment
I think the changes make the null handling worse 😅
The API was designed the way it was for a reason; I think if we want to change it, we need to justify doing so empirically.
/// Note that the type of the field may not match the type of the array: for dictionary arrays unless the top-level dictionary is handled this
/// will be called again for the keys and values of the dictionary, at which point the field type will still be the outer dictionary type but the
This is not a problem for structs or lists, because each sub-array has an associated Field.
But for dictionaries there are only inner types, not fields.
We could create a new Field::new("values", *values_type.clone(), field.is_nullable()), but I'm not 100% sure that's right, and you lose access to metadata on the dictionary field itself.
I am not sure either
Given the whole reason for propagating the field is for the metadata, I think it makes sense to accept there might be some inconsistency.
I pushed 60bcd18 and here's what I get from …
alamb left a comment
Thank you @adriangb
I also re-ran the benchmarks and they look good to me now
group arrow-union main
----- ----------- ----
bench_dict_array 1.00 8.5±0.14ms ? ?/sec 1.00 8.6±0.21ms ? ?/sec
bench_float 1.06 6.5±0.09ms ? ?/sec 1.00 6.2±0.05ms ? ?/sec
bench_integer 1.08 6.0±0.08ms ? ?/sec 1.00 5.5±0.06ms ? ?/sec
bench_list 1.00 84.8±0.69ms ? ?/sec 1.00 84.9±0.71ms ? ?/sec
bench_mixed 1.00 32.9±1.13ms ? ?/sec 1.03 34.0±0.32ms ? ?/sec
bench_nullable_list 1.03 13.4±0.30ms ? ?/sec 1.00 13.0±0.27ms ? ?/sec
bench_nullable_struct 1.00 19.8±0.23ms ? ?/sec 1.07 21.2±0.76ms ? ?/sec
bench_string 1.00 15.9±0.20ms ? ?/sec 1.04 16.6±0.09ms ? ?/sec
bench_struct 1.01 55.5±0.60ms ? ?/sec 1.00 55.0±0.41ms ? ?/sec
bench_struct_list 1.06 11.4±0.53ms ? ?/sec 1.00 10.8±1.15ms ? ?/sec
tustvold left a comment
I've had a quick look through, and this looks good to me other than some minor nits I flagged. Thank you for sticking with this, and sorry it has taken so long to get approved.
Thank you for the careful reviews and patience! Two approvals; shall we merge?
As this is a breaking change, it will need to wait until main opens up for such changes, which I believe should be after the current release process concludes.
Yes, I plan to complete the current release tomorrow or Friday. Then I'll try and merge whatever is ready to go into main.
I merged up from main and plan to merge this PR when CI passes.
We essentially had to vendor this module to implement this. I know this has been discussed in the past, and there was pushback against exposing Encoder to the public API, but since I had to go through the work of updating our vendored version, I thought I'd at least try to make a PR to upstream this, to make our lives easier in the future and bring value back to the community.

In my opinion, exposing this is a rather small addition to the public API, and any breaking changes in the future will be restricted to the JSON encoder, which is not a core part of the project. It seems worth it and addresses various requests, e.g. how to encode a binary array (a base64-encoded string and an array of ints are both common). It has also been a stable API for ~1 year now.
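As an illustration of the binary-array request mentioned above, a hedged sketch of a custom encoder that writes binary values as base64 strings (the Base64Encoder type is hypothetical; the trait shape follows the sketch earlier in this thread, and the base64 crate's STANDARD engine is used):

use arrow_array::{Array, BinaryArray};
use arrow_buffer::NullBuffer;
use base64::{engine::general_purpose::STANDARD, Engine as _};

// Hypothetical override: encodes each binary value as a base64 JSON string.
struct Base64Encoder<'a> {
    array: &'a BinaryArray,
}

impl Encoder for Base64Encoder<'_> {
    fn encode(&mut self, idx: usize, out: &mut Vec<u8>) {
        // Base64 output contains only alphanumerics plus '+', '/' and '=',
        // so no JSON string escaping is needed inside the quotes.
        out.push(b'"');
        out.extend_from_slice(STANDARD.encode(self.array.value(idx)).as_bytes());
        out.push(b'"');
    }

    fn nulls(&self) -> Option<NullBuffer> {
        self.array.nulls().cloned()
    }
}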