Support binary data type in build_struct_array.#702
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #702 +/- ##
=======================================
Coverage 82.45% 82.46%
=======================================
Files 168 168
Lines 47397 47419 +22
=======================================
+ Hits 39083 39103 +20
- Misses 8314 8316 +2 ☔ View full report in Codecov by Sentry. |
jorgecarleitao
left a comment
There was a problem hiding this comment.
Thanks!
Out of curiosity, isn't json always valid utf8?
| DataType::Struct(vec![ | ||
| Field::new("key", DataType::Utf8, false), | ||
| Field::new("key", DataType::Int32, true), | ||
| Field::new("value", DataType::Int32, true), |
There was a problem hiding this comment.
An oversight on my end, should have been "value"
There was a problem hiding this comment.
Haha, learning from previous PR and found this typo ;)
| DataType::Struct(vec![ | ||
| Field::new("key", DataType::Utf8, false), | ||
| Field::new("key", DataType::Int32, true), | ||
| Field::new("value", DataType::Int32, true), |
There was a problem hiding this comment.
An oversight on my end, should have been "value"
| rows.iter() | ||
| .map(|row| { | ||
| let maybe_value = row.get(field.name()); | ||
| maybe_value.and_then(|value| value.as_str()) |
There was a problem hiding this comment.
I thought value.as_str() would return an error if the binary data can't be converted to UTF8, but then we probably wouldn't be able to read the file in that case.
Just an observation, no action needed
Yeah, I've only ever base64 encoded to a string, so I wonder if there'd be some data loss if we don't do some conversion to a string representation |
|
I pushed a change to fix a |
|
Thank you all! Feeling excited to become a new contributor. |
|
Looks great -- thank you again @zijie0 -- I'll plan to include this as part of arrow 5.3.0 (eta release in about 10 days time) |
* Support binary data type in `build_struct_array`. * Modify test case. * cargo fmt Co-authored-by: Andrew Lamb <[email protected]>
I think writer needs to take care of the encoding and the reader needs to decode accordingly. For example encoding arbitrary binary into base64 is one way to do this. Another way is to encode binary as 2 bytes unicode code point. Whatever the encoding/decoding strategy gets picked, the writer is responsible to encode the binary into valid utf8 string. So it's really just a convention that the writer and reader needs to follow. |
* Support binary data type in `build_struct_array`. (#702) * Support binary data type in `build_struct_array`. * Modify test case. * cargo fmt Co-authored-by: Andrew Lamb <[email protected]> * Remove accidentally included test Co-authored-by: Yuan Zhou <[email protected]>
Which issue does this PR close?
Closes #701
Rationale for this change
We encounter this issue in delta-rs project where
partitionValuesin delta spec could contain binary type values.What changes are included in this PR?
Add binary type support and corresponding test case.
Are there any user-facing changes?
No.