ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector #7514
romainfrancois wants to merge 22 commits into apache:master
Conversation
fsaintjacques
left a comment
Looks fine to me, but I think it would be good for 1.0 to also support the other binary types, at least for Arrow -> R. I'm not marking this as "request changes", but it's a strong suggestion.
r/src/array_to_vector.cpp
Since this is in your mental cache, we can accept FIXED_BINARY and LARGE_BINARY with minimal effort by using GetView instead (plus some template dispatching, because they don't share the same base class). LARGE_BINARY will require bounds-checking each value against INT32_MAX.
nealrichardson
left a comment
Thanks. Could you please also update the Arrow <-> R conversion tables in vignettes/arrow.Rmd to indicate this support?
r/src/array_from_vector.cpp
If I read this correctly, you're introducing general support for vctrs::list_of ptype here? IIUC then it would make sense for the ListArray to R vector conversion to set that class and attribute too then, right? (If so, feel free to add that here or defer to a followup)
Oh yeah, that makes sense. This only looks at the ptype attribute, not specifically whether it is a vctrs_list_of, but on the Arrow -> R conversion I think it does not hurt to make it a vctrs_list_of.
Done. I had to modify the roundtrip checks to use expect_equivalent() because a roundtrip might add information:
list() -> List Array -> list_of( ptype = <from list array value type>)
r/src/array_to_vector.cpp
wesm
left a comment
Seems like this file is going to need to get the BitBlockCounter treatment soon
@wesm I don't know what you mean by the BitBlockCounter treatment.
Also deals with LargeBinary now, e.g. https://issues.apache.org/jira/browse/ARROW-6543

library(arrow, warn.conflicts = FALSE)
a <- Array$create(list(raw()), type = large_binary())
a
#> Array
#> <large_binary>
#> [
#>
#> ]
a$as_vector()
#> <list_of<raw>[1]>
#> [[1]]
#> raw(0)

Created on 2020-06-26 by the reprex package (v0.3.0.9001)
Ah sorry, welcome back =) We created some facilities to improve the performance of processing validity bitmaps. The idea is that you can scan forward 64 bits at a time and determine efficiently (using hardware popcount) whether a "block" of 64 values has any nulls (or whether the block is all null). For those 64 values you can then skip null checking altogether and use the fast path. This speeds up processing significantly for data that is mostly non-null or mostly null.
Added support for LargeList:

library(arrow, warn.conflicts = FALSE)
a <- Array$create(list(integer()), type = large_list_of(int32()))
a
#> LargeListArray
#> <large_list<item: int32>>
#> [
#>   []
#> ]
a$as_vector()
#> <list_of<integer>[1]>
#> [[1]]
#> integer(0)

Created on 2020-06-26 by the reprex package (v0.3.0.9001)
nealrichardson
left a comment
Looks pretty good, just some suggestions on the tests. Thanks for adding the Large* types. There are still FixedSizeBinary and FixedSizeList to do, but we can defer those to a followup.
Please do also fill in the vignette table now that these types are supported.
Force-pushed 299a801 to cb0721b
r/src/array_from_vector.cpp
This new builder is breaking the (new) UTF-8 tests. The previous converter code is https://github.com/apache/arrow/blob/master/r/src/array_from_vector.cpp#L147-L171 and it is apparently no longer being called.
I wonder if this whole code block isn't possible right now as is. The "Reserve enough space before appending" block would also need to convert to UTF-8 in order to get the size right, and I wonder if converting/asserting everything to UTF-8 twice would outweigh the benefits of Reserving space. Or maybe we can take the size as is and overcommit bytes?
Tangentially related, would a "reserve" check like this be the way to solve https://issues.apache.org/jira/browse/ARROW-3308, where we need to switch to Large types if there's more than 2GB?
I'll have a look once I'm back at this. Overall there seem to be two concurrent systems for no real reason, and I believe we should only keep the one powered by VectorToArrayConverter.
I'm on my dplyr week now, but I'll still try to make some space for this.
I fixed this; will deal with efficiency/redundancy in a followup.
Added a note to https://issues.apache.org/jira/browse/ARROW-7798 about the redundancy. I checked code coverage and it appears that both paths are being called, so that should be addressed (my guess is that factor levels are still going through the old path).
…he value_type() of the list.
Co-authored-by: Neal Richardson <[email protected]>
Force-pushed cb0721b to ff10327
Going with a list of raw vectors because R strings can't hold embedded nulls:
Created on 2020-06-22 by the reprex package (v0.3.0.9001)