ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector #7514
romainfrancois wants to merge 22 commits into apache:master
Conversation
fsaintjacques
left a comment
Looks fine to me, but I think it would be good for 1.0 to also support the other binary types, at least for Arrow -> R. I'm not marking this as "request changes", but it's a strong suggestion.
r/src/array_to_vector.cpp
Since this is in your mental cache, we can accept FIXED_BINARY and LARGE_BINARY with minimal effort by using GetView instead (plus some template dispatching, because they don't share the same base class). LARGE_BINARY will require bounds-checking each value against INT32_MAX.
nealrichardson
left a comment
Thanks. Could you please also update the Arrow <-> R conversion tables in vignettes/arrow.Rmd to indicate this support?
r/src/array_from_vector.cpp
If I read this correctly, you're introducing general support for vctrs::list_of ptype here? IIUC then it would make sense for the ListArray to R vector conversion to set that class and attribute too then, right? (If so, feel free to add that here or defer to a followup)
Oh yeah, that makes sense. This only looks at the ptype attribute, not specifically whether it is a vctrs_list_of, but on the Arrow -> R conversion I think it does not hurt to make it a vctrs_list_of.
Done. I had to modify the roundtrip checks to use expect_equivalent() because a roundtrip might add information:
list() -> List Array -> list_of( ptype = <from list array value type>)
r/src/array_to_vector.cpp
wesm
left a comment
Seems like this file is going to need to get the BitBlockCounter treatment soon
@wesm I don't know what you mean by the BitBlockCounter treatment.
Also deals with LargeBinary now, e.g. https://issues.apache.org/jira/browse/ARROW-6543

library(arrow, warn.conflicts = FALSE)
a <- Array$create(list(raw()), type = large_binary())
a
#> Array
#> <large_binary>
#> [
#>
#> ]
a$as_vector()
#> <list_of<raw>[1]>
#> [[1]]
#> raw(0)

Created on 2020-06-26 by the reprex package (v0.3.0.9001)
Ah sorry, welcome back =) We created some facilities to improve the performance of processing validity bitmaps. The idea is that you can scan forward 64 bits at a time and determine efficiently (using hardware popcount) whether a "block" of 64 values has any nulls (or whether the block is all null). For those 64 values you can then skip null checking altogether and use the fast path. This speeds up processing significantly for data that is mostly non-null or mostly null.
Added support for LargeList:

library(arrow, warn.conflicts = FALSE)
a <- Array$create(list(integer()), type = large_list_of(int32()))
a
#> LargeListArray
#> <large_list<item: int32>>
#> [
#>   []
#> ]
a$as_vector()
#> <list_of<integer>[1]>
#> [[1]]
#> integer(0)

Created on 2020-06-26 by the reprex package (v0.3.0.9001)
nealrichardson
left a comment
Looks pretty good, just some suggestions on the tests. Thanks for adding the Large* types. There are still FixedSizeBinary and FixedSizeList to do, but we can defer those to a followup.
Please do also fill in the vignette table now that these types are supported.
Force-pushed 299a801 to cb0721b
r/src/array_from_vector.cpp
This new builder is breaking the (new) UTF-8 tests. The previous converter code is https://github.com/apache/arrow/blob/master/r/src/array_from_vector.cpp#L147-L171 and it is apparently no longer being called.
I wonder if this whole code block isn't possible right now as is. The "Reserve enough space before appending" block would also need to convert to UTF-8 in order to get the size right, and I wonder if converting/asserting everything to UTF-8 twice would outweigh the benefits of Reserving space. Or maybe we can take the size as is and overcommit bytes?
Tangentially related, would a "reserve" check like this be the way to solve https://issues.apache.org/jira/browse/ARROW-3308, where we need to switch to Large types if there's more than 2GB?
I'll have a look once I'm back at this. Overall there seem to be two concurrent systems for no real reason, and I believe we should only keep the one powered by VectorToArrayConverter.
I'm on my dplyr week now, but I'll still try to make some space for this.
I fixed this; will deal with efficiency/redundancy in a followup.
Added a note to https://issues.apache.org/jira/browse/ARROW-7798 about the redundancy. I checked code coverage and it appears that both paths are being called, so that should be addressed (my guess is that factor levels are still going through the old path).
…he value_type() of the list.
Co-authored-by: Neal Richardson <[email protected]>
Force-pushed cb0721b to ff10327
Going with a list of raw vectors because R strings can't hold embedded nulls:
Created on 2020-06-22 by the reprex package (v0.3.0.9001)